Kubernetes Deployment
Overview
With infrastructure and cluster access ready, you now deploy the Confident AI application services. This step covers:
- Preparing Kubernetes manifests from the confident-k8s repository
- Updating container image tags to the versions provided by Confident AI
- Configuring the ingress with your domain names and ACM certificate ARN
- Setting up External Secrets to sync credentials from AWS Secrets Manager
- Deploying services in the correct order: base config → Redis → application services
- Monitoring deployment status and troubleshooting common issues
After completion, all Confident AI services will be running in your EKS cluster.
How Kubernetes deployments work
Kubernetes uses YAML manifests to describe what you want to run. A Deployment tells Kubernetes:
- What container image to use
- How many replicas (copies) to run
- What environment variables and secrets to inject
- Resource limits (CPU, memory)
When you kubectl apply a manifest, Kubernetes:
- Reads your desired state
- Compares it to current state
- Creates, updates, or deletes resources to match
- Continuously ensures the desired state is maintained
Repository structure
The confident-k8s repository contains Kubernetes manifests organized by service:
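An illustrative layout, assembled from the paths referenced later in this guide (your checkout may contain additional service directories):

```
confident-k8s/
├── base/
│   ├── common/
│   │   └── external-secrets/     # SecretStore + ExternalSecret manifests
│   └── network/
│       └── ingress.yaml          # ALB ingress configuration
├── confident-backend/
│   └── deployment.yaml
└── ...                           # one directory per application service
```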
Prepare manifests
Navigate to the Kubernetes repository:
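Assuming you have already cloned the repository (the clone URL and access details come from your Confident AI representative):

```bash
# Work from the root of your confident-k8s checkout
cd confident-k8s
```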
Update image tags
Container images are stored in Confident AI’s ECR. Your Confident AI representative will provide the specific version tags to use.
Open each deployment file and update the image field:
Example change in confident-backend/deployment.yaml:
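A sketch of the edit, assuming a conventional Deployment layout — the container name and surrounding fields are illustrative; only the image line is the actual change:

```yaml
spec:
  template:
    spec:
      containers:
        - name: confident-backend
          # Replace the tag with the exact version your representative provides:
          image: <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/confident-backend:v1.2.3
```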
Image tag format: <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/<service>:<version>
The ECR account ID and region are the values provided by Confident AI for ECR access. The version tag (e.g., v1.2.3) is what your representative will provide.
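As a quick sanity check, the full image reference can be assembled from its parts. The values below are placeholders, not real account details — substitute the ones you were given:

```bash
ECR_ACCOUNT="123456789012"   # placeholder: the account ID provided for ECR access
REGION="us-east-1"           # placeholder: the region provided for ECR access
SERVICE="confident-backend"  # placeholder: the service directory name
VERSION="v1.2.3"             # placeholder: the exact tag from your representative

# Print the assembled image reference to paste into the deployment manifest
echo "${ECR_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/confidentai/${SERVICE}:${VERSION}"
```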
Use exact tags, not “latest.” Always use specific version tags (e.g., v1.2.3), not latest. Specific tags ensure reproducible deployments and make rollbacks possible. Using latest can cause unexpected behavior when images are updated.
Configure ingress
The ingress defines how external traffic reaches your services. Edit base/network/ingress.yaml:
Add the ACM certificate ARN
Find the certificate-arn annotation and add your certificate ARN (from the SSL Certificates step):
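A sketch of the annotation, assuming the standard AWS Load Balancer Controller annotation key; the ARN shown is a placeholder for your own value:

```yaml
metadata:
  annotations:
    # Value comes from: terraform output -raw acm_certificate_arn
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<region>:<account-id>:certificate/<certificate-id>
```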
The certificate ARN must be exact. Copy it carefully from terraform output -raw acm_certificate_arn. Don’t include any trailing % characters your terminal might display.
Set ALB scheme
The scheme annotation controls whether the load balancer is internal or internet-facing:
Internal vs. internet-facing:
- Internal: ALB only gets a private IP. Users must be on VPN or connected network to access. More secure for enterprise deployments.
- Internet-facing: ALB gets a public IP. Anyone with the URL can reach the login page (authentication still required). Simpler but larger attack surface.
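A sketch of the annotation, assuming the standard AWS Load Balancer Controller annotation key:

```yaml
metadata:
  annotations:
    # "internal" for VPN/private-network access only,
    # "internet-facing" for a publicly reachable ALB
    alb.ingress.kubernetes.io/scheme: internal
```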
Configure External Secrets
External Secrets syncs credentials from AWS Secrets Manager into Kubernetes. The configuration needs to match your deployment.
Update the secret store region
Edit base/common/external-secrets/secret-store.yaml:
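A sketch of the relevant section, assuming the External Secrets Operator v1beta1 API shape; the store name is illustrative, and only the region field is the actual change:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: <secret-store-name>   # keep whatever name the repository uses
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1       # set to the AWS region you deployed in
```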
Update the secret name
Edit base/common/external-secrets/external-secrets.yaml:
The key field in each secret reference should match the Terraform-created secret name. The pattern is:
For example, if you deployed with confident_environment = "stage":
Secret name must exactly match. If the name doesn’t match what Terraform created, External Secrets won’t find it and pods won’t start. You can verify the secret name in the AWS Secrets Manager console or via:

```bash
aws secretsmanager list-secrets --query 'SecretList[*].Name'
```
Deploy services
Deploy in order to ensure dependencies are available when needed.
Apply network and secrets configuration
The namespace already exists (Terraform created it), but apply the network and secrets configuration:
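Assuming the directory layout referenced earlier, something like:

```bash
# Apply ingress/network configuration, then External Secrets configuration,
# so dependent services find them when they start
kubectl apply -f base/network/
kubectl apply -f base/common/external-secrets/
```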
Wait for secrets to sync
External Secrets needs to pull credentials from Secrets Manager before pods can start. Watch the sync status:
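For example (replace <namespace> with the namespace Terraform created):

```bash
# Watch ExternalSecret resources until they report SecretSynced
kubectl get externalsecrets -n <namespace> -w
```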
Wait until STATUS shows SecretSynced:
Press Ctrl+C to stop watching once synced.
Verify the Kubernetes secret was created:
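Replace <namespace> with the namespace Terraform created:

```bash
# The synced Kubernetes secret should appear in this list
kubectl get secrets -n <namespace>
```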
Don’t proceed until secrets are synced. Pods reference this secret for database URLs, API keys, and other credentials. If you deploy before sync completes, pods will fail to start with “secret not found” errors.
Monitor deployment
Watch pod status
All pods should eventually reach Running status with all containers ready:
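Replace <namespace> with the namespace Terraform created:

```bash
# Watch pods until STATUS shows Running and READY shows all containers up
kubectl get pods -n <namespace> -w
```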
Backend may restart once during initial deployment. The backend runs database migrations on startup. If the database isn’t ready immediately, it may fail once and then succeed on retry. One or two restarts is normal.
Check events
If pods aren’t starting, check recent events:
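Replace <namespace> with the namespace Terraform created:

```bash
# List recent events, newest last, to see scheduling and pull errors
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```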
This shows what Kubernetes is doing and any errors it encountered.
Common deployment issues
ImagePullBackOff
The cluster can’t pull the container image. Causes:
- ECR credentials not synced: The CronJob that refreshes ECR tokens hasn’t run yet
- Wrong image tag: The specified version doesn’t exist
- Network issues: Nodes can’t reach ECR
Fix: Manually trigger ECR credential sync:
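The CronJob name below is a placeholder — list CronJobs first to find the one that refreshes ECR tokens in your cluster:

```bash
# Find the ECR credential-refresh CronJob
kubectl get cronjobs -n <namespace>

# Run it once immediately as a one-off Job
kubectl create job --from=cronjob/<ecr-credential-cronjob> manual-ecr-sync -n <namespace>
```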
Wait 30 seconds, then check if pods start pulling images.
ECR tokens expire every 12 hours. The CronJob refreshes them automatically, but on initial deployment, you may need to trigger it manually. If you see ImagePullBackOff after the cluster has been running for a while, the CronJob may have failed—check its logs.
CrashLoopBackOff
The container starts but crashes. Check logs to see why:
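For example, with the failing pod’s name substituted in:

```bash
# --previous shows logs from the last crashed instance of the container
kubectl logs <pod-name> -n <namespace> --previous
```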
Common causes:
External Secrets not syncing
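Describing the ExternalSecret resource shows its sync status and any errors:

```bash
# Inspect the ExternalSecret's Status section for sync errors
kubectl describe externalsecret <name> -n <namespace>
```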
Look at the Status section for error messages. Common issues:
Ingress not creating ALB
After applying the ingress, check if an ALB is being created:
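Replace <namespace> with the namespace Terraform created:

```bash
# The ADDRESS column should fill in with an ALB hostname once provisioned
kubectl get ingress -n <namespace>
```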
The ADDRESS column should show an ALB hostname. If it’s empty after a few minutes:
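Describe the ingress to see the events the ALB Controller emitted for it:

```bash
# Events at the bottom of the output show why ALB provisioning failed
kubectl describe ingress <name> -n <namespace>
```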
Look for errors in the events. Common issues:
- Missing subnet tags: ALB Controller can’t find subnets tagged for ELB
- IAM permission errors: The ALB Controller service account lacks permissions
- Invalid certificate ARN: The specified ACM certificate doesn’t exist or isn’t issued
Scaling services
Once everything is running, you can scale based on load:
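The deployment name below is a placeholder — check `kubectl get deployments` for the exact name in your cluster:

```bash
# Increase the replica count of a service to handle more load
kubectl scale deployment/<confident-evals> --replicas=3 -n <namespace>
```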
The evals service is most resource-intensive during evaluation runs. If evaluations are slow or timing out, scaling this service usually helps most.
Next steps
After all services are running and healthy, proceed to Verification to test the deployment end-to-end.