With infrastructure and cluster access ready, you can now deploy the Confident AI application services. This step covers:
After completion, all Confident AI services will be running in your AKS cluster.
Kubernetes uses YAML manifests to describe what you want to run. A Deployment tells Kubernetes:
When you run kubectl apply on a manifest, Kubernetes:
The Kubernetes manifests are organized by service:
Container images are stored in Confident AI’s AWS ECR. Your Confident AI representative will provide the specific version tags to use.
Open each deployment file and update the image field:
Example change in confident-backend/deployment.yaml:
Image tag format: <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/<service>:<version>
The ECR account ID and region are the values provided by Confident AI for ECR access. The version tag (e.g., v1.2.3) is what your representative will provide.
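As a minimal sketch of the format above, the following shell helper assembles an image reference from its parts. The account ID, region, and service name in the usage line are placeholders for illustration; substitute the values Confident AI provides.

```shell
# Build a full ECR image reference from its parts.
# Format: <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/<service>:<version>
image_ref() {
  account="$1"; region="$2"; service="$3"; version="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/confidentai/%s:%s\n' \
    "$account" "$region" "$service" "$version"
}

# Placeholder values, not real account details.
image_ref 123456789012 us-east-1 confident-backend v1.2.3
```

The printed string is what would go into the image field of the matching deployment file.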
Use exact tags, not “latest.” Always use specific version tags (e.g., v1.2.3), not latest. Specific tags ensure reproducible deployments and make rollbacks possible. Using latest can cause unexpected behavior when images are updated.
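One way to catch a stray latest tag before applying anything is a quick text search over the manifests. The helper below is a sketch, not part of the Confident AI tooling: it is a plain grep for unquoted image values, not a YAML parser, and the directory in the usage comment is a hypothetical path.

```shell
# Fail if any file under the given directory pins an image to "latest".
# Simple text check for unquoted image values; it does not parse YAML.
check_no_latest() {
  dir="$1"
  if grep -rnE 'image:[[:space:]]*[^[:space:]]+:latest[[:space:]]*$' "$dir"; then
    echo "error: image references using the latest tag were found" >&2
    return 1
  fi
  echo "ok: no latest tags under $dir"
}

# Usage (hypothetical path to your manifests directory):
#   check_no_latest base/
```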
The ingress defines how external traffic reaches your services. For Azure, this uses NGINX Ingress (not AWS ALB). Edit base/network/ingress.yaml:
The tls section and cert-manager.io/cluster-issuer annotation tell cert-manager to automatically request and renew certificates:
cert-manager watches for Ingress resources with this annotation and automatically creates Certificate resources, requests certificates from the issuer, and stores them in the specified Kubernetes Secret.
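To see whether cert-manager has created and issued the certificate, you can list the Certificate resources it manages. This is a sketch: the namespace default below is a hypothetical placeholder, so set NAMESPACE to the namespace Terraform created.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"  # hypothetical namespace name

# List cert-manager Certificate resources; READY shows True once issued.
show_certificates() {
  kubectl get certificate -n "$NAMESPACE"
}

# Usage: show_certificates
```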
External Secrets syncs credentials from Azure Key Vault into Kubernetes. The configuration needs to match your deployment.
Edit base/common/external-secrets/secret-store.yaml:
The vaultUrl must match your Key Vault URI (from terraform output key_vault_uri):
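A small check can confirm the two values agree before you apply the file. The helper below is a sketch that assumes the manifest contains a single vaultUrl line; it ignores a trailing slash on the URI, and the paths in the usage comment may need adjusting to your directory layout.

```shell
# Check that a SecretStore manifest mentions the expected Key Vault URI.
# A trailing slash on the URI is ignored for the comparison.
check_vault_url() {
  file="$1"
  uri="${2%/}"
  if [ -z "$uri" ]; then
    echo "error: no Key Vault URI given" >&2
    return 1
  fi
  if grep -F "vaultUrl" "$file" | grep -qF "$uri"; then
    echo "ok: vaultUrl matches $uri"
  else
    echo "error: vaultUrl in $file does not match $uri" >&2
    return 1
  fi
}

# Usage (terraform must run where your Terraform state lives):
#   check_vault_url base/common/external-secrets/secret-store.yaml \
#     "$(terraform output -raw key_vault_uri)"
```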
Edit base/common/external-secrets/external-secrets.yaml:
The key field in each secret reference should match the Key Vault secret names created by Terraform. The secrets use hyphenated names (e.g., DATABASE-URL, BETTER-AUTH-SECRET):
Key Vault secret names use hyphens, not underscores. Azure Key Vault doesn’t allow underscores in secret names. The External Secrets config maps hyphenated Key Vault names to underscored Kubernetes secret keys.
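The naming convention can be illustrated with a one-line conversion. This helper is only a sketch of the convention; the real mapping is whatever each entry in external-secrets.yaml declares.

```shell
# Convert a hyphenated Key Vault secret name to an underscored key name.
kv_to_k8s_key() {
  printf '%s\n' "$1" | sed 's/-/_/g'
}

kv_to_k8s_key DATABASE-URL        # prints DATABASE_URL
kv_to_k8s_key BETTER-AUTH-SECRET  # prints BETTER_AUTH_SECRET
```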
Apply the manifests in the following order so that each dependency is available before the services that need it.
The namespace already exists (Terraform created it), so you only need to apply the network and secrets configuration:
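One way to apply these files is sketched below. It assumes you run kubectl from the repository root and apply each file individually; your repository may use Kustomize or a different layout. The order within the function is also an assumption.

```shell
# Apply the ingress, secret store, and external secrets manifests.
# Stops at the first failure so later files are not applied on error.
apply_base_config() {
  kubectl apply -f base/network/ingress.yaml &&
  kubectl apply -f base/common/external-secrets/secret-store.yaml &&
  kubectl apply -f base/common/external-secrets/external-secrets.yaml
}

# Usage: apply_base_config
```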
External Secrets needs to pull credentials from Key Vault before pods can start. Watch the sync status:
Wait until STATUS shows SecretSynced:
Press Ctrl+C to stop watching once synced.
Verify the Kubernetes secret was created:
Don’t proceed until secrets are synced. Pods reference this secret for database URLs, API keys, and other credentials. If you deploy before sync completes, pods will fail to start with “secret not found” errors.
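If you prefer a command that blocks until the sync finishes instead of watching manually, kubectl wait can gate the next step. This is a sketch: the namespace and ExternalSecret names are hypothetical placeholders, and it relies on External Secrets reporting a synced secret through a Ready condition.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"                    # hypothetical namespace name
EXTERNAL_SECRET="${EXTERNAL_SECRET:-confident-secrets}"   # hypothetical ExternalSecret name

# Block until the ExternalSecret reports Ready, or fail after the timeout.
wait_for_secret_sync() {
  kubectl wait --for=condition=Ready "externalsecret/$EXTERNAL_SECRET" \
    -n "$NAMESPACE" --timeout=120s
}

# Usage: continue only if the sync succeeded.
#   wait_for_secret_sync && echo "secrets synced"
```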
All pods should eventually reach Running status with all containers ready:
Backend may restart once during initial deployment. The backend runs database migrations on startup. If the database isn’t ready immediately, it may fail once and then succeed on retry. One or two restarts are normal.
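To wait for the rollout instead of polling by hand, the sketch below waits for every Deployment in the namespace to become Available and then lists the pods. The namespace default is a hypothetical placeholder, and the five-minute timeout is an arbitrary choice.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"  # hypothetical namespace name

# Wait for all Deployments to become Available, then list the pods.
wait_for_services() {
  kubectl wait --for=condition=Available deployment --all \
    -n "$NAMESPACE" --timeout=300s &&
  kubectl get pods -n "$NAMESPACE"
}

# Usage: wait_for_services
```

Waiting on Deployments instead of pods avoids timing out on completed Job pods, which never become Ready.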
If pods aren’t starting, check recent events:
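A sketch of one way to list the most recent events, sorted by time. The namespace default is a hypothetical placeholder.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"  # hypothetical namespace name

# Show the newest events in the namespace (default: the last 20 lines).
recent_events() {
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n "${1:-20}"
}

# Usage: recent_events 10
```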
The cluster can’t pull the container image. Causes:
Fix: Manually trigger ECR credential sync:
Wait 30 seconds, then check if pods start pulling images.
ECR tokens expire every 12 hours. The CronJob refreshes them automatically, but on initial deployment you may need to trigger it manually. If you see ImagePullBackOff after the cluster has been running for a while, the CronJob may have failed; check its logs.
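A manual trigger can be sketched as a one-off Job created from the CronJob. The CronJob and namespace names below are hypothetical placeholders; list the real CronJob name with kubectl get cronjobs in your namespace first.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"              # hypothetical namespace name
ECR_CRONJOB="${ECR_CRONJOB:-ecr-credentials-sync}"  # hypothetical CronJob name

# Create a one-off Job from the CronJob's template to refresh the ECR token now.
trigger_ecr_sync() {
  job_name="ecr-sync-manual-$(date +%s)"
  kubectl create job "$job_name" --from="cronjob/$ECR_CRONJOB" -n "$NAMESPACE"
}

# Usage: trigger_ecr_sync
# Inspect the result afterwards with: kubectl logs job/<job-name> -n "$NAMESPACE"
```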
The container starts but crashes. Check logs to see why:
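For a crashing container, the logs of the previous (crashed) instance are usually the useful ones. A sketch, with a hypothetical namespace default and a pod name you take from kubectl get pods:

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"  # hypothetical namespace name

# Print the last 100 log lines from the previous, crashed container instance.
crash_logs() {
  pod="$1"
  kubectl logs "$pod" -n "$NAMESPACE" --previous --tail=100
}

# Usage: crash_logs <pod-name>
```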
Common causes:
Look at the Status section for error messages. Common issues:
After applying the ingress, check if it has an address:
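A sketch of the checks involved. Both namespace defaults are assumptions: the application namespace is a hypothetical placeholder, and ingress-nginx is a common default for NGINX Ingress installs that may differ in your cluster.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"                   # hypothetical namespace name
INGRESS_NAMESPACE="${INGRESS_NAMESPACE:-ingress-nginx}"  # assumed controller namespace

# Show the application ingress (ADDRESS column).
show_ingress() {
  kubectl get ingress -n "$NAMESPACE"
}

# Show the ingress controller services (EXTERNAL-IP column).
show_ingress_controller() {
  kubectl get svc -n "$INGRESS_NAMESPACE"
}

# Usage: show_ingress; show_ingress_controller
```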
The ADDRESS column should show the NGINX Ingress load balancer IP. If it’s empty:
The EXTERNAL-IP should show the Azure Load Balancer IP. If it shows <pending>:
Once everything is running, you can scale based on load:
The evals service is most resource-intensive during evaluation runs. If evaluations are slow or timing out, scaling this service usually helps most.
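Scaling a service can be sketched with kubectl scale. The deployment name in the usage comment is a placeholder, as is the namespace default; list the real names with kubectl get deployments in your namespace.

```shell
NAMESPACE="${NAMESPACE:-confident-ai}"  # hypothetical namespace name

# Set the replica count for one Deployment.
scale_service() {
  deployment="$1"; replicas="$2"
  kubectl scale "deployment/$deployment" --replicas="$replicas" -n "$NAMESPACE"
}

# Usage (placeholder deployment name):
#   scale_service confident-evals 3
```

Note that a later kubectl apply of a manifest that sets replicas will reset the count to the value in the file.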
After all services are running and healthy, proceed to Verification to test the deployment end-to-end.