Kubernetes Deployment

Overview

With infrastructure and cluster access ready, you now deploy the Confident AI application services. This step covers:

  • Preparing Kubernetes manifests provided by Confident AI
  • Updating container image tags to the versions provided by Confident AI
  • Configuring the ingress with your domain names and TLS settings
  • Setting up External Secrets to sync credentials from Google Secret Manager
  • Deploying services in the correct order: base config → Redis → application services
  • Monitoring deployment status and troubleshooting common issues

After completion, all Confident AI services will be running in your GKE cluster.

How Kubernetes deployments work

Kubernetes uses YAML manifests to describe what you want to run. A Deployment tells Kubernetes:

  • What container image to use
  • How many replicas (copies) to run
  • What environment variables and secrets to inject
  • Resource limits (CPU, memory)

When you run kubectl apply on a manifest, Kubernetes:

  1. Reads your desired state
  2. Compares it to current state
  3. Creates, updates, or deletes resources to match
  4. Continuously ensures the desired state is maintained
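
The reconcile steps above can be previewed before anything changes. The following is a minimal sketch, assuming kubectl is already configured for your cluster; preview_and_apply is a hypothetical helper name, not something shipped with the Confident AI manifests. It relies on the documented exit codes of kubectl diff (0 when there are no differences, 1 when differences exist, greater than 1 on error).

```shell
# Hypothetical helper: show what would change, then apply only if needed.
preview_and_apply() {
  path="$1"
  rc=0
  # kubectl diff compares the manifests (desired state) with the live cluster.
  kubectl diff -f "$path" || rc=$?
  case "$rc" in
    0) echo "no changes for $path" ;;
    1) kubectl apply -f "$path" ;;
    *) echo "kubectl diff failed for $path (exit $rc)" >&2
       return "$rc" ;;
  esac
}
```

For example, preview_and_apply base/network/ prints the pending diff and then applies it. Applying only when a diff exists is a design choice for readability of the output; kubectl apply on unchanged manifests is also safe.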

Manifest structure

The Kubernetes manifests are organized by service:

base/                        # Shared configuration
├── namespace.yaml           # The confident-ai namespace (already created)
├── network/
│   └── ingress.yaml         # Load balancer and routing rules
└── common/
    └── external-secrets/    # Secrets sync configuration
confident-backend/           # API service
├── deployment.yaml
└── service.yaml
confident-frontend/          # Web dashboard
confident-evals/             # Evaluation service
confident-otel/              # OpenTelemetry collector
redis/                       # Cache service

Prepare manifests

Update image tags

Container images are stored in Confident AI’s AWS ECR. Your Confident AI representative will provide the specific version tags to use.

Open each deployment file and update the image field:

| File | What to update |
|---|---|
| confident-backend/deployment.yaml | Backend API image |
| confident-frontend/deployment.yaml | Frontend dashboard image |
| confident-evals/deployment.yaml | Evaluation service image |
| confident-otel/deployment.yaml | OTEL collector image |

Example change in confident-backend/deployment.yaml:

containers:
  - name: confident-backend
    image: 128045499490.dkr.ecr.us-east-1.amazonaws.com/confidentai/confident-backend:v1.2.3

Image tag format: <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/<service>:<version>

The ECR account ID and region are the values provided by Confident AI for ECR access. The version tag (e.g., v1.2.3) is what your representative will provide.

Use exact tags, not latest. Always pin a specific version tag (e.g., v1.2.3). Specific tags make deployments reproducible and rollbacks possible, whereas latest can resolve to a different image whenever images are updated, causing unexpected behavior.
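
To catch a malformed or unpinned image reference before deploying, a small check can help. This is a sketch under stated assumptions: check_image_ref is a hypothetical helper, the pattern simply encodes the image tag format given above (with a 12-digit AWS account ID), and the allowed characters for service and version are a guess that may need loosening for your tags.

```shell
# Hypothetical helper: returns 0 if the reference matches the documented
# <ecr-account>.dkr.ecr.<region>.amazonaws.com/confidentai/<service>:<version>
# format and does not use the "latest" tag; returns 1 otherwise.
check_image_ref() {
  ref="$1"
  pattern='^[0-9]{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/confidentai/[a-z0-9-]+:[A-Za-z0-9._-]+$'
  if ! printf '%s\n' "$ref" | grep -Eq "$pattern"; then
    echo "malformed image reference: $ref" >&2
    return 1
  fi
  case "$ref" in
    *:latest)
      echo "unpinned tag (latest): $ref" >&2
      return 1 ;;
  esac
  return 0
}
```

One way to use it is to pass each value of the image field from the four deployment.yaml files to check_image_ref and stop if any call fails.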

Configure ingress

The ingress defines how external traffic reaches your services. For GCP, this uses NGINX Ingress (not AWS ALB). Edit base/network/ingress.yaml:

1. Set the ingress class

Ensure the ingress uses NGINX:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: confident-ingress
  namespace: confident-ai
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx

2. Set your domain names

Find and replace the placeholder hostnames with your actual domains:

spec:
  tls:
    - hosts:
        - app.yourdomain.com
        - api.yourdomain.com
        - deepeval.yourdomain.com
        - otel.yourdomain.com
      secretName: confident-tls
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: confident-frontend
                port:
                  number: 3000

    - host: api.yourdomain.com
      # ...

    - host: deepeval.yourdomain.com
      # ...

    - host: otel.yourdomain.com
      # ...

3. Configure TLS

The tls section and cert-manager.io/cluster-issuer annotation tell cert-manager to automatically request and renew certificates:

metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod # or your ClusterIssuer name
spec:
  tls:
    - hosts:
        - app.yourdomain.com
        - api.yourdomain.com
      secretName: confident-tls

cert-manager watches for Ingress resources with this annotation and automatically creates Certificate resources, requests certificates from the issuer, and stores them in the specified Kubernetes Secret.

Configure External Secrets

External Secrets syncs credentials from Google Secret Manager into Kubernetes. The configuration needs to match your deployment.

Update the secret store

Edit base/common/external-secrets/secret-store.yaml:

The projectID must match your GCP project (from terraform output gcp_project_id):

spec:
  provider:
    gcpsm:
      projectID: "your-gcp-project-id"
      auth:
        workloadIdentity:
          clusterLocation: us-central1
          clusterName: confidentai-stage-gke
          serviceAccountRef:
            name: external-secrets-sa

Update the secret references

Edit base/common/external-secrets/external-secrets.yaml:

The key field in each secret reference should match the Secret Manager secret names created by Terraform. The secrets use hyphenated names (e.g., DATABASE-URL, BETTER-AUTH-SECRET):

data:
  - secretKey: DATABASE_URL
    remoteRef:
      key: DATABASE-URL
  - secretKey: REDIS_URL
    remoteRef:
      key: REDIS-URL

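Secret Manager secret IDs use hyphens here, not underscores. The secrets created by Terraform use hyphenated IDs (Google Secret Manager itself accepts both hyphens and underscores in secret IDs, so check the actual IDs in your project). The External Secrets config maps each hyphenated Secret Manager ID to an underscored Kubernetes secret key.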
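
The mapping from underscored Kubernetes keys to hyphenated Secret Manager IDs is mechanical, so the data entries can be generated rather than typed by hand. The sketch below assumes every secret follows the same underscore-to-hyphen convention shown above; to_secret_id and emit_data_entries are hypothetical helper names, and the output still needs to be pasted under data: in external-secrets.yaml and reviewed.

```shell
# Hypothetical helper: DATABASE_URL -> DATABASE-URL
to_secret_id() {
  printf '%s\n' "$1" | tr '_' '-'
}

# Hypothetical helper: print one ExternalSecret data entry per key given.
emit_data_entries() {
  for key in "$@"; do
    printf '  - secretKey: %s\n    remoteRef:\n      key: %s\n' \
      "$key" "$(to_secret_id "$key")"
  done
}
```

Calling emit_data_entries DATABASE_URL REDIS_URL prints entries equivalent to the two shown in the example above.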

Deploy services

Deploy in order to ensure dependencies are available when needed.

1. Apply network and secrets configuration

The namespace already exists (Terraform created it), but apply the network and secrets configuration:

$ kubectl apply -f base/network/
$ kubectl apply -f base/common/external-secrets/

2. Wait for secrets to sync

External Secrets needs to pull credentials from Secret Manager before pods can start. Watch the sync status:

$ kubectl get externalsecret -n confident-ai -w

Wait until STATUS shows SecretSynced:

NAME                       STORE                          REFRESH   STATUS
confident-externalsecret   confident-clustersecretstore   1h        SecretSynced

Press Ctrl+C to stop watching once synced.

Verify the Kubernetes secret was created:

$ kubectl get secret confident-externalsecret -n confident-ai

Don’t proceed until secrets are synced. Pods reference this secret for database URLs, API keys, and other credentials. If you deploy before sync completes, pods will fail to start with “secret not found” errors.
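
When scripting the deployment, the manual watch can be replaced with a blocking wait. This is a minimal sketch, assuming the ExternalSecret reports a Ready condition once it shows SecretSynced (the standard External Secrets Operator behavior) and that the resource names match the output shown above; wait_for_secret_sync and the 120s default timeout are illustrative choices, not values from Confident AI.

```shell
# Hypothetical helper: block until the ExternalSecret is Ready, then confirm
# that the Kubernetes Secret it manages exists.
wait_for_secret_sync() {
  name="${1:-confident-externalsecret}"
  ns="${2:-confident-ai}"
  timeout="${3:-120s}"
  kubectl wait --for=condition=Ready "externalsecret/$name" \
    -n "$ns" --timeout="$timeout" \
    && kubectl get secret "$name" -n "$ns" >/dev/null
}
```

A non-zero exit status means the sync did not complete in time or the Secret is missing, in which case the remaining deployment steps should not run.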

3. Deploy Redis

Redis must be running before the backend, which uses it for caching and job queues:

$ kubectl apply -f redis/

Verify it’s running:

$ kubectl get pods -n confident-ai -l app=redis

Wait for Running status before continuing.

4. Deploy application services

Now deploy the Confident AI services:

$ kubectl apply -f confident-backend/
$ kubectl apply -f confident-frontend/
$ kubectl apply -f confident-evals/
$ kubectl apply -f confident-otel/
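
The application step can also be scripted so that it fails fast if a rollout stalls. The sketch below assumes each Deployment is named after its directory (confident-backend, confident-frontend, confident-evals, confident-otel), which matches the pod names and scale commands on this page but should be checked against your manifests; deploy_app_services and the 5m timeout are hypothetical.

```shell
# Hypothetical helper: apply the four application manifests, then wait for
# each Deployment to finish rolling out.
deploy_app_services() {
  ns="${1:-confident-ai}"
  services="confident-backend confident-frontend confident-evals confident-otel"
  for svc in $services; do
    kubectl apply -f "$svc/" || return 1
  done
  for svc in $services; do
    # rollout status blocks until the Deployment is rolled out or times out.
    kubectl rollout status "deployment/$svc" -n "$ns" --timeout=5m || return 1
  done
}
```

Run it from the directory that contains the manifest folders, after Redis is up.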

Monitor deployment

Watch pod status

$ kubectl get pods -n confident-ai -w

All pods should eventually reach Running status with all containers ready:

NAME                         READY   STATUS    RESTARTS   AGE
confident-backend-xxx-yyy    1/1     Running   0          2m
confident-frontend-xxx-yyy   1/1     Running   0          2m
confident-evals-xxx-yyy      1/1     Running   0          2m
confident-otel-xxx-yyy       1/1     Running   0          2m
confident-redis-xxx-yyy      1/1     Running   0          3m

The backend may restart during initial deployment. The backend runs database migrations on startup. If the database isn’t ready immediately, the first attempt may fail and then succeed on retry. One or two restarts are normal.
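
For a quick summary instead of a live watch, the pod table can be filtered down to the pods that still need attention. This is a sketch that assumes the default kubectl get pods column layout shown above (NAME, READY, STATUS first); unhealthy_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# name and status of every pod that is not Running with all containers ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}
```

Usage: kubectl get pods -n confident-ai | unhealthy_pods. Empty output means every pod is ready. Pods from finished Jobs (status Completed) are also listed, since the filter only treats fully ready Running pods as healthy.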

Check events

If pods aren’t starting, check recent events:

$ kubectl get events -n confident-ai --sort-by='.lastTimestamp' | tail -20

Common deployment issues

ImagePullBackOff

confident-backend-xxx 0/1 ImagePullBackOff 0 2m

The cluster can’t pull the container image. Causes:

  1. ECR credentials not synced: The CronJob that refreshes ECR tokens hasn’t run yet
  2. Wrong image tag: The specified version doesn’t exist
  3. Network issues: Nodes can’t reach ECR

Fix: Manually trigger ECR credential sync:

$ kubectl create job --from=cronjob/ecr-credentials-sync manual-ecr-sync -n confident-ai

Wait 30 seconds, then check if pods start pulling images.

ECR tokens expire every 12 hours. The CronJob refreshes them automatically, but on initial deployment you may need to trigger it manually. If you see ImagePullBackOff after the cluster has been running for a while, the CronJob may have failed; check its logs.

CrashLoopBackOff

confident-backend-xxx 0/1 CrashLoopBackOff 5 10m

The container starts but crashes. Check logs to see why:

$ kubectl logs deployment/confident-backend -n confident-ai

Common causes:

| Error in logs | Cause | Fix |
|---|---|---|
| "ECONNREFUSED" to database | Cloud SQL not accessible | Check firewall allows traffic from GKE to PSA range |
| "Secret not found" | External Secrets not synced | Wait for sync or check ExternalSecret status |
| "Invalid DATABASE_URL" | Wrong format in Secret Manager | Verify secret value in GCP Console |
| "OPENAI_API_KEY not set" | Missing secret | Verify Secret Manager has all required secrets |

External Secrets not syncing

$ kubectl describe externalsecret confident-externalsecret -n confident-ai

Look at the Status section for error messages. Common issues:

| Error | Cause | Fix |
|---|---|---|
| "PermissionDenied" | Service account lacks role | Check Workload Identity has Secret Manager Secret Accessor |
| "SecretNotFound" | Secret name doesn’t match | Verify secret IDs in Secret Manager match ExternalSecret |
| "Invalid ClusterSecretStore" | Store misconfigured | Check projectID in secret-store.yaml matches your project |

Ingress not getting external IP

After applying the ingress, check if it has an address:

$ kubectl get ingress -n confident-ai

The ADDRESS column should show the NGINX Ingress load balancer IP. If it’s empty:

$ kubectl get svc -n ingress-nginx ingress-nginx-controller

The EXTERNAL-IP should show the GCP Load Balancer IP. If it shows <pending>:

  • Firewall blocking: Firewall rules may not allow inbound traffic on ports 80/443
  • Quota issues: External IP quota may be exhausted
  • Org Policy: Policies may block external IPs or load balancer creation
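
Load balancer provisioning can take a few minutes, so it may be worth polling before assuming one of the causes above. The following sketch assumes the controller Service name and namespace shown in the command above; wait_for_lb_ip, the default of 30 attempts, and the POLL_SECONDS variable are hypothetical.

```shell
# Hypothetical helper: poll the NGINX Ingress controller Service until it has
# an external IP, print the IP, or give up after the given number of attempts.
wait_for_lb_ip() {
  tries="${1:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    ip="$(kubectl get svc ingress-nginx-controller -n ingress-nginx \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true)"
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  echo "no external IP after $tries attempts" >&2
  return 1
}
```

If the function times out, work through the firewall, quota, and Org Policy checks listed above.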

Scaling services

Once everything is running, you can scale based on load:

# Scale backend for more API capacity
$ kubectl scale deployment confident-backend --replicas=3 -n confident-ai

# Scale frontend if dashboard is slow
$ kubectl scale deployment confident-frontend --replicas=2 -n confident-ai

# Scale evals for more concurrent evaluations
$ kubectl scale deployment confident-evals --replicas=3 -n confident-ai

The evals service is most resource-intensive during evaluation runs. If evaluations are slow or timing out, scaling this service usually helps most.

Next steps

After all services are running and healthy, proceed to Verification to test the deployment end-to-end.