Configuration

Overview

This step configures all the variables that Terraform uses to provision your infrastructure. You will:

Copy an environment template (staging or production)
Configure VPC settings (new or existing VPC)
Set GKE node sizing and scaling parameters
Configure Cloud SQL for PostgreSQL settings
Provide domain URLs and authentication secrets
Set up ECR cross-account access credentials
Configure the Terraform state backend

After completing this page, your terraform.tfvars file will contain all values needed to provision infrastructure.

How Terraform configuration works

Terraform uses variables to customize deployments. Instead of editing the Terraform code directly, you provide values in a terraform.tfvars file. This keeps your configuration separate from the code, making updates easier.

The repository includes template files with sensible defaults. You copy a template and fill in your specific values.

Setup

Navigate to the GCP Terraform directory:

$ cd confident-terraform/gcp

Copy the appropriate environment template:

$ # For staging/development environments
$ cp vars/staging.vars terraform.tfvars
$ 
$ # For production environments
$ cp vars/production.vars terraform.tfvars

What’s the difference? Both templates use the same default instance sizes. The key difference is the environment name (stage vs. prod), which affects resource naming. You can adjust all values after copying.
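If you run the copy step a second time, it overwrites a terraform.tfvars you have already filled in. As a minimal sketch, assuming you run it from confident-terraform/gcp with the template paths above, the hypothetical init_tfvars helper below (not part of the repository) copies a template only when no terraform.tfvars exists yet.

```shell
# init_tfvars is a hypothetical helper, not part of confident-terraform.
# Usage: init_tfvars stage|prod   (run from confident-terraform/gcp)
init_tfvars() {
  local template
  case "$1" in
    stage) template="vars/staging.vars" ;;
    prod)  template="vars/production.vars" ;;
    *) echo "usage: init_tfvars stage|prod" >&2; return 2 ;;
  esac
  # Refuse to clobber a terraform.tfvars that may already hold your values.
  if [ -e terraform.tfvars ]; then
    echo "terraform.tfvars already exists; not overwriting it" >&2
    return 1
  fi
  cp "$template" terraform.tfvars
}
```

For example, init_tfvars stage copies the staging template. Running it again refuses instead of overwriting your edits.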

Open terraform.tfvars in your editor. The following sections explain each variable group.

Environment identification

These variables name and identify your deployment:

confident_application_name = "confidentai"
confident_environment      = "stage"  # or "prod"
confident_gcp_project_id   = "<your-project-id>"
confident_gcp_region       = "us-central1"
confident_gcp_zone         = "us-central1-a"

Variable	What it does
`confident_application_name`	Prefix for all GCP resource names (e.g., `confidentai-stage-gke`)
`confident_environment`	Must be exactly `stage` or `prod`—used in resource names and affects some defaults
`confident_gcp_project_id`	GCP project where everything deploys
`confident_gcp_region`	GCP region where everything deploys
`confident_gcp_zone`	Primary zone within the region for zonal resources
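The constraints in this table can be checked before you run Terraform. The sketch below is a hypothetical check_env_ids function (not part of the repository) that takes the three values as arguments. It relies on GCP zone names being the region name followed by a suffix such as -a.

```shell
# check_env_ids is a hypothetical pre-flight check, not part of the repository.
# Arguments: confident_environment, confident_gcp_region, confident_gcp_zone
check_env_ids() {
  local env="$1" region="$2" zone="$3"
  case "$env" in
    stage|prod) ;;
    *) echo "confident_environment must be exactly stage or prod, got: $env" >&2
       return 1 ;;
  esac
  # A zone name is the region name plus a suffix, e.g. us-central1 -> us-central1-a.
  case "$zone" in
    "$region"-?*) ;;
    *) echo "zone $zone is not in region $region" >&2
       return 1 ;;
  esac
}

check_env_ids stage us-central1 us-central1-a && echo "environment identifiers look consistent"
```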

Region selection matters. Choose a region close to your users and compliant with your data residency requirements. Once deployed, you cannot easily change regions—it requires a full redeployment.

Organization region restrictions: Some organizations only allow deployments in specific regions (via Org Policy gcp.resourceLocations). Verify your region is approved before proceeding.

VPC configuration

Option A: Create a new VPC (recommended)

If you’re creating a new VPC, configure the address spaces:

confident_vpc_enabled                  = true
confident_vpc_address_space            = "10.0.0.0/16"
confident_gke_subnet_cidr              = "10.0.1.0/24"
confident_gke_pods_cidr                = "10.4.0.0/14"
confident_gke_services_cidr            = "10.0.32.0/20"
confident_database_psa_cidr            = "10.0.6.0/24"
confident_public_subnet_cidr           = "10.0.101.0/24"
confident_private_endpoint_subnet_cidr = "10.0.7.0/24"

Setting	Default	What it means
`confident_vpc_address_space`	`10.0.0.0/16`	The overall IP range for the VPC
`confident_gke_subnet_cidr`	`10.0.1.0/24`	256 IPs for GKE nodes
`confident_gke_pods_cidr`	`10.4.0.0/14`	Secondary range for VPC-native pod IPs
`confident_gke_services_cidr`	`10.0.32.0/20`	Secondary range for cluster service IPs
`confident_database_psa_cidr`	`10.0.6.0/24`	PSA allocation for Cloud SQL
`confident_public_subnet_cidr`	`10.0.101.0/24`	256 IPs for public-facing resources
`confident_private_endpoint_subnet_cidr`	`10.0.7.0/24`	256 IPs for PSC endpoints

CIDR conflicts cause connectivity failures. If your corporate network uses an overlapping IP range (e.g., 10.0.x.x), you’ll have problems connecting via VPN. Common alternatives:

172.16.0.0/16 (172.16.x.x)
192.168.0.0/16 (192.168.x.x)
10.100.0.0/16 (10.100.x.x)

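Check with your network team before choosing. If you change the VPC address space, update every range in the block above (including the pods and services ranges) so that none of them overlaps your corporate network.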
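Overlaps are easy to check mechanically once you know which ranges are in use. The following is a sketch in plain shell arithmetic; ip_to_int, cidrs_overlap, and check_no_overlaps are hypothetical helpers, and the code assumes well-formed IPv4 CIDR strings without any input validation. Two blocks overlap exactly when each one starts before the other one ends.

```shell
# Hypothetical helpers, not part of the repository. IPv4 only, no validation.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local rest="$1" a b c
  a=${rest%%.*}; rest=${rest#*.}
  b=${rest%%.*}; rest=${rest#*.}
  c=${rest%%.*}; rest=${rest#*.}
  echo $(( (a << 24) + (b << 16) + (c << 8) + rest ))
}

# Succeeds (exit 0) if the two CIDR blocks share at least one address.
cidrs_overlap() {
  local net1=${1%/*} len1=${1#*/} net2=${2%/*} len2=${2#*/}
  local size1=$(( 1 << (32 - len1) ))
  local size2=$(( 1 << (32 - len2) ))
  # Align each network address down to its prefix boundary.
  local start1=$(( $(ip_to_int "$net1") / size1 * size1 ))
  local start2=$(( $(ip_to_int "$net2") / size2 * size2 ))
  [ "$start1" -lt $(( start2 + size2 )) ] && [ "$start2" -lt $(( start1 + size1 )) ]
}

# Report every overlapping pair among the arguments; fail if any pair overlaps.
check_no_overlaps() {
  local status=0 i j
  for i in "$@"; do
    shift   # the inner loop only sees the arguments after $i
    for j in "$@"; do
      if cidrs_overlap "$i" "$j"; then
        echo "overlap: $i and $j" >&2
        status=1
      fi
    done
  done
  return "$status"
}
```

For example, check_no_overlaps 10.0.0.0/16 10.0.0.0/8 reports an overlap, which is what you would see if your corporate network used all of 10.0.0.0/8. Passing the six default subnet and secondary ranges reports nothing.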

Option B: Use an existing VPC

If deploying into an existing VPC, disable VPC creation and provide the existing resource IDs:

confident_vpc_enabled            = false
external_vpc_id                  = "projects/<project-id>/global/networks/<vpc-name>"
external_vpc_address_space       = "10.0.0.0/16"
external_gke_subnet_id           = "projects/<project-id>/regions/<region>/subnetworks/<subnet-name>"
external_database_psa_range_name = "<existing-psa-range-name>"
external_public_subnet_id        = "projects/<project-id>/regions/<region>/subnetworks/<subnet-name>"
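A common mistake with an existing VPC is pasting a bare name instead of the full resource path, or a subnet from a different region. A minimal format check is sketched below; check_network_id and check_subnet_id are hypothetical names, not repository code, and they only compare each string against the patterns shown above without confirming that the resources exist.

```shell
# check_network_id and check_subnet_id are hypothetical helpers, not repository code.
# They check only the shape of the strings; they do not query GCP.
check_network_id() {
  case "$1" in
    projects/?*/global/networks/?*) ;;
    *) echo "expected projects/<project-id>/global/networks/<vpc-name>: $1" >&2
       return 1 ;;
  esac
}

# Second argument: the value of confident_gcp_region.
check_subnet_id() {
  case "$1" in
    projects/?*/regions/"$2"/subnetworks/?*) ;;
    *) echo "expected a subnetwork ID in region $2: $1" >&2
       return 1 ;;
  esac
}
```

To confirm that a subnet actually exists and to see its secondary ranges, one option is gcloud compute networks subnets describe <subnet-name> --region=<region> --project=<project-id>.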

Using an existing VPC requires coordination with your network team. You need:

Subnet IDs with available IP addresses and secondary ranges for pods/services
An existing Private Service Access allocation for Cloud SQL
Firewall rules that don’t block required traffic
Cloud NAT or outbound internet access for pulling images

Many existing VPCs have restrictive firewall rules or missing Cloud NAT configurations that will cause deployment failures.

GKE node configuration

These settings control the VMs that run your Kubernetes workloads:

confident_node_machine_type       = "n2-standard-8"
confident_node_group_min_size     = 2
confident_node_group_max_size     = 8
confident_node_group_desired_size = 4
confident_kubernetes_version      = "1.31"

Variable	What it controls
`confident_node_machine_type`	Machine type—determines CPU and memory per worker node
`confident_node_group_min_size`	Cluster autoscaler won’t go below this
`confident_node_group_max_size`	Cluster autoscaler won’t exceed this
`confident_node_group_desired_size`	How many worker nodes to start with
`confident_kubernetes_version`	GKE Kubernetes version

Recommended sizes:

Environment	Machine Type	vCPU	Memory	Min/Max Nodes
Staging	`n2-standard-8`	8	32 GB	2-8
Production	`n2-standard-8`	8	32 GB	2-8
High volume	`n2-standard-16`	16	64 GB	4-12

Besides the worker pool, the deployment also creates a fixed system node pool with 2x n2-standard-4 nodes for Kubernetes system components. This pool is separate from the worker pool configured above and is not controlled by these variables.

GCP CPU quotas can block deployment. GCP projects have default limits on how many CPUs you can run per VM family per region.

Check your quotas: GCP Console → IAM & Admin → Quotas → Filter by “N2 CPUs” in your region

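Request an increase if your limit is below (system_nodes × 4) + (desired_worker_nodes × vCPUs per node). With the defaults, that is (2 × 4) + (4 × 8) = 40 N2 vCPUs. Because the cluster autoscaler can grow the worker pool up to confident_node_group_max_size, size the quota for the maximum if scale-ups must succeed: (2 × 4) + (8 × 8) = 72 N2 vCPUs with the defaults.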
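The quota formula and the min/desired/max ordering can be sketched as two small shell functions. Both names are hypothetical, and the code assumes the fixed system pool of 2x n2-standard-4 nodes (4 vCPUs each) described above plus N2 worker machine types.

```shell
# Hypothetical helpers, not part of the repository.
# n2_cpus_needed WORKER_NODES VCPUS_PER_WORKER
# Assumes the fixed system pool: 2 x n2-standard-4 nodes, 4 vCPUs each.
n2_cpus_needed() {
  local workers="$1" vcpus_per_worker="$2"
  echo $(( 2 * 4 + workers * vcpus_per_worker ))
}

# check_node_sizes MIN DESIRED MAX: the autoscaler bounds must bracket the start size.
check_node_sizes() {
  if [ "$1" -le "$2" ] && [ "$2" -le "$3" ]; then
    return 0
  fi
  echo "expected min <= desired <= max, got $1 / $2 / $3" >&2
  return 1
}

check_node_sizes 2 4 8 && echo "N2 vCPUs at desired size: $(n2_cpus_needed 4 8)"
```

Passing confident_node_group_max_size as the first argument instead, as in n2_cpus_needed 8 8, sizes the quota for full autoscaling.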

Database configuration

Cloud SQL for PostgreSQL settings:

confident_psql_tier                   = "db-custom-4-16384"
confident_psql_disk_size_gb           = 64
confident_psql_db_name                = "confident_db"
confident_psql_username               = "confident_admin"
confident_psql_password               = "<your-generated-password>"
confident_psql_version                = "POSTGRES_17"
confident_psql_backup_retention_days  = 7
confident_psql_point_in_time_recovery = true
confident_psql_high_availability      = true

Variable	What it controls
`confident_psql_tier`	Cloud SQL tier in the form `db-custom-<vCPUs>-<memory in MB>`; the default is 4 vCPUs and 16 GB
`confident_psql_disk_size_gb`	Storage size in GB
`confident_psql_db_name`	Name of the database created
`confident_psql_username`	Administrator username
`confident_psql_password`	Administrator password
`confident_psql_version`	PostgreSQL version
`confident_psql_backup_retention_days`	Number of days to retain automated backups (1-365)
`confident_psql_point_in_time_recovery`	Enable point-in-time recovery (PITR), which archives write-ahead logs
`confident_psql_high_availability`	Enable regional HA (standby in another zone)

Generate a strong database password. Use openssl rand -base64 24 to create a secure random password. Terraform stores this value in Google Secret Manager, and it is also recorded in the Terraform state, so restrict access to the state bucket. Do not commit it to version control.
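This page asks for several random values (the ClickHouse and ArgoCD passwords as well), so a small wrapper around the same openssl command keeps them consistent. gen_secret is a hypothetical helper. The /dev/urandom fallback is an assumption for machines without openssl; it produces the same shape of output: 24 random bytes, base64-encoded into 32 characters.

```shell
# gen_secret is a hypothetical helper: 24 random bytes, base64-encoded (32 chars).
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 24
  else
    # Fallback for systems without openssl; same entropy and output length.
    head -c 24 /dev/urandom | base64
  fi
}
```

Run it once per value (for example confident_psql_password, confident_clickhouse_password, and argocd_admin_password) and paste each result into terraform.tfvars.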

Regional HA is enabled by default. This creates a standby replica in a different zone for automatic failover. Disable it for development/testing environments to reduce costs.

Domain and URL configuration

confident_frontend_url = "https://app.yourdomain.com"
confident_backend_url  = "https://api.yourdomain.com"
confident_subdomain    = "yourdomain.com"

Variable	Purpose	Example
`confident_frontend_url`	Full URL users type in browser	`https://app.confidentai.acme.com`
`confident_backend_url`	Full URL for API calls	`https://api.confidentai.acme.com`
`confident_subdomain`	Root domain for auth cookies	`acme.com`

confident_subdomain must be set to the root domain, not to a subdomain.

Correct: confident_subdomain = "acme.com"
Wrong: confident_subdomain = "confidentai.acme.com"

Authentication cookies are set on this domain and must be readable by both the frontend and the backend. If you use the full subdomain, cookies won’t work correctly.
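One mismatch that is easy to catch is a URL whose host does not fall under confident_subdomain at all. The sketch below (check_domains is a hypothetical helper) checks only that. It cannot tell whether a value is a registrable root domain, so a multi-level value such as confidentai.acme.com still passes and the rule above still applies. Accepting both http:// and https:// is an assumption; the examples on this page use https.

```shell
# check_domains is a hypothetical helper, not part of the repository.
# Arguments: confident_frontend_url, confident_backend_url, confident_subdomain
check_domains() {
  local url host
  for url in "$1" "$2"; do
    case "$url" in
      http://*|https://*) ;;
      *) echo "not a full URL (missing scheme): $url" >&2; return 1 ;;
    esac
    host=${url#*://}   # drop the scheme
    host=${host%%/*}   # drop any path
    host=${host%%:*}   # drop any port
    case "$host" in
      *."$3") ;;
      *) echo "$host is not under $3; cookies set on $3 will not reach it" >&2
         return 1 ;;
    esac
  done
}
```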

Authentication secrets

confident_better_auth_secret          = "<your-generated-secret>"
confident_better_auth_trusted_origins = "https://app.yourdomain.com"
confident_google_client_id            = "<google-oauth-client-id>"
confident_google_client_secret        = "<google-oauth-client-secret>"

Variable	What it’s for
`confident_better_auth_secret`	Encrypts authentication tokens—use the value you generated in Prerequisites
`confident_better_auth_trusted_origins`	URLs allowed to make authenticated requests—typically your frontend URL
`confident_google_client_id`	Google OAuth Client ID (if using Google SSO)
`confident_google_client_secret`	Google OAuth Client Secret (if using Google SSO)

Trusted origins must include the protocol. Use https://app.yourdomain.com, not app.yourdomain.com. A missing protocol causes authentication to fail silently.
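Because this failure is silent, it is worth checking the value before deploying. check_origin below is a hypothetical helper that validates a single origin string. Rejecting a path or trailing slash is an extra assumption, based on browsers sending the Origin header as scheme://host[:port] with no path; check each origin separately.

```shell
# check_origin is a hypothetical helper, not part of the repository.
# Argument: one trusted origin, e.g. https://app.yourdomain.com
check_origin() {
  case "$1" in
    http://*|https://*) ;;
    *) echo "origin is missing http:// or https://: $1" >&2; return 1 ;;
  esac
  # Assumption: an origin is scheme://host[:port] with no path or trailing slash.
  case "${1#*://}" in
    */*) echo "origin should not contain a path or trailing slash: $1" >&2
         return 1 ;;
  esac
}
```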

External services

openai_api_key                = "<your-openai-api-key>"
confident_clickhouse_password = "<your-generated-password>"
argocd_admin_password         = "<your-generated-password>"

Variable	What it’s for
`openai_api_key`	API key for running LLM evaluations
`confident_clickhouse_password`	Password for the analytics database
`argocd_admin_password`	Admin password for the ArgoCD GitOps dashboard

OpenAI API key requires sufficient quota. Evaluations can consume significant tokens. Ensure your OpenAI account has appropriate rate limits and spending caps configured.

Resource naming

These variables control internal naming conventions for GCP resources. The defaults are suitable for most deployments:

confident_application_code = "cai"
confident_environment_code = "s"   # "s" for stage, "p" for prod
confident_region_prefix    = "gcuc"
confident_labels = {
  environment = "stage"
  project     = "confidentai"
}

Variable	What it controls
`confident_application_code`	Short code used in resource identifiers (e.g., node pool name)
`confident_environment_code`	`s` for stage, `p` for prod—used in resource identifiers
`confident_region_prefix`	Region abbreviation for resource identifiers
`confident_labels`	Labels applied to all GCP resources
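After switching a copied staging template to prod, it is easy to leave the short code or the environment label behind. check_env_codes is a hypothetical helper; the s/p mapping comes from the table above, while requiring the environment label to equal confident_environment is an assumption that matches the staging template’s defaults.

```shell
# check_env_codes is a hypothetical helper, not part of the repository.
# Arguments: confident_environment, confident_environment_code,
#            the environment value inside confident_labels
check_env_codes() {
  case "$1:$2" in
    stage:s|prod:p) ;;
    *) echo "confident_environment_code $2 does not match environment $1" >&2
       return 1 ;;
  esac
  # Assumption: the label should carry the same value as confident_environment.
  if [ "$3" != "$1" ]; then
    echo "confident_labels environment is $3, expected $1" >&2
    return 1
  fi
}
```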

GKE access configuration

confident_public_gke             = false
confident_gke_admin_group_emails = []

Variable	What it controls
`confident_public_gke`	When `true`, the GKE API is publicly accessible and NGINX Ingress is internet-facing. When `false` (default), the GKE cluster is private and uses internal load balancers.
`confident_gke_admin_group_emails`	Google Group emails to grant GKE cluster admin access

Public GKE is only recommended for testing. Setting confident_public_gke = true makes the GKE API server and ingress accessible from the internet. Never use this in production.

Storage configuration

confident_test_cases_bucket = "testcases"
confident_payloads_bucket   = "payloads"

Variable	What it controls
`confident_test_cases_bucket`	Name suffix for the test cases GCS bucket
`confident_payloads_bucket`	Name suffix for the payloads GCS bucket

Bucket names are constructed as <application_name>-<environment>-<suffix> (e.g., confidentai-stage-testcases).
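GCS bucket names are global across all projects, so a name built from the default application name may already be taken; if so, change confident_application_name or the suffix. The hypothetical bucket_name helper below assembles a name with the pattern above and applies the basic GCS rules for names without dots: 3 to 63 characters, only lowercase letters, digits, hyphens, and underscores, and a letter or digit at both ends. It cannot check global availability.

```shell
# bucket_name is a hypothetical helper, not part of the repository.
# Arguments: confident_application_name, confident_environment, bucket suffix
bucket_name() {
  local name="$1-$2-$3"
  case "$name" in
    *[![:lower:][:digit:]_-]*)
      echo "$name: only lowercase letters, digits, - and _ are allowed" >&2
      return 1 ;;
    [![:lower:][:digit:]]*|*[![:lower:][:digit:]])
      echo "$name: must start and end with a letter or digit" >&2
      return 1 ;;
  esac
  if [ "${#name}" -lt 3 ] || [ "${#name}" -gt 63 ]; then
    echo "$name: must be 3 to 63 characters long" >&2
    return 1
  fi
  echo "$name"
}
```

For example, bucket_name confidentai stage testcases prints confidentai-stage-testcases.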

ClickHouse configuration

ClickHouse serves as the analytics database, deployed via the ClickHouse Operator on GKE:

confident_clickhouse_user             = "default"
confident_clickhouse_password         = "<your-generated-password>"
confident_clickhouse_database         = "confident_db"
confident_clickhouse_operator_version = "0.0.1"
confident_clickhouse_version          = "25.12"
confident_clickhouse_shards           = 1
confident_clickhouse_replicas         = 2
confident_clickhouse_storage_size     = "500Gi"
confident_clickhouse_storage_class    = "clickhouse-premium"

Variable	What it controls
`confident_clickhouse_user`	ClickHouse username (keep as `default`)
`confident_clickhouse_password`	ClickHouse password
`confident_clickhouse_database`	Database name within ClickHouse
`confident_clickhouse_operator_version`	Helm chart version for the ClickHouse Operator
`confident_clickhouse_version`	ClickHouse server version
`confident_clickhouse_shards`	Number of shards in the cluster
`confident_clickhouse_replicas`	Replicas per shard (2 recommended for HA)
`confident_clickhouse_storage_size`	Persistent volume size per ClickHouse pod
`confident_clickhouse_storage_class`	StorageClass for ClickHouse PVCs (GCP pd-ssd persistent disks)

Do not change confident_clickhouse_user from default. The ClickHouse Operator expects this username. Changing it will cause connectivity failures.
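Because confident_clickhouse_storage_size applies to each ClickHouse pod, total disk grows with the cluster shape: one server pod per replica of each shard. clickhouse_total_gi is a hypothetical helper that does this arithmetic. It assumes sizes are given in Gi and counts only ClickHouse server volumes, not any other components the operator may run.

```shell
# clickhouse_total_gi is a hypothetical helper, not part of the repository.
# Arguments: shards, replicas, per-pod storage size (e.g. 500Gi)
clickhouse_total_gi() {
  local shards="$1" replicas="$2" size="${3%Gi}"
  # One server pod per replica of each shard, each with its own volume.
  echo $(( shards * replicas * size ))
}

echo "Total ClickHouse storage: $(clickhouse_total_gi 1 2 500Gi)Gi"
```

The last line prints the total for the defaults. Because the storage class uses pd-ssd persistent disks, this total also counts against the project’s regional SSD persistent disk quota, which you can review on the same Quotas page as the CPU quotas.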

ClickHouse backup

A GCS bucket is created for ClickHouse backups:

confident_clickhouse_backup_bucket   = "chbackups"
confident_clickhouse_backup_schedule = "0 2 * * *"

Variable	What it controls
`confident_clickhouse_backup_bucket`	Name suffix for the ClickHouse backup bucket
`confident_clickhouse_backup_schedule`	Cron schedule for automated backups, in UTC (the default `0 2 * * *` runs daily at 02:00 UTC)

Backup lifecycle policy: Terraform configures a lifecycle policy that automatically deletes ClickHouse backup objects older than 30 days and snapshots older than 7 days.

ECR cross-account access

These credentials allow your GKE cluster to pull Confident AI container images from AWS ECR:

ecr_aws_access_key_id     = "<provided-by-confident-ai>"
ecr_aws_secret_access_key = "<provided-by-confident-ai>"
ecr_aws_account_id        = "<provided-by-confident-ai>"
ecr_aws_region            = "us-east-1"

These values are provided by your Confident AI representative. Don’t modify them unless instructed.
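Although you should not change these values, a quick sanity check can catch copy-and-paste errors. AWS account IDs are always 12 digits, and an ECR registry host has the form <account-id>.dkr.ecr.<region>.amazonaws.com. ecr_registry is a hypothetical helper that checks the first and prints the second, which can help when reading image pull errors.

```shell
# ecr_registry is a hypothetical helper, not part of the repository.
# Arguments: ecr_aws_account_id, ecr_aws_region
ecr_registry() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "AWS account IDs are exactly 12 digits, got: $1" >&2
       return 1 ;;
  esac
  echo "$1.dkr.ecr.$2.amazonaws.com"
}
```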

Terraform state backend

Terraform tracks what resources it created in a “state file.” This should be stored remotely so multiple team members can collaborate and state isn’t lost.

Edit provider.tf to configure your GCS backend:

terraform {
  backend "gcs" {
    bucket = "your-company-tfstate"
    prefix = "confident-ai/staging"
  }
}

Setting	What it does
`bucket`	GCS bucket name for state storage
`prefix`	Path within the bucket (use different prefixes for staging vs. production)

If the bucket doesn’t exist, create it:

$ gcloud storage buckets create gs://your-company-tfstate \
>   --project=<your-project-id> \
>   --location=us-central1 \
>   --uniform-bucket-level-access
$ 
$ gcloud storage buckets update gs://your-company-tfstate \
>   --versioning

Your organization may have existing Terraform state infrastructure. Many companies have:

Centralized state buckets managed by a platform team
Required naming conventions
Required encryption settings (CMEK)
Org Policy requirements on GCS buckets

Check with your infrastructure team before creating a new bucket.

Never delete or modify the state file manually. Terraform state tracks the mapping between your configuration and real GCP resources. Corrupting it can cause Terraform to lose track of resources, leading to orphaned infrastructure or accidental deletions.

Security review checklist

Before proceeding, verify these security considerations:

terraform.tfvars is in .gitignore (never commit secrets)
State bucket has versioning enabled (for recovery from mistakes)
State bucket is encrypted at rest (default with Google-managed keys)
Identity used has least-privilege permissions
CIDR blocks don’t conflict with corporate network
OpenAI API key has appropriate spending limits

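$ # Add tfvars patterns to .gitignore only if they are not already present
$ grep -qxF "terraform.tfvars" .gitignore 2>/dev/null || echo "terraform.tfvars" >> .gitignore
$ grep -qxF "*.tfvars" .gitignore 2>/dev/null || echo "*.tfvars" >> .gitignore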
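Before moving on, it helps to confirm that no template placeholders remain. check_placeholders is a hypothetical helper; treating any <...> value or the example domain yourdomain.com as unfinished is an assumption based on the placeholders used on this page.

```shell
# check_placeholders is a hypothetical helper, not part of the repository.
# Prints each line that still holds a placeholder and fails if any are found.
check_placeholders() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  if grep -nE '<[^>]+>|yourdomain\.com' "$1"; then
    echo "replace the placeholders listed above before provisioning" >&2
    return 1
  fi
}
```

Run it as check_placeholders terraform.tfvars. To confirm the file is actually ignored by Git, git check-ignore -q terraform.tfvars exits with status 0 when it is.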

Next steps

Once configuration is complete, proceed to Provisioning to create the GCP infrastructure.