Requirements | Confident AI Docs

Overview

Before starting deployment, review these requirements with your infrastructure and security teams. This page covers:

Technologies that need approval in your environment
Resource sizing for staging and production
GCP services that will be provisioned
Permissions required for deployment
Estimated costs and considerations

Understanding these requirements upfront prevents delays caused by missing approvals or insufficient quotas.

Technologies

Confident AI uses the technologies listed below. Your organization may require approval before deploying any that are new to your environment:

Technology	Version	Purpose
PostgreSQL	17.x	Primary database (via Cloud SQL for PostgreSQL)
ClickHouse	Latest	Analytics database (via ClickHouse Operator)
Redis	7.x	Session cache and job queues
Kubernetes	1.31+	Container orchestration (via GKE)
Node.js	20.x	Backend and frontend runtime
Python	3.11+	Evaluations service runtime
Terraform	1.5+	Infrastructure provisioning
Helm	3.x	Kubernetes package management
ArgoCD	2.x	GitOps continuous delivery
External Secrets Operator	0.9+	Secrets sync from Google Secret Manager
cert-manager	1.17+	TLS certificate automation
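As a rough illustration of checking the "+" minimums in the table, the sketch below compares version strings numerically. The dictionary keys (`kubernetes`, `external-secrets`, and so on) are hypothetical labels chosen for this example, and how you collect the installed versions is left to you; this is not part of the Confident AI tooling.

```python
import re

# Minimum versions copied from the "+" entries in the table above.
# The keys are hypothetical labels used only in this sketch.
MINIMUM_VERSIONS = {
    "kubernetes": "1.31",
    "python": "3.11",
    "terraform": "1.5",
    "external-secrets": "0.9",
    "cert-manager": "1.17",
}


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. 'v1.31.2-gke.100' -> (1, 31, 2)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically so that 1.17 ranks above 1.9 (a plain string compare would not)."""
    return parse_version(installed) >= parse_version(minimum)


def unmet_requirements(installed: dict[str, str]) -> list[str]:
    """Return the tools whose installed version is missing or below the minimum."""
    return [
        tool
        for tool, minimum in MINIMUM_VERSIONS.items()
        if tool not in installed or not meets_minimum(installed[tool], minimum)
    ]
```

Passing a dictionary such as `{"kubernetes": "1.31.1", "python": "3.10.4"}` to `unmet_requirements` returns every tool that is absent or too old, which can be attached to an approval request.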

Why this tech stack? PostgreSQL is the application’s source of truth. Redis provides fast caching and manages background job queues. Kubernetes enables reliable, scalable container orchestration. External Secrets keeps credentials in Secret Manager (your security team’s preferred location) while making them available to pods.

Technology approval processes: Many enterprises have technology review boards or approved software lists. If PostgreSQL, Kubernetes, or Terraform aren’t already approved in your environment, initiate that process early—it can take weeks.

Resource allocation

The table below lists the default resource configurations for staging and production environments. These are starting points; adjust them based on your expected workload.

Resource	Staging	Production
GKE System Pool	2x `n2-standard-4` (4 vCPU, 16GB)	2x `n2-standard-4` (4 vCPU, 16GB)
GKE Worker Pool	4x `n2-standard-8` (8 vCPU, 32GB)	4x `n2-standard-8` (8 vCPU, 32GB)
GKE Worker Autoscaling	2-8 nodes	2-8 nodes
Cloud SQL	`db-custom-4-16384` (4 vCPU, 16GB), 64GB storage	`db-custom-4-16384` (4 vCPU, 16GB), 64GB storage

Understanding resource sizing

GKE worker nodes run your application containers. More nodes = more capacity for concurrent users and evaluations. The autoscaler adds nodes during high load and removes them when idle.

GKE system pool runs Kubernetes system components (kube-dns, kube-proxy, etc.) on a fixed set of 2 nodes.

Cloud SQL for PostgreSQL stores all application data. The machine type affects query performance; storage grows as you accumulate data.

Which service is most resource-intensive? The evaluations service (confident-evals) consumes the most CPU during evaluation runs, because it processes LLM outputs and computes metrics. If evaluations are slow, scale this service before adding nodes.

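GCP CPU quotas can block deployment. GCP projects have default limits on CPUs per region and per VM family. At the default node counts, a deployment needs about 40 vCPUs of N2_CPUS quota (2×4 system + 4×8 worker). If the worker pool autoscales to its 8-node maximum, that rises to 72 vCPUs (2×4 system + 8×8 worker).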

Check your quotas before starting:

GCP Console → IAM & Admin → Quotas → Filter by “N2 CPUs” in your target region

If your limit is low, request an increase—this can take hours to days.
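The quota arithmetic can be sketched as follows, using the default pool sizes from the resource allocation table. The `NodePool` class and function names are hypothetical helpers for this example, and the quota limit is a number you read from the Quotas page yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    name: str
    vcpus_per_node: int
    default_nodes: int
    max_nodes: int


# Defaults from the resource allocation table:
# n2-standard-4 has 4 vCPUs per node, n2-standard-8 has 8.
POOLS = [
    NodePool("system", vcpus_per_node=4, default_nodes=2, max_nodes=2),
    NodePool("worker", vcpus_per_node=8, default_nodes=4, max_nodes=8),
]


def vcpus_needed(pools, at_max=False):
    """Total vCPUs at the default node counts, or at the autoscaling maximum."""
    return sum(
        p.vcpus_per_node * (p.max_nodes if at_max else p.default_nodes)
        for p in pools
    )


def quota_shortfall(pools, quota_limit, already_used=0):
    """vCPUs missing from the quota for the pools to reach their maximum (0 if enough)."""
    return max(0, vcpus_needed(pools, at_max=True) + already_used - quota_limit)


print(vcpus_needed(POOLS))               # 40 at the default node counts
print(vcpus_needed(POOLS, at_max=True))  # 72 with the worker pool at 8 nodes
```

Sizing the quota request against the autoscaling maximum rather than the default keeps the autoscaler from stalling under load; `quota_shortfall(POOLS, quota_limit=24)` returns 48, the size of the increase to request in that case.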

GCP services

The deployment provisions the following GCP services:

Service	Purpose	Why it’s needed
GCP Project	Logical container for all resources	Organizes and manages lifecycle
VPC Network	Isolated network with multiple subnets	Network isolation and security
GKE	Managed Kubernetes control plane and nodes	Runs application containers
Cloud SQL for PostgreSQL	Managed PostgreSQL instance	Stores all application data
Cloud Storage	GCS buckets for application data	File uploads, exports, and backups
Secret Manager	Secure credential storage	Stores API keys, DB passwords, secrets
Service Accounts	IAM service accounts for Workload Identity	Grants pods specific GCP permissions
Cloud NAT	Outbound internet for GKE subnet	Allows pods to call external APIs
VPC Firewall Rules	Firewall rules for GKE subnet	Controls inbound/outbound traffic
Cloud DNS Private Zone	DNS resolution for Cloud SQL	Private database connectivity
Private Service Connect	Private access to GCS	Storage traffic stays on Google backbone
Cloud Load Balancing	Network LB for NGINX Ingress	Routes traffic to services

Some organizations restrict which GCP services can be used. Organization policies or folder-level constraints may prohibit certain services. Verify the services above are allowed in your project before proceeding.

Common restrictions that cause issues:

Cloud NAT (some orgs require shared NAT infrastructure)
Secret Manager (some orgs require centrally managed vaults)
Service Account creation (some orgs require pre-provisioned identities)
External IP allocation (some orgs restrict public IPs)

Outbound network requirements

Confident AI needs to reach external services. Ensure your network allows outbound HTTPS (port 443) to:

Service	Why
`api.openai.com`	Running LLM evaluations
`*.ecr.amazonaws.com`	Pulling container images
`github.com`	ArgoCD GitOps syncing

Corporate proxies and firewalls: If your organization routes traffic through a proxy or inspects HTTPS, you may need to:

Allowlist the domains above
Configure proxy settings in the deployment
Get certificate exceptions for HTTPS inspection

Network restrictions are a common cause of deployment failures that appear as timeouts or SSL errors.
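To tell timeouts apart from TLS failures before deploying, a check along these lines can be run from a machine or pod inside the target network. It is a minimal sketch using only the Python standard library; the function names are hypothetical, and the ECR entry is omitted from the default list because it is a wildcard, so you would add the concrete registry hostname you were given.

```python
import socket
import ssl

# Hosts from the table above. The ECR entry is a wildcard, so append the
# concrete registry hostname for your deployment before relying on this list.
HOSTS = ["api.openai.com", "github.com"]


def check_https(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return 'ok' or a short description of the failure for one host."""
    try:
        with connect((host, port), timeout=timeout) as raw:
            context = ssl.create_default_context()
            # The handshake verifies the certificate chain and the hostname.
            with context.wrap_socket(raw, server_hostname=host):
                return "ok"
    except ssl.SSLError as exc:
        # Often a sign of HTTPS inspection presenting an untrusted certificate.
        return f"tls error: {exc.__class__.__name__}"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, refused connections, unreachable networks, and similar.
        return f"connection error: {exc.__class__.__name__}"


def report(hosts, check=check_https):
    """Map each host to its check result."""
    return {host: check(host) for host in hosts}
```

Calling `report(HOSTS)` returns one result per host. A `tls error` result usually points at HTTPS inspection, while `timeout` suggests a firewall or missing NAT route. This check does not cover proxy configuration, which would need to be tested separately.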

Permissions

The identity running Terraform needs the following GCP IAM roles or equivalent permissions:

Editor on the project (or a custom equivalent)
Project IAM Admin for creating IAM bindings
Secret Manager Admin for managing Secret Manager secrets
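One way to confirm these roles before running Terraform is to inspect the project IAM policy, for example the JSON produced by `gcloud projects get-iam-policy PROJECT_ID --format=json`. The sketch below assumes that JSON has already been parsed into a dictionary; the function names and the example member are hypothetical. It only sees direct project-level bindings, so roles granted through groups, custom roles, or inherited folder and organization policies will be reported as missing.

```python
# Role IDs for Editor, Project IAM Admin, and Secret Manager Admin.
REQUIRED_ROLES = {
    "roles/editor",
    "roles/resourcemanager.projectIamAdmin",
    "roles/secretmanager.admin",
}


def roles_for(policy: dict, member: str) -> set[str]:
    """Roles bound directly to `member` in a project IAM policy document."""
    return {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }


def missing_roles(policy: dict, member: str) -> list[str]:
    """Required roles that `member` does not hold directly, sorted for stable output."""
    return sorted(REQUIRED_ROLES - roles_for(policy, member))


# Hypothetical policy and member, for illustration only.
example_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:deployer@example.com"]},
        {"role": "roles/viewer", "members": ["user:auditor@example.com"]},
    ]
}
print(missing_roles(example_policy, "user:deployer@example.com"))
# ['roles/resourcemanager.projectIamAdmin', 'roles/secretmanager.admin']
```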

Terraform creates and manages:

Projects (if creating new), VPCs, subnets, firewall rules, Cloud NAT
GKE clusters and node pools
Cloud SQL instances and databases
GCS buckets and IAM
Secret Manager secrets
Service Accounts and Workload Identity bindings
IAM role bindings

Permissions are a common cause of deployment failures. Most organizations don’t grant broad permissions by default.

Options:

Use the broad roles above temporarily (Editor, Project IAM Admin, and Secret Manager Admin): Simplest for initial deployment. Restrict them after a successful deployment.
Request specific permissions: Work with your cloud security team to create a deployment service account with the permissions above.
Have a platform team deploy: If you can’t get the permissions yourself, have someone who already holds them run Terraform.

Estimated costs

GCP costs vary by region and usage. Approximate monthly costs for always-on infrastructure:

Component	Staging	Production
GKE Control Plane	$75 (Standard)	$75 (Standard)
GKE Nodes (VMs)	$800-1600	$800-1600
Cloud SQL	$200	$200
Cloud NAT	$35	$35
Storage + Data Transfer	~$10	$10-50
Load Balancer	$20	$20
Total (approx)	$1140-1940	$1140-1980

These are estimates. Actual costs depend on:

Region (us-central1 is typically cost-effective)
Usage (more evaluations = more compute = higher cost)
Data volume (Cloud SQL storage, GCS objects)
Committed-use discounts (can reduce VM costs 30-50%)

Use GCP Billing and Cost Management after deployment to track actual spending.
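For budgeting, the line items in the cost table can be summed, and a committed-use discount applied to the VM line only, with a small script like the one below. The figures are copied from the table; the function name and the `vm_discount` parameter are hypothetical, and the result is only as accurate as the estimates it adds up.

```python
# (low, high) monthly USD estimates copied from the cost table above.
COSTS = {
    "staging": {
        "GKE Control Plane": (75, 75),
        "GKE Nodes (VMs)": (800, 1600),
        "Cloud SQL": (200, 200),
        "Cloud NAT": (35, 35),
        "Storage + Data Transfer": (10, 10),
        "Load Balancer": (20, 20),
    },
    "production": {
        "GKE Control Plane": (75, 75),
        "GKE Nodes (VMs)": (800, 1600),
        "Cloud SQL": (200, 200),
        "Cloud NAT": (35, 35),
        "Storage + Data Transfer": (10, 50),
        "Load Balancer": (20, 20),
    },
}


def total_range(components, vm_discount=0.0):
    """Sum the low and high estimates, discounting only the VM line item."""
    low = high = 0.0
    for name, (item_low, item_high) in components.items():
        factor = 1.0 - vm_discount if name == "GKE Nodes (VMs)" else 1.0
        low += item_low * factor
        high += item_high * factor
    return round(low), round(high)


for env, components in COSTS.items():
    low, high = total_range(components)
    print(f"{env}: ${low}-{high} per month")
```

Passing `vm_discount=0.3` models the lower end of the committed-use discount range mentioned above.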

Pre-deployment checklist

Before proceeding to Prerequisites, verify:

Technologies listed above are approved for use
GCP services above can be provisioned (no Org Policy blocks)
Permissions available or obtainable
CPU quotas sufficient for desired node count
Outbound network access available or exceptions requested
Budget approved for estimated costs
Security team aware of deployment plan

Next steps

Once requirements are understood and approved, proceed to Prerequisites to set up your deployment environment.