Cluster Access

Overview

With infrastructure provisioned, you now need to configure access to the GKE cluster. This step covers:

Updating your kubeconfig to authenticate with GKE
Verifying cluster connectivity and node status
Confirming Terraform-deployed resources (Helm releases, service accounts, External Secrets)
Understanding private cluster access options (VPN, bastion, public API)
Granting access to additional team members

After this step, you will have working kubectl access and can verify all infrastructure components are healthy.

How GKE authentication works

GKE uses Google Cloud IAM for authentication. When you run kubectl, it:

Uses your GCP credentials to get an OAuth token
Sends the token to the GKE API server
GKE verifies the token against Cloud IAM
If authorized, your command executes

This is why you need:

Working GCP credentials (configured earlier)
Your identity to be authorized in GKE (via IAM roles)
Network connectivity to the GKE API endpoint

Terraform automatically grants you access because it creates the cluster using your credentials. The cluster creator is automatically an admin. Other team members need to be added separately (covered below).
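To make the flow above concrete, the following sketch replays it by hand instead of through kubectl. It is an illustration only, not part of the Terraform code: the function names bearer_header and list_namespaces_with_token are hypothetical, and it assumes gcloud, curl, and base64 are installed and that your machine can reach the API endpoint.

```shell
#!/usr/bin/env bash
# Sketch of the four authentication steps, done manually rather than by kubectl.
# Function names are illustrative; nothing runs until you call the function.

# Step 2 helper: format the HTTP header that carries the OAuth token.
bearer_header() {
  printf 'Authorization: Bearer %s\n' "$1"
}

# Arguments: cluster name, region, project ID.
list_namespaces_with_token() {
  local cluster="$1" region="$2" project="$3"
  local endpoint token ca_file status

  # Look up the API server address and its CA certificate (base64-encoded PEM).
  endpoint="$(gcloud container clusters describe "$cluster" \
    --region "$region" --project "$project" --format='value(endpoint)')"
  ca_file="$(mktemp)"
  gcloud container clusters describe "$cluster" \
    --region "$region" --project "$project" \
    --format='value(masterAuth.clusterCaCertificate)' | base64 --decode > "$ca_file"

  # Step 1: use your GCP credentials to get an OAuth token.
  token="$(gcloud auth print-access-token)"

  # Steps 2-4: send the token to the API server, which answers only if
  # the token is valid and the identity is authorized.
  curl --silent --show-error --cacert "$ca_file" \
    -H "$(bearer_header "$token")" \
    "https://${endpoint}/api/v1/namespaces"
  status=$?

  rm -f "$ca_file"
  return "$status"
}
```

Called with the cluster name, region, and project ID from your Terraform outputs, this should return a JSON list of namespaces when all three requirements are met. An authentication or authorization error in the response points to your identity or its IAM roles, while a timeout points to the network path.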

Configure kubectl

Update your kubeconfig file with the GKE cluster credentials:

$ gcloud container clusters get-credentials \
>   $(terraform output -raw cluster_name) \
>   --region $(terraform output -raw gcp_region) \
>   --project $(terraform output -raw gcp_project_id)

This command:

Retrieves cluster connection information from GKE
Adds a new context to your ~/.kube/config file
Configures token generation using your GCP credentials (via gke-gcloud-auth-plugin)

Expected output:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for confidentai-stage-gke.
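To confirm that kubectl now points at the new cluster, you can compare the active context with the name gcloud generates, which follows the pattern gke_PROJECT_LOCATION_CLUSTER. The sketch below is one way to do that. The function names are hypothetical, and it reads the same Terraform outputs as the command above.

```shell
#!/usr/bin/env bash
# Build the kubeconfig context name that gcloud generates for a GKE cluster.
# Arguments: project ID, location (region or zone), cluster name.
expected_gke_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# Compare the active kubectl context with the expected one.
# Must be run from the Terraform working directory.
check_current_context() {
  local expected current
  expected="$(expected_gke_context \
    "$(terraform output -raw gcp_project_id)" \
    "$(terraform output -raw gcp_region)" \
    "$(terraform output -raw cluster_name)")"
  current="$(kubectl config current-context)"
  if [ "$current" = "$expected" ]; then
    echo "OK: kubectl is using $current"
  else
    echo "Mismatch: expected $expected, got $current" >&2
    return 1
  fi
}
```

Running check_current_context after get-credentials is a quick guard against sending commands to a different cluster that is already in your kubeconfig.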

get-credentials fails with a connection or not-found error?

This usually means:

Wrong region/project: Ensure the region and project match your deployment
GKE not ready: The cluster may still be provisioning—wait a few minutes
Network issues: Your network may block HTTPS to Google APIs

Verify your configuration matches your Terraform outputs.

Verify cluster access

Test that you can communicate with the cluster:

$ kubectl get nodes

Expected output:

NAME                                                STATUS   ROLES    AGE   VERSION
gke-confidentai-stage-system-12345678-aaaa          Ready    <none>   30m   v1.31.x
gke-confidentai-stage-system-12345678-bbbb          Ready    <none>   30m   v1.31.x
gke-confidentai-stage-workers-12345678-cccc         Ready    <none>   30m   v1.31.x
gke-confidentai-stage-workers-12345678-dddd         Ready    <none>   30m   v1.31.x

You should see 2 system nodes plus your worker nodes (depending on confident_node_group_desired_size) in Ready status.
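If you would rather script this check than read the table by eye, a small filter over the kubectl output is enough. This is a minimal sketch with hypothetical function names. It treats any status other than exactly Ready (including Ready,SchedulingDisabled) as a problem.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects the output of `kubectl get nodes --no-headers` on stdin.
nodes_not_ready() {
  awk '$2 != "Ready" { print $1 }'
}

# Return non-zero if any node is not Ready.
check_nodes() {
  local bad
  bad="$(kubectl get nodes --no-headers | nodes_not_ready)"
  if [ -n "$bad" ]; then
    echo "Nodes not Ready:" >&2
    echo "$bad" >&2
    return 1
  fi
  echo "All nodes are Ready"
}
```

As an alternative, kubectl wait --for=condition=Ready nodes --all --timeout=300s blocks until every node reports Ready or the timeout expires.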

Timeout or connection refused?

This typically means the GKE API is not accessible from your network:

Unable to connect to the server: dial tcp x.x.x.x:443: i/o timeout

If confident_public_gke = false (default): The GKE API is only accessible from within the VPC (or authorized networks). You need VPN access or VPC peering to your corporate network. See “Private cluster access” below.

If confident_public_gke = true: The API should be publicly accessible. Check your authorized networks and connectivity.

Check system pods

Verify core Kubernetes components are running:

$ kubectl get pods -n kube-system

You should see pods for:

kube-dns — DNS resolution within the cluster
kube-proxy — Network routing
gke-metadata-server — Workload Identity metadata server

All pods should be Running with all containers ready.
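The "Running with all containers ready" rule can be checked mechanically by comparing the two halves of the READY column. The sketch below assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print pods that are not healthy, as "name ready status".
# Expects the output of `kubectl get pods -n kube-system --no-headers` on stdin.
pods_not_healthy() {
  awk '{
    split($2, r, "/")                    # READY is "ready/total"
    if ($3 == "Completed") next          # finished Job pods are fine
    if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
  }'
}
```

For example, kubectl get pods -n kube-system --no-headers | pods_not_healthy prints nothing when every system pod is healthy.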

Verify Terraform-deployed resources

Terraform deployed several Kubernetes resources. Let’s verify they’re working correctly.

Helm releases

Check that all Helm charts installed successfully:

$ helm list -A

Name                  Namespace                    Expected Status
ingress-nginx         ingress-nginx                deployed
external-secrets      confident-ai                 deployed
argocd                argocd                       deployed
cert-manager          cert-manager                 deployed
clickhouse-operator   clickhouse-operator-system   deployed
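To check the list without comparing rows by hand, you can feed the names of deployed releases into a small filter that reports any expected release that is missing. This is a sketch: the release names are copied from the table above, and the variable and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Release names taken from the expected-status table.
EXPECTED_RELEASES="ingress-nginx external-secrets argocd cert-manager clickhouse-operator"

# Print every expected release that is absent from the names given on stdin.
# Expects one release name per line, e.g. from `helm list -A --deployed -q`.
missing_releases() {
  local deployed name
  deployed="$(cat)"
  for name in $EXPECTED_RELEASES; do
    # -x matches the whole line, -F treats the name as a literal string
    printf '%s\n' "$deployed" | grep -qxF -- "$name" || echo "$name"
  done
}
```

Running helm list -A --deployed -q | missing_releases prints nothing when all five charts are in the deployed state. Any name it prints is either not installed or in another state such as failed or pending-install.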

Helm release shows “failed” or “pending-install”?

This can happen when GKE was not fully ready during the first apply. Re-running Terraform usually fixes it:

$ terraform apply

Terraform will retry the failed Helm installations.

Confident AI namespace

Verify the namespace exists:

$ kubectl get namespace confident-ai

Service accounts

Check that the required service accounts are created:

$ kubectl get serviceaccounts -n confident-ai

Expected service accounts:

Service Account        Purpose
confident-storage-sa   Allows pods to access GCS buckets via Workload Identity
external-secrets-sa    Allows External Secrets Operator to read from Secret Manager
ecr-credentials-sync   Used by the ECR credential rotation CronJob

Why service accounts? GCP Workload Identity maps a Kubernetes service account to a GCP identity, which gives pods fine-grained GCP permissions. Instead of giving the whole cluster access to GCS, only pods using confident-storage-sa can access the buckets. This follows the principle of least privilege.

External Secrets

External Secrets Operator syncs credentials from Google Secret Manager into Kubernetes secrets. Verify it’s working:

$ kubectl get clustersecretstore

Expected:

NAME                           AGE   STATUS   CAPABILITIES   READY
confident-clustersecretstore   30m   Valid    ReadWrite      True

Check the ExternalSecret:

$ kubectl get externalsecret -n confident-ai

Expected status: SecretSynced

NAME                       STORE                          REFRESH   STATUS
confident-externalsecret   confident-clustersecretstore   1h        SecretSynced

ExternalSecret shows “SecretSyncedError”?

This means it couldn’t read from Secret Manager. Common causes:

Permissions: The external-secrets-sa GCP service account may lack the Secret Manager Secret Accessor role (roles/secretmanager.secretAccessor)
VPC Service Controls: A perimeter may be blocking access to Secret Manager
Secret name mismatch: The ExternalSecret is looking for secrets that don’t exist in Secret Manager

Check the error details:

$ kubectl describe externalsecret confident-externalsecret -n confident-ai

Private cluster access

By default (confident_public_gke = false), the GKE API server is only accessible from within the VPC (or authorized networks). This is a security best practice—it prevents unauthorized access from the internet.

To access a private cluster, you need network connectivity to the VPC.

Option A: VPN to your corporate network (recommended)

If your organization has VPN connectivity to GCP (via Cloud Interconnect, Cloud VPN, or Network Connectivity Center):

Connect to your corporate VPN
Ensure the VPN routes include the Confident AI VPC address range
Run kubectl commands normally

This is the recommended approach for production because it uses your existing network security infrastructure.

VPN routing must include the GKE VPC. If you configured a custom address space (e.g., 10.0.0.0/16) in Prerequisites, ensure your VPN routes include it. Work with your network team to add the route if needed.

Option B: Bastion / Jump box

If you don’t have existing VPC connectivity, you can use a Compute Engine VM within the VPC as a jump box:

Create a VM in the Confident AI VPC
SSH into the VM (via IAP for additional security)
Install kubectl and gcloud CLI on the VM
Run kubectl commands from the VM

The Terraform code includes a commented-out bastion configuration in bastion.tf that you can enable as a starting point.

Option C: Enable public API (not recommended for production)

If you’re just testing, you can enable public API access by setting confident_public_gke = true in your tfvars and re-running Terraform. This makes the GKE API accessible from the internet (subject to authorized networks).

A public GKE API endpoint is a security risk. Although requests are authenticated by Cloud IAM, a publicly reachable endpoint increases your attack surface. Use this only for temporary testing, never for production.

Grant access to team members

The person who ran Terraform is automatically a GKE admin. To grant access to other team members:

Using Google Groups (recommended)

Add Google Group emails to your tfvars:

confident_gke_admin_group_emails = ["gke-admins@yourdomain.com"]

Then re-run terraform apply. This grants cluster admin access to all members of that Google Group.

Using gcloud CLI

For individual users:

$ gcloud projects add-iam-policy-binding "<your-project-id>" \
>   --member="user:teammate@yourdomain.com" \
>   --role="roles/container.admin"

IAM bindings require the identity to exist in Cloud Identity. If you get errors, verify the user or group email is correct and exists in your organization.
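After adding a binding, you may want to confirm that it took effect. The sketch below lists the project-level roles held by one member, using the flatten and filter options of gcloud projects get-iam-policy. The function names iam_member and roles_for_member are hypothetical.

```shell
#!/usr/bin/env bash
# Build the IAM member string for a user or group email.
# Arguments: kind (user or group), email address.
iam_member() {
  case "$1" in
    user|group) printf '%s:%s\n' "$1" "$2" ;;
    *) echo "unknown member kind: $1" >&2; return 1 ;;
  esac
}

# List the roles granted to a member on a project, one per line.
# Arguments: project ID, member string (for example, from iam_member).
roles_for_member() {
  local project="$1" member="$2"
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.members:$member" \
    --format="value(bindings.role)"
}
```

For example, roles_for_member "<your-project-id>" "$(iam_member user teammate@yourdomain.com)" should include roles/container.admin once the binding above has been applied.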

ArgoCD access

ArgoCD is deployed for GitOps-based deployments. You can access it once your network has connectivity to the cluster:

$ terraform output argocd_server_url

Log in with the username admin and the argocd_admin_password you configured.

ArgoCD runs inside the cluster behind an internal GCP Load Balancer, so it’s only accessible via the internal network. You’ll need VPN connectivity to access the dashboard.

Troubleshooting

“You must be logged in to the server (Unauthorized)”

error: You must be logged in to the server (Unauthorized)

Your GCP identity isn’t authorized to access the cluster:

Verify your credentials: gcloud auth list
Check you’re using the same identity that ran Terraform
If using a different identity, have an admin add you (see above)

“Unable to connect to the server: dial tcp: i/o timeout”

You have no network path to the GKE API:

For private clusters, ensure you’re connected to VPN
Verify the VPN routes include the VPC address range
Check no firewall is blocking HTTPS (port 443) to Google APIs
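To separate a network problem from an authentication problem, you can test whether a plain TCP connection to the API server succeeds. The sketch below reads the server address from your current kubeconfig context and probes port 443. It assumes bash (for /dev/tcp) and the timeout utility from GNU coreutils, which is not installed by default on macOS. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the host from a URL such as https://10.0.0.2 or https://10.0.0.2:443/path.
host_from_url() {
  local rest="${1#*://}"        # drop the scheme
  rest="${rest%%/*}"            # drop any path
  printf '%s\n' "${rest%%:*}"   # drop any port
}

# Try to open a TCP connection to the API server of the current context.
# Always probes port 443, the port shown in the timeout error above.
probe_api_server() {
  local server host
  server="$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"
  host="$(host_from_url "$server")"
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/443" 2>/dev/null; then
    echo "TCP 443 reachable on $host"
  else
    echo "Cannot reach $host:443 (check VPN, routes, and firewalls)" >&2
    return 1
  fi
}
```

If probe_api_server succeeds but kubectl still fails, the network path is fine and the cause is more likely credentials or authorization.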

Nodes show “NotReady”

Nodes take a few minutes to fully initialize. Wait 2-3 minutes after the cluster is created. If they stay NotReady:

$ kubectl describe node <node-name>

Look at the “Conditions” section for clues. Common causes:

VPC-native networking not configured correctly
Node can’t reach the GKE API
Node VM has insufficient resources

Next steps

With cluster access configured, proceed to Kubernetes Deployment to deploy the Confident AI application services.
