Automating Cloud Run Deployments with GitHub Actions and Terraform

Automating Cloud Run deployments is best achieved by using Terraform for infrastructure management and GitHub Actions for the CI/CD pipeline. This approach eliminates configuration drift by using Git as the single source of truth and secures the process through Workload Identity Federation. By transitioning to this automated model, teams can reduce deployment times to under five minutes while ensuring every change is audited and reversible.

Three weeks ago, I broke the production environment for my FastAPI backend at 4:45 PM on a Friday. It wasn't a complex logic bug or a database deadlock. I had simply run gcloud run deploy from my local terminal and forgotten to include a new environment variable required for our Gemini API integration. Because I was bypassing a formal CI/CD pipeline, there was no validation, no peer review of the infrastructure change, and no easy way to roll back without manually hunting through my command history.

That 15-minute outage cost us about $400 in lost API credits and a lot of personal stress. It was the "final straw" moment that pushed me to stop treating my infrastructure as a side project and start treating it as code. I spent the following weekend migrating everything to a fully automated pipeline using Terraform for infrastructure management and GitHub Actions for deployment. This post documents the exact setup I used, the "chicken-and-egg" problem I encountered with container images, and how I secured the whole thing without ever touching a long-lived Service Account key.

Why Manual Cloud Run Deployments Create Production Risks

Manual deployments via the CLI lack a formal audit trail and frequently lead to configuration drift between a developer's local environment and the Google Cloud Console. When you're first building a service, the gcloud CLI feels like magic. You run one command, and your code is live. But as my system grew—specifically after I started optimizing complex AI agent workflows that required specific memory limits and environment configurations—the manual approach became a liability.

My manual process had three major flaws:

  • Configuration Drift: My local environment variables didn't match what was in the Google Cloud Console.
  • IAM Complexity: I was constantly tweaking permissions in the UI, leading to "Permission Denied" errors that took hours to debug.
  • Lack of Audit Trail: I couldn't tell who changed the CPU allocation or when the last successful deploy happened without digging through Cloud Logging.

I decided to move to a "GitOps" model where the GitHub repository is the single source of truth. If it’s not in the main branch, it doesn’t exist in production.

How to Define Cloud Run Infrastructure Using Terraform

Terraform lets you define your entire Cloud Run service, including environment variables and resource limits, as version-controlled code. I started with Terraform because I wanted not just the service but also its IAM roles and its dependencies (like Artifact Registry and GCS buckets) under version control. Here is the core main.tf structure I landed on. One thing I learned the hard way: if you need newer Cloud Run features, check whether they are only exposed through the google-beta provider. In my case that meant things like sidecars and specific VPC egress settings.


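resource "google_cloud_run_v2_service" "api_service" {
  name     = "fastapi-backend"
  location = "us-central1"
  ingress  = "INGRESS_TRAFFIC_ALL" # accept requests from the public internet

  template {
    containers {
      # Pin the image to the tag passed in by the pipeline (the commit SHA) instead of :latest
      image = "us-central1-docker.pkg.dev/${var.project_id}/my-repo/api-image:${var.image_tag}"

      # Per-instance resource limits
      resources {
        limits = {
          cpu    = "2"
          memory = "4Gi"
        }
      }

      # Plain environment variable, supplied as a Terraform variable
      env {
        name  = "DATABASE_URL"
        value = var.database_url
      }

      # Secret resolved from Secret Manager at runtime, never stored in the service definition
      env {
        name = "GEMINI_API_KEY"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.gemini_key.secret_id
            version = "latest"
          }
        }
      }
    }

    # Keep one warm instance and cap scale-out at ten
    scaling {
      max_instance_count = 10
      min_instance_count = 1
    }
  }

  # Route all traffic to the newest ready revision
  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}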

The most important part of this configuration is the secret_key_ref. Hardcoding secrets in environment variables is a security risk that exposes sensitive data. By referencing Secret Manager, I ensure that the sensitive keys are only injected at runtime. However, this introduced my first hurdle: Terraform needs the container image to exist before it can create the Cloud Run service. This leads to what I call the "Chicken-and-Egg" problem.
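The secret_key_ref in the service definition points at google_secret_manager_secret.gemini_key, which the snippet does not define. Here is a minimal sketch of what that resource might look like. The secret ID (gemini-api-key) and the variable runtime_service_account_email are assumptions for illustration, not values from my actual setup.

```hcl
# Hypothetical variable: email of the service account the Cloud Run revision runs as.
variable "runtime_service_account_email" {
  type = string
}

# The secret container only. The secret value is added outside Terraform
# so that it never ends up in the state file.
resource "google_secret_manager_secret" "gemini_key" {
  secret_id = "gemini-api-key" # hypothetical name

  replication {
    auto {} # provider 5.x syntax; 4.x used `automatic = true` instead
  }
}

# Let the runtime service account read the secret when the container starts.
resource "google_secret_manager_secret_iam_member" "runtime_reads_gemini_key" {
  secret_id = google_secret_manager_secret.gemini_key.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${var.runtime_service_account_email}"
}
```

With this layout, the key itself would be uploaded once by hand, for example with gcloud secrets versions add gemini-api-key --data-file=-, and Terraform only ever sees the reference.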

Solving the Container Image Chicken-and-Egg Problem

The "chicken-and-egg" problem occurs because Terraform cannot deploy a Cloud Run service if the referenced container image does not yet exist in the registry. If you run terraform apply on a fresh project, it will fail because the Docker image defined in the image field doesn't exist yet in the Artifact Registry. But you can't push the image to the registry until the registry is created by Terraform.

There are two ways around this: split the Terraform configuration into two phases (registry first, service second), or start from a "bootstrap" image. I prefer the bootstrap method. Once the registry existed, I manually pushed a tiny "hello world" Python image to it. This allows Terraform to create the service. Subsequent GitHub Actions runs then build the real application image and update the Cloud Run service to point to it.

Another option is to add a lifecycle { ignore_changes = [template[0].containers[0].image] } block to the service resource. This tells Terraform to manage the infrastructure but leave the image reference alone, which means the CI/CD pipeline has to set the image itself (for example with gcloud run deploy --image) rather than through terraform apply. I found this much cleaner for long-term maintenance of Cloud Run deployments.
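As a rough sketch, the ignore_changes variant of the service could look like the block below. It is an alternative form of the same resource, not an additional one. The placeholder image is Google's public Cloud Run sample container, which I am assuming here as a stand-in for the hand-pushed bootstrap image.

```hcl
resource "google_cloud_run_v2_service" "api_service" {
  name     = "fastapi-backend"
  location = "us-central1"

  template {
    containers {
      # Placeholder used only for the very first apply. Assumption: Google's
      # public sample image stands in for a hand-pushed bootstrap image.
      image = "us-docker.pkg.dev/cloudrun/container/hello"
    }
  }

  lifecycle {
    # After creation, Terraform no longer reports or reverts image changes.
    # Whatever deploys new images (the CI pipeline) owns this field.
    ignore_changes = [template[0].containers[0].image]
  }
}
```

The trade-off is that terraform plan stops showing which image is live, so the pipeline's deploy step becomes the only record of it.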

Configuring GitHub Actions as a Secure Deployment Engine

GitHub Actions should be configured to use Workload Identity Federation to eliminate the need for long-lived, insecure Service Account JSON keys. My GitHub Actions workflow has two main jobs: Build and Deploy. I wanted to ensure that we only deploy if the Docker build succeeds and all tests pass. This is where I integrated my Python task queue testing to ensure that our Celery workers wouldn't fail on the new deployment.

Securing the Pipeline with Workload Identity Federation

I refuse to use JSON service account keys in GitHub Secrets. They are a security nightmare. If a key is leaked, it stays a backdoor until someone manually rotates it. Instead, I used Workload Identity Federation (WIF). This allows GitHub Actions to impersonate a Google Cloud Service Account using short-lived tokens.

The setup is a bit verbose, but the security payoff is worth it. You have to create a Workload Identity Pool and Provider, then grant the GitHub repository permission to act as a specific Service Account. You can find the full setup guide in the official google-github-actions/auth documentation.
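The pool, provider, and impersonation grant can themselves be expressed in Terraform. The following is a sketch under assumptions: the pool and provider IDs mirror the placeholders in my workflow, and my-org/my-repo is a hypothetical repository name you would replace with your own.

```hcl
# Service account that GitHub Actions will impersonate.
resource "google_service_account" "deployer" {
  account_id   = "github-actions-deployer"
  display_name = "GitHub Actions deployer"
}

resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "my-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "my-provider"

  # Map claims from GitHub's OIDC token to Google attributes.
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Reject tokens from any repository other than this one (hypothetical name).
  attribute_condition = "assertion.repository == \"my-org/my-repo\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Allow workflows from that repository to impersonate the deployer account.
resource "google_service_account_iam_member" "github_can_impersonate" {
  service_account_id = google_service_account.deployer.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my-org/my-repo"
}
```

The attribute_condition is the part I would not skip: without some restriction, any GitHub repository could present a token to the provider.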

Here is the YAML snippet for the authentication step in my workflow:


jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: 'read'
      # Required so the job can request a GitHub OIDC token for WIF
      id-token: 'write'

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    # Exchanges the GitHub OIDC token for short-lived Google credentials
    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v2
      with:
        workload_identity_provider: 'projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider'
        service_account: 'github-actions-deployer@my-project.iam.gserviceaccount.com'

    # Installs gcloud and picks up the credentials from the previous step
    - name: Set up Cloud SDK
      uses: google-github-actions/setup-gcloud@v2

The Build and Push Step

Once authenticated, the runner builds the Docker image. I use the commit SHA as the image tag. Using :latest in production is a recipe for disaster because it makes rollbacks nearly impossible. By tagging with the SHA, I have an immutable record of exactly what code is running in which container.


    - name: Build and Push Container
      env:
        IMAGE: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/my-repo/api-image:${{ github.sha }}
      run: |-
        # Register gcloud as the Docker credential helper for this registry host
        gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
        # Tag with the commit SHA so every image maps to exactly one commit
        docker build -t "$IMAGE" .
        docker push "$IMAGE"

The Terraform Apply Step

Finally, the workflow runs Terraform. I pass the new image tag as a variable. This ensures that Terraform updates the Cloud Run service definition with the exact image we just built.


    - name: Terraform Apply
      run: |-
        terraform init -input=false
        # Pass the commit SHA so the service points at the image built above
        terraform apply -input=false -auto-approve -var="image_tag=${{ github.sha }}"
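On the Terraform side, the image_tag passed with -var needs a matching variable declaration. A minimal sketch, assuming the same repository path as the service definition, might look like this. The validation rule is my own addition for illustration, and project_id should only be declared here if it is not already declared elsewhere in the module.

```hcl
variable "project_id" {
  type = string
}

variable "image_tag" {
  type        = string
  description = "Container image tag to deploy; the pipeline passes the Git commit SHA."

  # Assumption: only full 40-character SHAs are accepted, which also
  # blocks an accidental deploy of a mutable tag such as "latest".
  validation {
    condition     = can(regex("^[0-9a-f]{40}$", var.image_tag))
    error_message = "image_tag must be a full 40-character Git commit SHA."
  }
}

locals {
  # Full image reference built from the registry path and the tag.
  api_image = "us-central1-docker.pkg.dev/${var.project_id}/my-repo/api-image:${var.image_tag}"
}
```

The containers block can then use image = local.api_image, and rolling back becomes a matter of re-running the apply with an older SHA.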

Analyzing Deployment Performance and Cost Efficiency

Moving to automated Cloud Run deployments cut our total deployment time from about 12 minutes to 4 minutes and 22 seconds. The 12-minute manual figure includes the time it took me to remember the commands and wait for the upload. Automation didn't just make my life easier; it also gave me data on how our deployments perform.

Here is a breakdown of the time spent:

| Step | Duration | Notes |
| --- | --- | --- |
| Environment Setup | 25s | Installing Terraform and the gcloud CLI |
| Docker Build | 1m 45s | Using layer caching significantly helps here |
| Push to Artifact Registry | 50s | Internal GCP network speeds are fast |
| Terraform Apply | 1m 22s | Cloud Run revision creation and traffic rollout |

From a cost perspective, GitHub Actions' free tier covers almost all our usage. The main cost increase came from Artifact Registry storage. Because every commit now produces an image tagged with its own SHA, I was storing gigabytes of old images. I had to implement a Cleanup Policy in Artifact Registry to delete any image older than 30 days that isn't tagged as 'production'. This saved us about $15/month in storage fees.
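The cleanup rule described above can be written into the repository resource itself. This is a sketch, assuming the repository is named my-repo as in the image paths and that production images carry a tag starting with "production". Depending on your provider version, cleanup_policies may require the google-beta provider.

```hcl
resource "google_artifact_registry_repository" "api_images" {
  location      = "us-central1"
  repository_id = "my-repo"
  format        = "DOCKER"

  # Keep anything tagged production*, regardless of age.
  # Keep policies take precedence over delete policies.
  cleanup_policies {
    id     = "keep-production"
    action = "KEEP"
    condition {
      tag_state    = "TAGGED"
      tag_prefixes = ["production"]
    }
  }

  # Delete every other image version older than 30 days (30 * 86400 seconds).
  cleanup_policies {
    id     = "delete-older-than-30-days"
    action = "DELETE"
    condition {
      tag_state  = "ANY"
      older_than = "2592000s"
    }
  }
}
```

The resource also has a cleanup_policy_dry_run argument, which is worth enabling first to see what would be deleted before the policy runs for real.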

Resolving IAM Permission Denied Errors in CI/CD

The most frustrating part of this migration was IAM. For the GitHub Action to run terraform apply, the service account needs a surprisingly broad set of permissions. In particular, it requires the roles/iam.serviceAccountUser role on the Cloud Run runtime service account. Initially, I tried to follow least privilege and only gave it roles/run.admin. It failed immediately.

I realized that deploying a Cloud Run revision isn't only a Cloud Run operation: the deployer also has to be allowed to act as the service account the revision runs under (the iam.serviceAccounts.actAs permission). Granting roles/iam.serviceAccountUser on the runtime service account provides that. Without it, you'll get a cryptic error saying "Permission denied on resource '...'" even though your Cloud Run permissions are correct.

My final IAM list for the GitHub Deployer service account was:

  • Cloud Run Admin: To manage the service.
  • Artifact Registry Writer: To push images.
  • Storage Admin: To manage the Terraform state file in the GCS bucket.
  • Service Account User: To assign the runtime service account to the Cloud Run service.
  • Secret Manager Secret Accessor: Only if Terraform needs to read secret values during the plan phase.
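The role list above could be captured in Terraform roughly as follows. This is a sketch: the two service account email variables are hypothetical names, project_id should only be declared if it is not already declared elsewhere, and the grants have to be applied by someone with IAM admin rights, since the deployer cannot grant roles to itself.

```hcl
variable "project_id" {
  type = string
}

# Hypothetical variables for the two service accounts involved.
variable "deployer_service_account_email" {
  type = string
}

variable "runtime_service_account_email" {
  type = string
}

locals {
  # Project-level roles from the list above. roles/storage.admin is broad;
  # roles/storage.objectAdmin on the state bucket alone may be sufficient.
  deployer_project_roles = [
    "roles/run.admin",
    "roles/artifactregistry.writer",
    "roles/storage.admin",
  ]
}

resource "google_project_iam_member" "deployer" {
  for_each = toset(local.deployer_project_roles)

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${var.deployer_service_account_email}"
}

# Scoped to the runtime service account only, not the whole project:
# lets the deployer attach that account to Cloud Run revisions.
resource "google_service_account_iam_member" "deployer_acts_as_runtime" {
  service_account_id = "projects/${var.project_id}/serviceAccounts/${var.runtime_service_account_email}"
  role               = "roles/iam.serviceAccountUser"
  member             = "serviceAccount:${var.deployer_service_account_email}"
}
```

Granting Service Account User on the single runtime account, rather than at project level, keeps the deployer from acting as any other service account in the project.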

Key Takeaways for Robust Cloud Run Deployments

The transition from manual to automated Cloud Run deployments ensures that every production change is traceable, secure, and easily reproducible. The initial setup friction is always worth the long-term stability. Here are the core lessons I took away:

  • Immutable Tags are Non-negotiable: Use the Git SHA for your container tags. It makes debugging production issues far easier because you can trace every running container back to a specific commit.
  • Workload Identity Federation is the Standard: Stop using JSON keys. The setup is harder, but there are no long-lived credentials left to leak.
  • Terraform State Management: Always use a remote backend (like a GCS bucket) for your Terraform state. I almost lost my local state file during a laptop migration, which would have made managing existing resources a nightmare.
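  • Layer Caching Matters: Optimize your Dockerfile. Copy your dependency manifest (requirements.txt or package.json) and run pip install or npm install before copying the rest of your source code, so the dependency layer is only rebuilt when the dependencies change. This reduced my build time from 5 minutes to under 2 minutes.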
  • Cleanup Policies: Automated deployments create a lot of "garbage" images. Set up automated deletion policies early before the storage costs creep up.
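For the remote state takeaway, the backend configuration is only a few lines. This is a sketch with a hypothetical bucket name and prefix. The bucket has to exist before terraform init runs, and backend blocks cannot reference variables, so the values are literal.

```hcl
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket" # hypothetical; must already exist
    prefix = "cloud-run/fastapi-backend" # path within the bucket for this state
  }
}
```

Enabling object versioning on that bucket gives you a way to recover from a corrupted or accidentally overwritten state file.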

Looking ahead, my next challenge is implementing canary deployments. Currently, Terraform flips 100% of the traffic to the new revision immediately. While this is better than manual deploys, I want to move toward a model where 5% of traffic hits the new version first, allowing us to monitor for 500 errors before the full rollout. Cloud Run supports this natively through percentage-based traffic splitting between revisions (with revision tags for reaching a specific revision directly), but orchestrating it through Terraform and GitHub Actions requires more sophisticated state-machine logic that I'm currently prototyping.
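To make the 95/5 idea concrete, here is one possible shape for the traffic configuration. It is a sketch of the direction, not a working pipeline: the variables image and stable_revision are hypothetical, and something outside Terraform would still have to decide when to promote the new revision to 100%.

```hcl
# Hypothetical inputs: the full image reference to deploy, and the name of
# the revision currently known to be healthy.
variable "image" {
  type = string
}

variable "stable_revision" {
  type = string
}

resource "google_cloud_run_v2_service" "api_service" {
  name     = "fastapi-backend"
  location = "us-central1"

  template {
    containers {
      image = var.image
    }
  }

  # Most traffic stays pinned to the known-good revision.
  traffic {
    type     = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
    revision = var.stable_revision
    percent  = 95
  }

  # A small slice goes to the newest revision created by this apply.
  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 5
  }
}
```

Promotion would then be a second apply that returns to a single LATEST block at 100%, which is the part that needs the extra orchestration.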
