How to Reduce Cloud Run Costs by 40% with Resource Tuning

To reduce Cloud Run costs, tune concurrency to the workload, use 'CPU always allocated' for services with consistent traffic, and set runtime-specific memory limits such as GOMEMLIMIT. These adjustments cut billable execution time and prevent the cold starts that follow container crashes.

I woke up on the first of the month to a billing alert that made my stomach drop: $1,442.80. My expected budget for my side project—a document processing engine built on Gemini—was $400. Somewhere between my last deployment and the weekend traffic spike, I had managed to burn through a thousand dollars of cloud credits and actual cash. I spent the next 72 hours staring at Cloud Monitoring dashboards, realizing that my "serverless" dream had become a financial nightmare because I relied on default configurations.

The service in question was the core component of my Data Extraction Pipeline using Gemini Function Calling. It’s a Go-based service running on Cloud Run that ingests large PDFs, sends them to the Gemini API, and processes the structured output. It’s CPU-intensive, memory-hungry, and highly sensitive to latency. When traffic scaled from 10 to 100 concurrent users, Cloud Run did exactly what it was told to do: it scaled out. But it scaled inefficiently, spinning up hundreds of containers that were mostly idling or fighting for CPU cycles.

In this post, I’m going to walk through the exact steps I took to bring that bill down to $860 without sacrificing a single millisecond of user-facing performance. If you are running Go or Python services on Cloud Run and seeing your costs creep up, you are likely falling into the same "concurrency trap" I did.

Why Do Default Concurrency Settings Increase Cloud Run Costs?

Default concurrency settings often lead to inefficient resource usage and higher Cloud Run costs for compute-heavy applications. When you create a new Cloud Run service, Google defaults your concurrency to 80. This sounds great on paper—one container handling 80 requests at once! But for a Go service performing heavy JSON parsing and TLS handshakes with external APIs, 80 concurrent requests on a single vCPU is a recipe for disaster. I noticed my "Container CPU Utilization" was hovering around 15%, yet my "Billable Instance Time" was through the roof. Why?

The problem was the "CPU Throttling" that happens outside of request processing. By default, Cloud Run only allocates CPU during request processing. If your Go runtime is trying to perform garbage collection (GC) or manage internal state between requests, and the CPU is throttled, those tasks take longer. This leads to longer request tail latencies, which keeps the container "active" longer, which costs you more money. I was paying for containers that were essentially struggling to breathe.

Step 1: Finding the Sweet Spot with k6 Benchmarking

I stopped guessing and started measuring. I used k6 to run load tests against a staging environment with different CPU and memory configurations. My goal was to find the point where I maximized throughput per dollar, not just throughput. I realized that for my specific workload, 1 vCPU and 2GB of RAM was the "danger zone"—the CPU would hit 100% and the Go scheduler would start thrashing.

// My k6 test script to find the breaking point of each configuration
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 }, // ramp up to 50 virtual users
    { duration: '5m', target: 50 }, // hold at 50
    { duration: '2m', target: 0 },  // ramp down
  ],
};

export default function () {
  // /health only proves the instance is up; point this at an endpoint that
  // exercises the real workload when measuring throughput per dollar.
  const res = http.get('https://my-service-staging.a.run.app/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // pause 1s between iterations for each virtual user
}

After running this against several configurations, I found that bumping the instance to 2 vCPUs and 4GB of RAM, while reducing concurrency to 15, actually made the service cheaper. Why? Because each request finished 30% faster, which meant fewer billable instance-seconds overall. On Cloud Run you pay for billable instance time, so a higher per-second rate still produces a lower bill whenever total billable seconds fall by a larger factor than the rate rises.

How Does Switching to CPU Always Allocated Reduce Cloud Run Costs?

Switching to 'CPU always allocated' can reduce Cloud Run costs by providing a 25% pricing discount and preventing performance-degrading CPU throttling. This was the most counter-intuitive change I made. I switched from "CPU only allocated during requests" to "CPU always allocated." Usually, people think "Always Allocated" is more expensive because you pay for the instance even when it's idle. However, for a service with steady baseline traffic, the math changes completely.

When CPU is always allocated, you get a significant discount on the per-second rate (roughly 25% lower for the "always on" portion). More importantly, the Go runtime can perform background tasks like GC and keep-alives without being throttled. I saw my P99 latency drop from 4.2s to 2.8s just by making this switch. Because the requests finished faster, I needed fewer total instances to handle the same load.
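The link between latency and instance count follows from Little's law: requests in flight equal the arrival rate multiplied by the time each request spends in the system. Here is a minimal Go sketch of that estimate. The request rate and concurrency are hypothetical values, and the P99 figures stand in for average latency, so the absolute numbers overstate a real deployment. Only the ratio between the two results is meaningful.

```go
package main

import (
	"fmt"
	"math"
)

// instancesNeeded applies Little's law: requests in flight = arrival rate x
// time in system. Dividing by per-instance concurrency gives a lower bound on
// the number of instances. concurrency must be positive.
func instancesNeeded(rps, latencySeconds float64, concurrency int) int {
	inFlight := rps * latencySeconds
	return int(math.Ceil(inFlight / float64(concurrency)))
}

func main() {
	const rps = 50.0       // hypothetical steady request rate
	const concurrency = 16 // hypothetical per-instance concurrency

	for _, latency := range []float64{4.2, 2.8} {
		fmt.Printf("latency %.1fs -> at least %d instances\n",
			latency, instancesNeeded(rps, latency, concurrency))
	}
}
```

The estimate is a floor, not a prediction: the autoscaler also reacts to CPU utilization and keeps headroom, so real instance counts tend to be higher.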

If you're curious about the specific pricing tiers, check the official Cloud Run pricing page. The key is to calculate your "Idle vs. Active" ratio. If your instances are active more than 60% of the time, "CPU Always Allocated" is almost always cheaper.
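To make the "Idle vs. Active" calculation concrete, here is a rough Go sketch comparing the two billing modes. Everything in it is an assumption for illustration: the prices are placeholders to be replaced with current values from the pricing page, the request count is hypothetical, and the model ignores free tiers and anything billed for idle minimum instances.

```go
package main

import "fmt"

// rates holds placeholder prices; substitute current values from the pricing page.
type rates struct {
	perInstanceSecond  float64 // request-based price for one instance-second
	alwaysOnDiscount   float64 // fractional discount for CPU always allocated
	perMillionRequests float64 // request fee, charged only in request-based mode
}

// requestBasedCost bills only the seconds an instance is handling requests,
// plus a per-request fee.
func requestBasedCost(instanceSeconds, activeFraction, requests float64, r rates) float64 {
	return instanceSeconds*activeFraction*r.perInstanceSecond +
		requests/1e6*r.perMillionRequests
}

// instanceBasedCost bills every second of instance lifetime at the discounted rate.
func instanceBasedCost(instanceSeconds float64, r rates) float64 {
	return instanceSeconds * r.perInstanceSecond * (1 - r.alwaysOnDiscount)
}

func main() {
	// Placeholder prices, not quotes from the pricing page.
	r := rates{perInstanceSecond: 0.000058, alwaysOnDiscount: 0.25, perMillionRequests: 0.40}
	const instanceSeconds = 3 * 30 * 24 * 3600.0 // three instances for a 30-day month
	const requests = 5e6                         // hypothetical monthly request count

	for _, active := range []float64{0.5, 0.6, 0.75, 0.9} {
		fmt.Printf("active %.0f%%: request-based $%.2f, always-allocated $%.2f\n",
			active*100, requestBasedCost(instanceSeconds, active, requests, r),
			instanceBasedCost(instanceSeconds, r))
	}
}
```

In this simplified model, a 25% discount with no request fee puts the crossover at 75% active time. Request fees, and requests that finish faster once CPU is no longer throttled, can pull the practical break-even lower, so it is worth running the numbers with your own traffic.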

How to Optimize Go Memory Limits to Prevent Costly Restarts?

Configuring GOMEMLIMIT is essential to prevent Out of Memory (OOM) kills that trigger expensive container cold starts. Another silent killer was the OOM kill cycle. In my previous post about reducing Go Cloud Run cold starts, I touched on binary size, but I didn't talk about runtime memory management. Cloud Run containers are hard-limited. If you set a 512MB limit and Go's heap grows to 513MB, the container dies instantly. This triggers a cold start for the next request, which is slow and expensive.

I implemented the GOMEMLIMIT environment variable, which was introduced in Go 1.19. It sets a soft limit: the Go runtime collects garbage more aggressively as its total memory approaches that value, rather than blindly growing the heap until the OS kills it.

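# In my Dockerfile (as ENV lines) or Cloud Run env vars.
# 1800MiB is just under 90% of a 2 GiB instance; scale it with the memory limit.
GOMEMLIMIT=1800MiB
# GOGC=100 is the Go default, stated here explicitly.
GOGC=100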

By setting GOMEMLIMIT to about 90% of my container's RAM, I effectively eliminated OOM kills. My "Instance Count" graph smoothed out significantly—instead of a jagged saw-tooth pattern of crashes and restarts, I had a stable line of healthy containers.
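The same limit can also be applied from code with `debug.SetMemoryLimit` from the standard `runtime/debug` package, which is the programmatic counterpart of GOMEMLIMIT. Below is a minimal sketch. It assumes the container limit is known at build time (hard-coded here as 4 GiB), and `softLimitBytes` is a hypothetical helper, not part of the service described in this post.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// softLimitBytes returns a soft memory limit that leaves headroom below the
// container's hard limit for goroutine stacks, cgo, and other non-heap memory.
func softLimitBytes(containerLimitBytes int64, fraction float64) int64 {
	return int64(float64(containerLimitBytes) * fraction)
}

func main() {
	// Assumption: this matches the memory limit configured on the service.
	const containerLimit = 4 << 30 // 4 GiB

	limit := softLimitBytes(containerLimit, 0.9)

	// SetMemoryLimit returns the previous limit. Passing a negative value
	// leaves the limit unchanged and simply reports the current one.
	previous := debug.SetMemoryLimit(limit)
	fmt.Printf("memory limit: %d -> %d bytes\n", previous, debug.SetMemoryLimit(-1))
}
```

A value set this way overrides the environment variable, so pick one mechanism per service to avoid surprises. The environment variable is usually the easier one to change without a rebuild.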

How to Configure Cloud Run Autoscaling for Maximum Efficiency?

Strategic use of min-instances and max-instances balances availability with cost protection to avoid billing spikes. My original configuration had min-instances set to 0. This is the "pure serverless" approach, but it was costing me money in the form of "Execution Time" for cold starts. Every time a new instance spun up, it took 5-6 seconds to become ready. During those seconds, I was being billed, but the instance wasn't yet serving traffic efficiently.

I changed my min-instances to 3. This sounds like it would increase costs, but it actually prevented the "scaling storm" that happened during minor traffic fluctuations. By keeping 3 instances warm, I could absorb small spikes without triggering the expensive cold-start process of a 4th or 5th instance. For the max-instances, I capped it at 50. Earlier, I had it at 100, which allowed runaway processes or bot traffic to spin up 100 containers and drain my wallet before I could react.
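One way to choose a max-instances value is to work out what the ceiling would cost if every allowed instance ran flat out for a full day. The Go sketch below does that arithmetic. The per-second rates are placeholders, not quotes from the pricing page, and `worstCaseDailyCost` is a hypothetical helper.

```go
package main

import "fmt"

// worstCaseDailyCost returns the spend if maxInstances instances of the given
// size were billed for every second of a 24-hour day.
func worstCaseDailyCost(maxInstances int, vcpus, memGiB, cpuRate, memRate float64) float64 {
	perInstanceSecond := vcpus*cpuRate + memGiB*memRate
	return float64(maxInstances) * perInstanceSecond * 86400
}

func main() {
	// Placeholder rates per vCPU-second and per GiB-second.
	const cpuRate, memRate = 0.000018, 0.000002

	for _, maxInstances := range []int{50, 100} {
		fmt.Printf("max-instances %d: worst case $%.2f per day\n",
			maxInstances, worstCaseDailyCost(maxInstances, 2, 4, cpuRate, memRate))
	}
}
```

Because the ceiling scales linearly with max-instances, halving the cap halves the worst-case exposure. If that daily figure is more than you are willing to lose before an alert reaches you, the cap is too high.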

The YAML Configuration That Saved Me

Here is the redacted version of my service.yaml that I now use as a template for all my high-performance Go services on Cloud Run. It balances performance with cost-efficiency.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: document-processor
  annotations:
    run.googleapis.com/launch-stage: BETA
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "3"
        autoscaling.knative.dev/maxScale: "50"
        run.googleapis.com/cpu-throttling: "false" # CPU Always Allocated
        run.googleapis.com/execution-environment: gen2
    spec:
      containerConcurrency: 20 # Lower concurrency for CPU-heavy tasks
      containers:
      - image: gcr.io/my-project/processor:latest
        resources:
          limits:
            cpu: "2000m" # 2 vCPUs
            memory: "4Gi"
        env:
        - name: GOMEMLIMIT
          value: "3600MiB"

Final Results: Achieving a 40% Reduction in Cloud Run Costs

Optimizing resource allocation per instance significantly lowers total monthly spend by decreasing the total billable execution time. After running with these changes for a full billing cycle, the results were undeniable. I track these metrics in a custom Looker Studio dashboard connected to my billing export in BigQuery.

  • Monthly Spend: Dropped from $1,442.80 to $858.12 (approx. 40.5% reduction).
  • P99 Latency: Improved from 4.2s to 2.9s.
  • Cold Starts: Reduced by 85% due to min-instances and GOMEMLIMIT stability.
  • CPU Utilization: Increased from 15% to 45% (meaning I am actually using the resources I pay for).

The most shocking realization was that by increasing the resources per instance (moving from 1 CPU to 2 CPUs), I actually decreased the total cost. Most developers are afraid to allocate more resources because they think "more resources = more money." That only holds if the work takes just as long. When a bigger instance finishes requests fast enough that you need fewer instance-seconds overall, "more resources = faster completion = less money."

What I Learned: Key Takeaways

If you are struggling with Cloud Run costs, here is the mental checklist I now use for every deployment:

1. Stop Trusting Default Concurrency

The default of 80 suits I/O-bound services (like a simple proxy). If your service does any heavy lifting, such as AI orchestration, image processing, or complex crypto, drop your concurrency to 10-20 and use more powerful instances. That stops the CPU contention and context switching that waste billable cycles.

2. Always Use GOMEMLIMIT

If you are using Go, GOMEMLIMIT is not optional in a containerized environment. Without it, the Go GC has no idea it's about to hit a hard wall. Setting it to 90% of your container's limit is the single easiest way to stop the crash-restart-bill cycle.

3. CPU Always Allocated is a Cost-Saver

Don't be fooled by the "Serverless" marketing. If your instances are busy most of the time, the roughly 25% discount and the lack of CPU throttling make "Always Allocated" the cheaper option. It also makes your application feel much snappier to the end user because background housekeeping isn't competing with "active" request cycles.

4. Set a Max-Scale Safety Net

Never deploy a service with an unlimited or very high max-instances unless you have deep pockets. A bug in a retry loop or a sudden DDoS attack can cost you thousands of dollars in a single night. Set a sane max-scale and use Cloud Monitoring alerts to notify you if you hit that ceiling.

What's Next

Moving forward, I’m looking into implementing a custom sidecar for more granular metric collection. While Cloud Monitoring is decent, I want to see deeper into the Go runtime's scheduler behavior during peak loads. My next challenge is to see if I can move some of the pre-processing logic to the edge using Cloudflare Workers to reduce the load on Cloud Run even further. Cost optimization is never a "one and done" task; it’s a continuous process of refining the relationship between your code and the infrastructure it lives on.
