Optimizing Cloud Run Memory with Go Worker Pools

Implementing Go worker pools on Cloud Run prevents memory exhaustion by limiting the number of concurrent goroutines processing heavy tasks. This pattern replaces unbounded concurrency with a fixed set of workers and a buffered job queue, ensuring stable memory utilization under high load.

Last Tuesday, my PagerDuty alert went off at 3:14 AM. It wasn't a subtle warning; it was a critical failure notification indicating that my primary media processing service on Cloud Run had entered a crash loop. When I pulled up the Google Cloud Console, the memory utilization graph looked like a vertical wall. We had hit our 2GB limit, and the runtime was throwing OOMKilled errors across every single instance. My "simple" Go backend, which I had bragged about for its concurrency model, was eating itself alive under a sudden burst of 5,000 concurrent video transcoding requests.

The culprit was a classic mistake I see often in Go development: unbounded concurrency. I was using the "fire and forget" pattern—spawning a new goroutine for every incoming HTTP request. While goroutines are incredibly cheap (starting at about 2KB of stack space), the tasks they were performing were not. Each goroutine was initializing a heavy library, allocating large buffers for image manipulation, and holding those buffers until the task finished. When the request rate spiked, the scheduler happily spawned thousands of goroutines, but the physical RAM couldn't keep up. I realized I had built a system that was optimized for throughput at the expense of stability.

In this post, I’m going to walk through how I rebuilt that concurrency model using a robust worker pool pattern. We’ll look at the specific code changes that dropped our memory footprint by 65% and why managing a fixed set of workers is almost always superior to letting goroutines run wild in a resource-constrained serverless environment.

The Fatal Flaw: Why Unbounded Goroutines Cause Memory Exhaustion

Spawning a new goroutine for every request without limits leads to immediate memory exhaustion in resource-constrained environments like Cloud Run. Before the refactor, my code looked something like this. It’s a pattern that works fine during local development or under light load, but it’s a ticking time bomb for production services.

func (s *Server) handleProcess(w http.ResponseWriter, r *http.Request) {
    jobID := r.URL.Query().Get("id")
    
    // The "Fire and Forget" mistake
    go func(id string) {
        if err := s.processor.Process(id); err != nil {
            log.Printf("Error processing job %s: %v", id, err)
        }
    }(jobID)

    w.WriteHeader(http.StatusAccepted)
}

The problem here is that there is no backpressure. If 1,000 requests hit this endpoint in one second, the Go runtime attempts to run 1,000 s.processor.Process calls simultaneously. In a serverless environment like Cloud Run, where you are often billed for and constrained by a specific memory limit, this is a recipe for disaster. Each process call used roughly 150MB of heap memory. You do the math: 1,000 * 150MB is way beyond the 2GB container limit.

I had previously spent time automating Cloud Run deployments with GitHub Actions and Terraform, which allowed me to roll back quickly, but the underlying architectural flaw remained. I needed a way to limit the number of active tasks while still accepting incoming requests.

How to Design a Semaphore-Based Go Worker Pool

A semaphore-based worker pool uses a buffered channel to act as a job queue, effectively capping the maximum number of concurrent operations. I decided to implement a dispatcher-worker pattern. The goal was simple: maintain a fixed number of workers (goroutines) that pull jobs from a centralized queue (a buffered channel). This allows us to control the maximum memory consumption by strictly limiting the number of concurrent "heavy" operations.

First, I defined the Job and Worker structures. I wanted the system to be flexible enough to handle different types of tasks, so I used a simple interface for the jobs.

type Job interface {
    Execute(ctx context.Context) error
}

type Worker struct {
    ID         int
    JobChannel chan Job
    Quit       chan bool
}

func NewWorker(id int, jobChannel chan Job) Worker {
    return Worker{
        ID:         id,
        JobChannel: jobChannel,
        Quit:       make(chan bool),
    }
}

The core logic lives in the Start method of the worker. It listens on the job channel and executes whatever it receives. Crucially, it also listens for a quit signal to allow for graceful shutdowns—something that is vital when Cloud Run scales your instances down.

func (w Worker) Start() {
    go func() {
        for {
            select {
            case job := <-w.JobChannel:
                // We use a background context or one passed from the dispatcher
                ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
                if err := job.Execute(ctx); err != nil {
                    log.Printf("Worker %d: error: %v", w.ID, err)
                }
                cancel()
            case <-w.Quit:
                return
            }
        }
    }()
}

The Dispatcher: How to Manage the Worker Lifecycle

The dispatcher orchestrates the worker pool by initializing a fixed number of goroutines and distributing jobs from a central queue. The Dispatcher acts as the orchestrator. It holds the job queue and manages the lifecycle of the workers. I chose to use a buffered channel for the job queue. This provides a "buffer zone" where requests can wait if all workers are currently busy, rather than blocking the HTTP response immediately.

type Dispatcher struct {
    WorkerPool chan Job
    MaxWorkers int
    JobQueue   chan Job
}

func NewDispatcher(maxWorkers int, queueSize int) *Dispatcher {
    return &Dispatcher{
        WorkerPool: make(chan Job, maxWorkers),
        MaxWorkers: maxWorkers,
        JobQueue:   make(chan Job, queueSize),
    }
}

func (d *Dispatcher) Run() {
    for i := 0; i < d.MaxWorkers; i++ {
        worker := NewWorker(i, d.JobQueue)
        worker.Start()
    }
}

With this setup, I can initialize the dispatcher when my application starts. If I know my Cloud Run instance has 2GB of RAM and each job takes ~150MB, I can safely set my MaxWorkers to 10 or 12, leaving some overhead for the Go runtime and the HTTP server itself. This is a massive improvement over the "unlimited" approach.

How to Handle Backpressure and Request Timeouts

Implementing a non-blocking drop-and-notify strategy using Go's select/default pattern prevents the job queue from blocking the entire service. One of the hardest things to get right in a concurrent system is what happens when the queue is full. If the JobQueue is full, the call to d.JobQueue <- job will block. In an HTTP handler, this means the client's connection stays open until a slot opens up. If the wait is too long, the client (or an upstream load balancer) will timeout.

I decided to implement a non-blocking "drop-and-notify" strategy for when the system is under extreme load. This is where I missed the ease of Python structured logging for a moment, as I had to be very careful with how I logged these drops to ensure I could trace them back to specific users without flooding my logs.

func (s *Server) handleProcess(w http.ResponseWriter, r *http.Request) {
    job := &VideoJob{ID: r.URL.Query().Get("id")}

    select {
    case s.dispatcher.JobQueue <- job:
        w.WriteHeader(http.StatusAccepted)
        w.Write([]byte("Job queued"))
    default:
        // The queue is full!
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("Server busy, try again later"))
        log.Printf("Queue full, dropping job %s", job.ID)
    }
}

This select block with a default case is the Go way of handling backpressure. Instead of letting the memory spiral out of control, we tell the client that we are at capacity. This is much better than crashing the entire instance, which would drop all currently processing jobs, not just the new one.

Why Context and Graceful Shutdown are Mandatory

Using the Go context package and sync.WaitGroup ensures that workers finish their current tasks before the Cloud Run instance terminates. In Go, the context package is your best friend for managing long-running tasks. When Cloud Run sends a SIGTERM to your container, you have a short window (usually 10 seconds by default, though configurable) to finish what you're doing. If you don't handle this, your workers will be killed mid-execution, potentially leaving data in a corrupted state.

I modified my dispatcher to listen for these signals and close the JobQueue. Closing the channel signals to the workers that no more work is coming. Using a sync.WaitGroup, I can ensure the main function doesn't exit until all workers have finished their current task.

func (d *Dispatcher) Shutdown(ctx context.Context) error {
    close(d.JobQueue)
    
    // Wait for workers to finish or context to timeout
    done := make(chan struct{})
    go func() {
        // Assume we added a WaitGroup to the dispatcher
        d.wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

This level of control is essential for production-grade software. It ensures that we respect the lifecycle of the infrastructure we are running on. If you are interested in how this fits into a larger system, check out my thoughts on the best project structure for scalability, as many of those architectural principles apply here regardless of the language.

Performance Benchmarks: How Worker Pools Reduced Memory by 62%

Load testing confirms that Go worker pools reduce peak memory usage by over 60% and eliminate OOMKilled errors during traffic spikes. I ran a series of load tests using k6 to compare the old "unbounded" approach with the new "worker pool" approach. I simulated a burst of 500 requests over a 10-second period on a Cloud Run instance with 2GB of RAM.

Metric	Unbounded (Old)	Worker Pool (New)	Change
Peak Memory Usage	1.98 GB (Crashed)	740 MB	-62.6%
P99 Latency	14.2s	4.1s	-71.1%
Success Rate	42%	100%	+58%
CPU Throttling	Severe	Minimal	N/A

The results were stark. While the unbounded approach initially seemed "faster" because it started all jobs immediately, the resulting CPU contention and memory pressure caused the Go garbage collector (GC) to run constantly. This "GC thrashing" consumed more CPU than the actual work itself, leading to massive latency spikes and eventual crashes. The worker pool, by contrast, kept the CPU utilization steady and the memory well within safe limits.

One interesting observation was the GOMEMLIMIT environment variable. In Go 1.19 and later, you can set this to hint to the runtime when it should be more aggressive with garbage collection. I set GOMEMLIMIT=1800MiB for my 2GB container. Combined with the worker pool, this gave the runtime a clear boundary, preventing the "OOM death spiral" where the container is killed before the GC can even finish its cycle.

Key Takeaways for Scaling Go Applications

Managing concurrency through fixed pools is essential for maintaining stability and cost-efficiency in serverless architectures. Building this system was a humbling reminder that Go's ease of concurrency can be a double-edged sword. Here are my primary takeaways from this incident:

Concurrency is not Parallelism: Spawning 10,000 goroutines on a 2-core machine doesn't make things 5,000 times faster. It just creates 10,000 things competing for the same 2 cores and limited RAM.
Always Implement Backpressure: A system that doesn't know how to say "no" is a system that will eventually fail. Using buffered channels with a select/default pattern is the most effective way to handle load spikes in Go.
Resource Limits are Hard Constraints: In serverless environments, you must design for the container's limits. Your code should be aware of the memory it consumes per unit of work.
Graceful Shutdown is Mandatory: Cloud Run and Kubernetes will kill your processes. If you don't handle SIGTERM and drain your worker pools, you will lose data and leave your system in an inconsistent state.
Observability is Key: Without structured logging and clear metrics, I would have spent days debugging why the containers were crashing. Seeing the correlation between request rate and memory growth made the problem obvious.

Search This Blog

TechFrontier | AI Automation, Python & Cloud Engineering

Optimizing Cloud Run Memory with Go Worker Pools

The Fatal Flaw: Why Unbounded Goroutines Cause Memory Exhaustion

How to Design a Semaphore-Based Go Worker Pool

The Dispatcher: How to Manage the Worker Lifecycle

How to Handle Backpressure and Request Timeouts

Why Context and Graceful Shutdown are Mandatory

Performance Benchmarks: How Worker Pools Reduced Memory by 62%

Key Takeaways for Scaling Go Applications

Related Reading

Comments

Post a Comment

Popular posts from this blog

Why I Switched from FastAPI to Rust Axum for High-Performance AI Microservices

Optimizing LLM API Latency: Async, Streaming, and Pydantic in Production

How I Built a Semantic Cache to Reduce LLM API Costs