How to Debug a Go Goroutine Leak in Cloud Run

A Go goroutine leak occurs when a goroutine is started but never terminates, often because it is blocked on a channel that is never closed or a context that is never cancelled. To resolve this, developers should use the pprof tool to identify blocked goroutines and refactor the code to use context.Context for robust lifecycle management.

Last Tuesday at 3:14 AM, my phone's vibration nearly shook it off the nightstand. PagerDuty was screaming. One of our core Go microservices, which handles high-throughput event routing for our AI orchestration layer, was hitting 95% memory utilization on Google Cloud Run. By the time I opened my laptop, the service had crashed, restarted, and was already climbing back up to the 2GB limit. This wasn't a sudden spike; it was a slow, methodical crawl—the classic signature of a resource leak.

In a managed environment like Cloud Run, memory leaks are expensive. Because of the way we had configured autoscaling, the growing memory footprint was preventing instances from being scaled down, leading to a 40% increase in our daily compute costs over just 48 hours. I had initially suspected a heap allocation issue, perhaps a growing slice or a map that wasn't being cleared, but the reality turned out to be much more insidious. I was dealing with a Go goroutine leak that was silently eating stack memory and holding onto references that the Garbage Collector (GC) couldn't reclaim.

In this post, I’ll walk through the exact steps I took to isolate the leak, the tools I used to visualize the problem, and the architectural mistake I made that led to thousands of orphaned goroutines. If you’ve ever seen your Go service's memory graph look like a staircase to heaven, this is for you.

Identifying the Symptoms of a Go Goroutine Leak

A Go goroutine leak typically manifests as a linear increase in memory usage that does not plateau, even when traffic decreases. The first thing I did was look at our Cloud Monitoring dashboard, where the "Memory Usage" metric for the service showed a perfect linear ascent. In Go, that pattern leaves two primary suspects: a heap leak or a goroutine leak.

Every goroutine in Go starts with a stack of about 2KB, which then grows on demand. While that sounds small, if you leak 50,000 goroutines (which is surprisingly easy to do) you've already lost roughly 100MB in stack space alone. More importantly, everything a leaked goroutine can still reach, including variables captured by its closure, stays on the heap for as long as the goroutine is alive.
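A rough way to check the stack arithmetic on your own machine is to park a batch of goroutines and ask the runtime how much stack memory is in use before and after. This is only a sketch: parkGoroutines is a hypothetical helper, not code from the service, and the exact per-goroutine figure depends on the Go version and platform.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkGoroutines starts n goroutines that block on a channel, the way a
// leaked listener would, and reports how much StackInuse grew as a result.
// The returned release func unblocks them and waits for them to finish.
func parkGoroutines(n int) (stackBytes uint64, release func()) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	gate := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(n)
	finished.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			started.Done()
			<-gate // park here until release is called
		}()
	}
	started.Wait() // every goroutine now exists and owns a stack

	runtime.ReadMemStats(&after)
	if after.StackInuse > before.StackInuse {
		stackBytes = after.StackInuse - before.StackInuse
	}
	return stackBytes, func() {
		close(gate)
		finished.Wait()
	}
}

func main() {
	const n = 50000
	stackBytes, release := parkGoroutines(n)
	fmt.Printf("%d parked goroutines: %d KiB of additional stack in use\n", n, stackBytes/1024)
	release()
}
```

The number printed covers stacks only. It does not include heap objects that the parked goroutines keep reachable, which is usually the larger cost.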

I noticed that the memory climb correlated perfectly with our request rate. However, when the request rate dropped during the low-traffic hours of 1 AM to 4 AM, the memory didn't drop. It just stayed flat at its elevated level. This told me that the resources weren't being tied up by active requests, but by something that was supposed to finish after a request but never did.

This service acts as a high-performance proxy for our AI agents. While our Python-based systems handle the heavy lifting of LLM logic—as I discussed in my post on Building Self-Correcting AI Agents with Gemini and Python—this Go service manages the WebSocket connections and long-polling buffers. Losing this service meant our agents were essentially "blind" to incoming telemetry.

How to Enable pprof for Real-Time Production Debugging

Enabling the net/http/pprof package allows you to capture live stack traces and identify exactly where goroutines are stuck. You cannot fix what you cannot see, so this was my first move. I know some engineers are hesitant to expose pprof in production due to security concerns, but in a debugging emergency, it is your best friend. I put the pprof endpoints on a separate internal-only admin port that isn't exposed to the public internet via the Cloud Run ingress.

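package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Run the diagnostic server in its own goroutine so it never blocks the
    // main service. Binding to localhost keeps pprof off external interfaces;
    // adjust the address if the profile must be reachable from another host.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of the microservice logic ...
}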

With the service redeployed, I used a sidecar-style approach to access the pprof data. Since Cloud Run doesn't allow direct SSH, I used gcloud compute ssh to a jump box in the same VPC and then used curl to grab a snapshot of the goroutine profile. The command I used to get a human-readable summary was:

curl "http://service-internal-ip:6060/debug/pprof/goroutine?debug=1"
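If reaching the admin port over the network is awkward, the same text can be captured from inside the process with the runtime/pprof package. The sketch below assumes you would call it from a signal handler or an authenticated admin route; goroutineDump is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump returns the goroutine profile in the same text form that
// /debug/pprof/goroutine?debug=1 serves, captured without an HTTP round trip.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	// debug=1 groups goroutines that share a stack and prints one count per group.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Print(dump)
}
```

Passing 2 instead of 1 as the debug level prints every goroutine separately with its wait reason. That is more verbose, but useful once you know which stack to look for.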

Analyzing Stack Traces to Isolate the Leaking Code

High goroutine counts in pprof output usually point to a specific function where resources are being held indefinitely. The output was staggering. In a healthy state, this service usually runs around 150 to 200 goroutines. The pprof output showed 64,281 goroutines. The vast majority of them (over 64,000) were blocked on a single operation:

64102 @ 0x43f2a5 0x440121 0x46d321 0x7a2104
#   0x46d321    github.com/techfrontier/event-router/internal/buffer.(*Stream).listen+0x121
#   0x7a2104    github.com/techfrontier/event-router/internal/buffer.NewStream.func1+0x24

This was the "smoking gun." The listen method in my buffer package was spawning goroutines that never exited. I had written this code three months ago to handle streaming telemetry back to our FastAPI-based authentication layer, which I previously wrote about in FastAPI Authentication: Scaling Production Apps with JWT and Redis. It turns out my Go service was being much less efficient than the Python backend it was talking to.
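With tens of thousands of goroutines, the debug=1 output can run long, so it may help to rank the stack groups by count. The following is a minimal sketch of such a parser, assuming the text layout shown above (a header line of the form "count @ addresses" followed by "#" frame lines). The names topStacks and stackGroup are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// stackGroup is one group of goroutines that share an identical stack.
type stackGroup struct {
	Count int    // number of goroutines in the group
	Top   string // first symbolized frame listed, e.g. "pkg.(*T).method+0x121"
}

// topStacks parses debug=1 goroutine profile text and returns the groups
// sorted by descending count.
func topStacks(profile string) []stackGroup {
	var groups []stackGroup
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		switch {
		case len(fields) >= 2 && fields[1] == "@":
			// Header line: "<count> @ <pc> <pc> ..."
			if n, err := strconv.Atoi(fields[0]); err == nil {
				groups = append(groups, stackGroup{Count: n})
			}
		case len(fields) >= 3 && fields[0] == "#" && len(groups) > 0:
			// Frame line: "# <pc> <function+offset> ..."; keep only the first one.
			if g := &groups[len(groups)-1]; g.Top == "" {
				g.Top = fields[2]
			}
		}
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i].Count > groups[j].Count })
	return groups
}

// sample reuses the trace quoted in this post.
const sample = `goroutine profile: total 64281
64102 @ 0x43f2a5 0x440121 0x46d321 0x7a2104
#   0x46d321    github.com/techfrontier/event-router/internal/buffer.(*Stream).listen+0x121
#   0x7a2104    github.com/techfrontier/event-router/internal/buffer.NewStream.func1+0x24
`

func main() {
	for _, g := range topStacks(sample) {
		fmt.Printf("%6d  %s\n", g.Count, g.Top)
	}
}
```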

Why Channel Deadlocks Cause Persistent Memory Leaks

Goroutines blocked on channel operations prevent the Garbage Collector from reclaiming associated memory, leading to Out-of-Memory (OOM) errors. I went back to the source code for internal/buffer/stream.go. The logic was supposed to create a new stream for every incoming client connection, listen for events on a channel, and then exit when the client disconnected. Here is a simplified version of what I found:

func (s *Stream) listen() {
    for {
        select {
        case event := <-s.eventChan:
            s.process(event)
        case <-s.quit:
            return
        }
    }
}

func NewStream() *Stream {
    s := &Stream{
        eventChan: make(chan Event),
        quit:      make(chan struct{}),
    }
    go s.listen()
    return s
}

At first glance, this looks fine. There is a quit channel to signal the goroutine to stop. But as I dug deeper into the lifecycle of the Stream struct, I realized that the quit channel was only ever closed if the client explicitly sent a "disconnect" message. If the client simply dropped the TCP connection (which happens constantly with mobile clients or unstable networks), the Stream object was supposed to be garbage collected.

However, the listen goroutine was still running, parked in its select and waiting on both s.eventChan and s.quit. The GC treats the stack of every live goroutine as a root, and this one held a reference to the Stream, so the Stream and its channels could never be reclaimed. I had created a circular dependency: the goroutine was keeping the object alive, and closing that object's quit channel was the only thing that could stop the goroutine.
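The effect is easy to reproduce outside the service. The sketch below rebuilds the simplified Stream from above with an empty Event type, abandons a batch of streams without closing quit, forces a collection, and counts the goroutines that remain. leakAfterGC is a hypothetical helper written for this demonstration.

```go
package main

import (
	"fmt"
	"runtime"
)

// Event stands in for the real payload type.
type Event struct{}

// Stream mirrors the simplified struct from the post.
type Stream struct {
	eventChan chan Event
	quit      chan struct{}
}

func (s *Stream) listen() {
	for {
		select {
		case <-s.eventChan:
			// event handling omitted
		case <-s.quit:
			return
		}
	}
}

func NewStream() *Stream {
	s := &Stream{eventChan: make(chan Event), quit: make(chan struct{})}
	go s.listen()
	return s
}

// leakAfterGC creates n streams, drops every reference the caller holds,
// forces a garbage collection, and returns how many extra goroutines remain.
func leakAfterGC(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		_ = NewStream() // abandoned immediately, like a dropped TCP connection
	}
	runtime.GC()
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines still parked after GC:", leakAfterGC(1000))
}
```

Because nothing ever closes quit, the collection cannot shrink the count: each listener keeps its own Stream reachable.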

Using pprof Visualization to Confirm Resource Bloat

The go tool pprof visualizer provides a graphical representation of heap allocations, which shows whether leaked goroutines are holding onto large data structures. To confirm that the parked listeners were pinning memory, I took a heap profile and opened it in the interactive web UI to see what was occupying the most space:

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

The resulting graph showed that a massive amount of memory was allocated to chan Event and the underlying buffers for the Stream struct. This confirmed that the thousands of leaked goroutines were preventing the cleanup of the associated data structures.

I also checked the official Go runtime/pprof documentation to ensure I wasn't misinterpreting the "idle" goroutine states. The documentation makes it clear: a goroutine blocked on a channel receive will stay in that state forever if the channel is never closed and no data is sent. This is exactly what was happening.

How to Fix a Go Goroutine Leak Using context.Context

Replacing manual quit channels with context.Context ensures that goroutines terminate automatically when a request or connection is closed. The fix required two changes. First, I needed to use context.Context for cancellation instead of a manual quit channel. Context is the idiomatic way to handle lifecycle management in Go, especially when dealing with network requests. Second, I needed to ensure that the context was cancelled as soon as the parent request or connection was terminated.

I refactored the Stream to accept a context:

func (s *Stream) listen(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            // Context was cancelled by the caller (e.g., connection closed)
            return
        case event, ok := <-s.eventChan:
            if !ok {
                return // Channel closed
            }
            s.process(event)
        }
    }
}

func NewStream(ctx context.Context) *Stream {
    s := &Stream{
        eventChan: make(chan Event),
    }
    // The goroutine now respects the lifecycle of the passed context
    go s.listen(ctx)
    return s
}

In the HTTP handler, I now passed the request context:

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // r.Context() is automatically cancelled when the client disconnects
    stream := NewStream(r.Context())
    // ... handle stream ...
}

This change ensured that as soon as the HTTP request ended, whether through a clean finish or a client disconnect, ctx.Done() would be closed, the listen goroutine would exit, and the GC would finally be able to clean up the Stream struct and its channels. Tying every goroutine's lifetime to a context like this is the most reliable way I know to prevent this class of leak.

Implementing Monitoring to Prevent Future Goroutine Leaks

Setting alerts on runtime.NumGoroutine() provides an early warning system for resource leaks before they impact service stability. Fixing the immediate bug was great, but I wanted to make sure this wouldn't happen again. I added a "Goroutine Guard" to our internal health check endpoint. This is a simple check that compares the current number of goroutines against a threshold.

// HealthCheckHandler logs a critical message when the goroutine count crosses
// a fixed threshold, but still reports the instance as healthy.
func HealthCheckHandler(w http.ResponseWriter, r *http.Request) {
    numGoroutines := runtime.NumGoroutine()
    if numGoroutines > 5000 {
        log.Printf("CRITICAL: High goroutine count detected: %d", numGoroutines)
        // We don't necessarily want to fail health checks and have the
        // instance killed, but we definitely want an alert triggered.
    }
    w.WriteHeader(http.StatusOK)
}

I also integrated a more robust timeout mechanism. Using context.WithTimeout in our worker pools ensures that even if a context isn't cancelled by a client, it will eventually expire and clean up its resources. This is a pattern I've started applying across all our Go services to provide a "fail-safe" for resource management.

Measuring the Impact of Optimized Memory Management

Resolving a Go goroutine leak stabilizes memory usage, reduces Cloud Run compute costs, and improves CPU scheduler efficiency. After deploying the fix, I monitored the metrics for 24 hours. The memory usage graph, which previously looked like a steep ramp, was now a flat line with small, healthy oscillations. The number of goroutines stabilized at around 140, even under peak load.

Our Cloud Run costs dropped immediately, as instances were now able to scale down to zero when not in use, rather than staying alive simply because they were "full" of leaked memory. One interesting side effect was a slight improvement in request latency. It turns out that having 60,000+ idle goroutines puts a non-trivial strain on the Go scheduler. Even though they aren't "doing" anything, the runtime still has to track them and the GC still has to scan their stacks. Once we cleared the bloat, our P99 response times dropped by about 15ms.

What I Learned / Key Takeaways

Goroutines are not free: While 2KB is small, the heap objects they keep alive are not. Always define the exact lifecycle of every goroutine you spawn. If you use the go keyword, you must know exactly what will cause that goroutine to return.
pprof is non-negotiable: Trying to debug a Go memory leak without pprof is like trying to find a needle in a haystack with your eyes closed. Make it easy to grab a profile from your production environment securely.
Context is king: Use context.Context for more than just database timeouts. It should be the primary mechanism for signaling the end of life for background workers and streams.
Don't trust the GC: The Garbage Collector can only do its job if you release references. A blocked goroutine is a permanent reference that the GC cannot override.
Monitor goroutine counts: Set up alerts on runtime.NumGoroutine(). It’s often a better leading indicator of a leak than memory usage itself, which can be masked by GC cycles.
