Go Cloud Run Memory Leaks: Diagnosing and Resolving for AI Workloads

I've been on a mission lately to squeeze every last drop of efficiency out of our services. When you're running AI inference workloads in the cloud, especially with Go microservices on platforms like Cloud Run, every megabyte of memory and every CPU cycle counts. Not just for performance, but directly for your cloud bill. My latest battle? A sneaky memory leak that was slowly but surely eating away at our Cloud Run instances, leading to increased costs and, occasionally, disruptive container restarts.

The first sign of trouble wasn't a sudden explosion, but a creeping malaise. Over a few weeks, I noticed a subtle upward trend in the memory usage graphs for one of our core Go inference services. This service is responsible for taking pre-processed data and feeding it to a smaller, specialized LLM for quick classifications, a critical part of our content generation pipeline. Initially, it was just a few extra MBs, easily dismissible as normal operational variance. But then, the curve steepened. Our average memory utilization, which used to hover comfortably around 150MB, started climbing past 200MB, then 250MB, eventually touching the 512MB limit we had set for the Cloud Run instances before they’d gracefully restart. These restarts, while handled by Cloud Run, meant increased latency for some requests and a general feeling of instability that I couldn't ignore.

This wasn't just a performance headache; it was a cost headache. Cloud Run bills for the CPU and memory you allocate to each instance, and a creeping baseline eventually forces you into a larger memory tier. While our service was mostly idle between requests, the baseline footprint kept growing, so we were paying for memory that did nothing. My gut told me we had a memory leak, and it was time for a deep dive.

Initial Symptoms and Monitoring Clues

My first stop was Google Cloud Monitoring. The memory utilization graphs for the affected Cloud Run service were clear: a sawtooth pattern, but with a steadily increasing baseline. Each peak would reach closer to the memory limit, trigger a container restart, and then the cycle would begin again from a slightly higher starting point. This pattern is a classic indicator of a memory leak – memory accumulates over time, is released on restart, but then starts accumulating again.

I also checked the request logs for any obvious errors or anomalies correlating with the memory spikes. Nothing jumped out immediately. Latency was generally stable, and error rates were low. This suggested the leak wasn't tied to a specific error condition but rather to normal, successful request processing.

Cloud Run's built-in metrics are fantastic for high-level observations, but for deep-seated memory issues within a Go application, I needed more granular data. This is where Go's built-in profiling tools, especially pprof, become indispensable.

Leveraging Go's pprof for In-Depth Analysis

The challenge with pprof in a serverless, ephemeral environment like Cloud Run is getting access to the profiling data. You can't just SSH in and run commands. My approach involved a combination of local reproduction and a temporary, controlled deployment with an exposed profiling endpoint.

Step 1: Local Reproduction

Before touching production, I tried to reproduce the issue locally. I set up a local Docker container mirroring our Cloud Run environment as closely as possible. I then hammered the local service with simulated traffic using hey (a load generator) and a custom script that mimicked our typical request patterns, including variations in input size and concurrent requests. I exposed the net/http/pprof endpoint in my local build:


package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Import pprof for HTTP endpoints

    "github.com/your-org/your-service/internal/app" // Your application logic
)

func main() {
    // ... your application setup ...

    // Expose pprof endpoints on a separate port or path
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... your main HTTP server setup ...
    log.Fatal(http.ListenAndServe(":8080", app.NewRouter()))
}

After letting the local service run for a few hours under simulated load, I could see the memory creeping up in docker stats. Bingo. The leak was reproducible.

Step 2: Capturing pprof Data

With the local reproduction confirmed, I could now confidently use pprof. I used the following commands to capture heap and goroutine profiles:


# Capture a 30-second CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Capture a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Capture a goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine

The heap profile was my primary target for memory leaks. When I ran go tool pprof -http=:8000 heap.pb (after downloading the heap profile), I immediately got a graphical representation of memory allocation. The flame graph and tree view were invaluable. I looked for functions or data structures that were holding onto an increasing amount of memory over time, especially those that didn't seem to be releasing it.

I took multiple heap profiles at different times during the local test run (e.g., after 1 hour, then after 3 hours) and compared them. This "diff" approach helps pinpoint what's growing. You can do this by downloading two profiles, say `heap1.pb` and `heap2.pb`, and then running `go tool pprof --base heap1.pb heap2.pb`.

Identifying Common Go Memory Leak Patterns

Through the pprof analysis, I narrowed down the culprits to a few common patterns. Here’s what I found and how I addressed them:

Pattern 1: Unclosed HTTP Response Bodies

One of our helper functions made an external HTTP call to a separate service to fetch some metadata for the LLM inference. I found that while we were reading the response body, we weren't always explicitly closing it. Even if you read the entire body, the underlying connection might not be released back to the connection pool if resp.Body.Close() isn't called.

Before (Leaky Code Snippet):


func fetchMetadata(id string) ([]byte, error) {
    resp, err := http.Get(fmt.Sprintf("http://metadata-service/%s", id))
    if err != nil {
        return nil, fmt.Errorf("failed to make request: %w", err)
    }
    // Missing: defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, fmt.Errorf("failed to read response body: %w", err)
    }
    return body, nil
}

The pprof heap profile showed significant allocations originating from net/http.(*Transport).getConn and related buffer allocations, indicating that HTTP connections were being held open or not properly cleaned up. Each unclosed body, even if small, contributed to a slow but steady accumulation of memory.

After (Fixed Code Snippet):


func fetchMetadata(id string) ([]byte, error) {
    resp, err := http.Get(fmt.Sprintf("http://metadata-service/%s", id))
    if err != nil {
        return nil, fmt.Errorf("failed to make request: %w", err)
    }
    defer resp.Body.Close() // Crucial: Ensure the response body is closed

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, fmt.Errorf("failed to read response body: %w", err)
    }
    return body, nil
}

This fix was straightforward but incredibly impactful. It's a classic mistake, easy to overlook, especially when error paths might bypass the `defer` or when you assume `io.ReadAll` handles everything.

Pattern 2: Unbounded Caches or Maps

Our service also implemented a simple in-memory cache for frequently requested LLM prompts to reduce redundant lookups and improve latency. While a cache is good, an unbounded cache is a memory leak waiting to happen. The pprof heap graph highlighted a `map[string]string` growing continuously, specifically in the function responsible for populating this cache.

Before (Leaky Cache):


var promptCache = make(map[string]string)
var cacheMutex sync.RWMutex

func getPrompt(key string) string {
    cacheMutex.RLock()
    if prompt, ok := promptCache[key]; ok {
        cacheMutex.RUnlock()
        return prompt
    }
    cacheMutex.RUnlock()

    // Fetch from database/external source
    prompt := fetchPromptFromDB(key)

    cacheMutex.Lock()
    promptCache[key] = prompt
    cacheMutex.Unlock()
    return prompt
}

This cache simply grew forever. Every unique prompt, no matter how old or infrequently used, stayed in memory. For an LLM inference service, the variety of prompts can be vast, making this a significant leak source.

After (Fixed with LRU Cache):

I replaced the simple map with an LRU (Least Recently Used) cache implementation. I opted for a small, well-tested LRU library to avoid reinventing the wheel.


import "github.com/hashicorp/golang-lru" // Or similar LRU implementation

var promptCache *lru.Cache

func init() {
    var err error
    // Initialize LRU cache with a maximum size, e.g., 1000 entries
    promptCache, err = lru.New(1000)
    if err != nil {
        log.Fatalf("Failed to initialize LRU cache: %v", err)
    }
}

func getPrompt(key string) string {
    if val, ok := promptCache.Get(key); ok {
        return val.(string)
    }

    prompt := fetchPromptFromDB(key)
    promptCache.Add(key, prompt)
    return prompt
}

By capping the cache size, I ensured that older, less frequently accessed prompts would be evicted, keeping the memory footprint stable. This dramatically reduced the persistent heap growth associated with the cache.

Pattern 3: Goroutine Leaks and Unread Channels

While the heap profile pointed to the biggest memory culprits, the goroutine profile also offered insights. I noticed a steady increase in the number of goroutines, specifically ones parked in "chan send" and "chan receive" states, waiting indefinitely. This usually indicates goroutines that were launched but never finished or were never cleaned up.

In our service, we had a component that asynchronously processed some post-inference analytics. This involved sending messages to a channel, which was then read by a goroutine.

Before (Leaky Goroutine):


var analyticsChan = make(chan AnalyticsData, 100) // Buffered channel

func init() {
    go func() {
        for data := range analyticsChan {
            // Process analytics data
            processAnalytics(data)
        }
    }()
}

func sendAnalytics(data AnalyticsData) {
    // This could block if channel is full and no reader is active,
    // or if the goroutine processing it dies.
    analyticsChan <- data
}

The issue here was subtle: analyticsChan was never closed, so the consuming goroutine would sit blocked on its receive forever. Worse, sendAnalytics performed a plain channel send, so whenever processAnalytics stalled and the buffer filled, every request goroutine calling it would block indefinitely, pinning its stack and captured data in memory. (An unrecovered panic in processAnalytics wouldn't leak quietly; it would crash the whole process, which is its own failure mode.) Over time, particularly during traffic bursts or error conditions, these blocked goroutines accumulated.

After (Fixed with Context and Graceful Shutdown):

I refactored this into a small service type with an explicit stop signal and a sync.WaitGroup, so the processing goroutine is signalled and waited on during application shutdown. This is crucial for graceful termination and preventing goroutine leaks.


type AnalyticsService struct {
    dataChan chan AnalyticsData
    stopChan chan struct{}
    wg       sync.WaitGroup
}

func NewAnalyticsService() *AnalyticsService {
    s := &AnalyticsService{
        dataChan: make(chan AnalyticsData, 100),
        stopChan: make(chan struct{}),
    }
    s.wg.Add(1)
    go s.run() // Start the processing goroutine
    return s
}

func (s *AnalyticsService) run() {
    defer s.wg.Done()
    for {
        select {
        case data := <-s.dataChan:
            processAnalytics(data)
        case <-s.stopChan:
            // Drain any remaining messages before exiting
            for len(s.dataChan) > 0 {
                processAnalytics(<-s.dataChan)
            }
            return
        }
    }
}

func (s *AnalyticsService) Send(data AnalyticsData) {
    select {
    case s.dataChan <- data:
        // Sent successfully
    default:
        // Handle case where channel is full (e.g., log, drop, or block)
        log.Println("Analytics channel full, dropping data.")
    }
}

func (s *AnalyticsService) Shutdown() {
    close(s.stopChan)
    s.wg.Wait() // Wait for the run goroutine to finish
    close(s.dataChan) // Safe only after wg.Wait(): no reader remains, and no Send may run after Shutdown (sending on a closed channel panics)
}

// In main or server setup:
// analyticsSvc := NewAnalyticsService()
// defer analyticsSvc.Shutdown() // Ensure shutdown is called on exit

This pattern ensures that the goroutine can gracefully exit when the application is shutting down, preventing it from becoming a zombie. While `pprof`'s goroutine profile might not directly show "memory leak" in the same way as heap, accumulated goroutines do consume memory (stack space, closure variables) and indicate a resource management problem.

Metrics and Cost Impact Post-Fix

The results were almost immediate and incredibly satisfying. After deploying the fixes, I closely monitored the Cloud Run metrics. The memory utilization graphs quickly stabilized. Instead of the steadily climbing baseline, I saw a flat line, with peaks only during active request processing, returning to a consistent low idle memory footprint.

  • Average Memory Usage: Dropped from ~250MB (and climbing) to a stable ~120MB.
  • Container Restarts: Restarts triggered by hitting the memory limit were eliminated entirely.
  • CPU Utilization: Saw a slight, but noticeable, decrease during peak loads due to less garbage collection pressure.
  • Cloud Run Costs: Projected to decrease by roughly 15-20% for this service, primarily due to lower memory allocation charges and fewer container instances being provisioned to handle restarts. Before, Cloud Run would sometimes spin up more instances to compensate for the instability, leading to higher costs.

This wasn't just about saving money; it was about stability and predictability. Our service became more robust, and I gained confidence in its long-term operational health.

What I Learned / The Challenge

This entire debugging journey reinforced several critical lessons for me. Firstly, never underestimate the power of Go's built-in profiling tools. pprof, while sometimes daunting to get started with in a containerized, serverless environment, provides unparalleled visibility into your application's runtime behavior. It's not just for CPU bottlenecks; it's a memory leak detection superpower.

Secondly, memory leaks in Go can be subtle. They're often not about raw memory allocation but about holding onto references to objects that are no longer needed, preventing the garbage collector from reclaiming them. Unclosed resources (like HTTP response bodies), unbounded data structures (like maps), and leaky goroutines are the usual suspects. It's easy to make these mistakes, even for experienced developers, which is why regular profiling and code reviews are so important.

Finally, the interplay between application-level memory management and cloud cost optimization is direct and significant. A seemingly small memory leak, left unchecked, can translate into substantial operational costs over time, especially with usage-based billing models like Cloud Run. Proactive monitoring and quick action on anomalies are key to keeping cloud bills in check.

Related Reading

Understanding how memory leaks impact your overall system performance and cost is crucial. If you're working with large language models, the memory footprint of embeddings and vector search can be another major cost driver. My colleague's deep dive on LLM Embedding and Vector Search Cost Optimization: A Deep Dive offers excellent strategies for managing those specific resource-intensive components. The principles of identifying and optimizing memory usage are highly transferable.

Furthermore, the data feeding these AI models needs to be ingested efficiently. If your data ingestion pipeline isn't cost-effective, you might be paying too much even before your Go service starts processing. Check out our recent post on Building a Cost-Effective Data Ingestion Pipeline for LLM Fine-Tuning on GCP. While that post focuses on GCP data ingestion, the lessons on resource efficiency and cost-aware architecture directly complement the memory optimization work I've detailed here.

Looking Forward

This experience has made me even more committed to baking observability and performance profiling into our development workflow from the start. I'm exploring automated ways to run pprof snapshots in non-production environments during integration tests, and even considering a lightweight, opt-in profiling agent for production, perhaps with stricter access controls and sampling rates to minimize overhead. The goal is to catch these subtle memory creeps before they become expensive problems.

We're also looking into Go's new features and best practices for memory management, especially with the upcoming Go versions. Continuous learning and adaptation are key in the fast-paced world of AI engineering and cloud infrastructure. For anyone else grappling with similar issues, I highly recommend diving deep into the official Go pprof documentation – it's a goldmine of information.
