Debugging and Resolving Python Asyncio Memory Leaks on Cloud Run
It was a quiet Tuesday morning when the alerts started trickling in. Not the usual "high latency" or "error rate spike" that we've grown accustomed to debugging, but something more insidious: a steady stream of OOMKilled (Out Of Memory Killed) instances on one of our core Python asyncio services running on Cloud Run. Initially, it was just one or two instances a day, easily dismissed as transient issues. But within a week, the frequency escalated, impacting service availability and leading to a noticeable increase in our Cloud Run billing for memory usage. My heart sank a little each time I saw the red alerts; I knew I was in for a deep dive into the murky waters of memory profiling.
The Slow Creep: Identifying the Leak
Our service, responsible for processing incoming data streams before feeding them to an LLM inference pipeline, is built on Python's asyncio. It's designed to be highly concurrent and efficient. The initial symptoms were subtle: increased container restart counts, slightly elevated latency for some requests, and then, the dreaded OOMKills. What made it particularly frustrating was that the memory usage wasn't a sudden spike; it was a slow, agonizing creep upwards over several hours until the container hit its memory limit and was unceremoniously terminated by Cloud Run. This pattern immediately screamed "memory leak" rather than a burst of high memory allocation for a specific task.
My first stop was Google Cloud Monitoring. The container/memory/usage_bytes metric for the affected Cloud Run service told a clear story. Instead of a sawtooth pattern where memory would rise during request processing and then fall back, it was a staircase, each step representing an increase that never fully receded. The graph showed a steady upward trend until it abruptly dropped (due to an OOMKill and a new instance starting), only to begin its ascent again. The container/memory/limit_bytes was set to 1GB, which had been more than sufficient for months.
Here's a simplified visualization of what I was seeing in Cloud Monitoring:
Time Memory Usage (MB)
----------------------------------
08:00 AM 250
09:00 AM 300
10:00 AM 350
11:00 AM 420
12:00 PM 500
01:00 PM 600
02:00 PM 750
03:00 PM 900
03:30 PM 980 (approaching 1GB limit)
03:35 PM (OOMKilled - new instance starts)
03:36 PM 250
... (pattern repeats)
This confirmed my suspicion. We had a bona fide memory leak. The challenge now was to find *where* in our asynchronous Python codebase it was hiding.
Diving Deep with Python's Memory Profilers
Locally reproducing a memory leak that takes hours to manifest in production can be notoriously difficult. My strategy was to simulate high load over an extended period and observe memory usage. I spun up a local Docker container mimicking our Cloud Run environment and bombarded it with requests using locust. Sure enough, after about an hour, the memory usage started climbing.
Python offers several excellent tools for memory profiling. For this particular issue, I leaned heavily on tracemalloc, a built-in module that tracks memory allocations. It's incredibly useful for identifying where memory is being allocated and, crucially, where it's *not* being deallocated.
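Beyond ranking allocation sites in a single snapshot, tracemalloc can also diff two snapshots with compare_to, which makes growth between two points in time jump out immediately. Here's a minimal, self-contained sketch of that workflow (the bytearray list is just a stand-in for whatever your service retains):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for a leak: retain ~1 MB of allocations between snapshots.
leaked = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")  # sorted by largest growth first

for stat in stats[:3]:
    print(stat)  # the bytearray allocation line should dominate the diff
```

In a long-running service, taking a snapshot every few minutes and diffing against the previous one surfaces the lines whose allocations keep growing, which is exactly the signature of a leak.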
To integrate tracemalloc into our asyncio application, I added some conditional logic to enable it during local debugging or via a specific environment variable in test environments. Here's a simplified example of how I set it up:
import tracemalloc
import asyncio
import os
import time
import objgraph  # For later visual inspection

# --- Configuration for tracemalloc ---
ENABLE_TRACEMALLOC = os.getenv("ENABLE_TRACEMALLOC", "false").lower() == "true"
TRACEMALLOC_SNAPSHOT_INTERVAL_SECONDS = 300  # Every 5 minutes

if ENABLE_TRACEMALLOC:
    tracemalloc.start()
    print("tracemalloc started.")

# --- Our (simplified) asyncio application ---
_global_cache = {}  # A potential culprit!

async def process_data(data: dict):
    # Simulate some async work
    await asyncio.sleep(0.01)
    # Imagine a scenario where we're storing something in a global cache
    # that isn't properly bounded or evicted.
    # This is a simplified example of how a leak might occur.
    if data.get("cache_me"):
        key = data["id"]
        _global_cache[key] = data  # Adding to cache without eviction!
        # This will grow unbounded if not managed.
    return {"status": "processed", "id": data["id"]}

async def main_service_loop():
    counter = 0
    last_snapshot_time = time.time()
    while True:
        # Simulate incoming requests
        data = {"id": f"item_{counter}", "value": "some_large_string" * 100}
        if counter % 100 == 0:  # Cache every 100th item to simulate a leak source
            data["cache_me"] = True
        await process_data(data)
        counter += 1

        if ENABLE_TRACEMALLOC and (time.time() - last_snapshot_time) > TRACEMALLOC_SNAPSHOT_INTERVAL_SECONDS:
            snapshot = tracemalloc.take_snapshot()
            top_stats = snapshot.statistics('lineno')
            print(f"\n--- tracemalloc Snapshot ({time.time()}) ---")
            for stat in top_stats[:10]:
                print(stat)
            print(f"Global cache size: {len(_global_cache)} items")
            print("------------------------------------------")
            last_snapshot_time = time.time()

        await asyncio.sleep(0.001)  # Small delay to prevent busy-waiting

if __name__ == "__main__":
    try:
        asyncio.run(main_service_loop())
    except KeyboardInterrupt:
        print("Service stopped.")
    finally:
        if ENABLE_TRACEMALLOC:
            snapshot = tracemalloc.take_snapshot()
            top_stats = snapshot.statistics('lineno')
            print(f"\n--- Final tracemalloc Snapshot ---")
            for stat in top_stats[:10]:
                print(stat)
            tracemalloc.stop()
When I ran this simplified example with ENABLE_TRACEMALLOC=true python your_service.py, the output from tracemalloc quickly highlighted the problem:
--- tracemalloc Snapshot (1678886400.0) ---
/path/to/your_service.py:31: size=120 KiB, count=100, average=1.2 KiB
/usr/local/lib/python3.9/collections/__init__.py:476: size=80 KiB, count=50, average=1.6 KiB
...
Global cache size: 100 items
------------------------------------------
--- tracemalloc Snapshot (1678886700.0) ---
/path/to/your_service.py:31: size=240 KiB, count=200, average=1.2 KiB
/usr/local/lib/python3.9/collections/__init__.py:476: size=160 KiB, count=100, average=1.6 KiB
...
Global cache size: 200 items
------------------------------------------
The line /path/to/your_service.py:31 (which was _global_cache[key] = data in my actual code) consistently appeared at the top of the tracemalloc report, and its size and count kept increasing with each snapshot. This was the smoking gun. It clearly indicated that objects were being allocated at that line and were not being released. The global cache was indeed growing unbounded.
Beyond tracemalloc, I also experimented with objgraph for visualizing object references, which is especially useful for circular references. While tracemalloc pointed me to the allocation site, objgraph.show_growth() or objgraph.show_backrefs() could help me understand *why* objects weren't being garbage collected if the issue wasn't a simple unbounded collection. You can learn more about tracemalloc and its capabilities in the official Python documentation.
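Even without objgraph installed, the stdlib gc and weakref modules can confirm the "something is still holding a reference" hypothesis. A small sketch (Payload and cache are illustrative names, not from our actual service):

```python
import gc
import weakref

class Payload:
    """Stand-in for a cached object we expect to be collected."""
    pass

cache = {}                    # simulates the global cache holding references
obj = Payload()
cache["k"] = obj
probe = weakref.ref(obj)      # observe collection without keeping obj alive

del obj
gc.collect()
pinned = probe() is not None  # True: the cache entry still pins the object
print("pinned by cache:", pinned)

del cache["k"]
gc.collect()
collected = probe() is None   # True: last reference gone, object collected
print("collected after eviction:", collected)
```

A weakref probe like this is a cheap way to test whether removing a suspect reference actually frees the object, before committing to a full profiling run.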
The Root Cause: An Unbounded Cache
The leak, as identified by tracemalloc, was indeed an unbounded cache. In our real service, we had a dictionary, similar to _global_cache in the example, intended to store intermediate results for a short period to prevent redundant LLM calls for identical prompts. The problem was a crucial oversight: there was no eviction policy. The cache grew indefinitely as new data came in, holding onto large string objects and other data structures, never releasing them. Over time, this accumulated memory until Cloud Run's memory limit was breached.
This type of leak is common in concurrent applications, especially when dealing with I/O-bound tasks where temporary data might be held for longer than anticipated. Another common source of leaks in asyncio services, which I've debugged in the past, involves unmanaged coroutine tasks. If tasks are created but never awaited or properly cancelled, they can hold references to objects, preventing their garbage collection. While not the primary cause of *this* particular memory leak, it's a related issue I always keep an eye on.
Other Common Python Memory Leak Patterns I Considered:
- Circular References: Although Python's garbage collector handles most circular references, complex or custom objects might sometimes pose issues, especially if they involve C extensions. objgraph is excellent for visualizing these.
- Unclosed Resources: File handles, database connections, or network sockets that are opened but never properly closed can sometimes hold onto memory or other system resources. Using async with statements for context managers helps mitigate this.
- Large Global Objects: Accidentally assigning large data structures to global variables or module-level variables without careful management can lead to permanent memory consumption.
- C Extensions: If your Python service uses C extensions, memory allocated by these extensions might not be properly released back to the Python interpreter, leading to leaks that are harder to track with Python-native tools.
The Fix: Implementing a Bounded Cache
Once the unbounded cache was identified, the solution was straightforward: implement an eviction policy. For our use case, a simple Least Recently Used (LRU) cache was perfect. Python's functools.lru_cache decorator is a fantastic, battle-tested solution for memoization, but it's designed for functions, not for a globally managed dictionary. For our global cache, I could have rolled my own on top of collections.OrderedDict or, even better, used a dedicated LRU cache library like cachetools.
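For readers who'd rather avoid a dependency, the same bounded-LRU behavior can be sketched in a few lines with collections.OrderedDict (LRUCacheSketch is a hypothetical name, not our production class):

```python
from collections import OrderedDict

class LRUCacheSketch:
    """Minimal LRU cache: recently used entries survive, oldest are evicted."""

    def __init__(self, maxsize: int):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return default

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCacheSketch(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```

cachetools implements essentially this bookkeeping (plus thread-safety considerations and alternative policies like TTL), which is why I reached for it in the real service.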
Here's how I refactored the problematic cache using cachetools.LRUCache:
import asyncio
import os
import time
import sys
from cachetools import LRUCache

# --- Configuration for tracemalloc (optional, for debugging) ---
ENABLE_TRACEMALLOC = os.getenv("ENABLE_TRACEMALLOC", "false").lower() == "true"
TRACEMALLOC_SNAPSHOT_INTERVAL_SECONDS = 300

if ENABLE_TRACEMALLOC:
    import tracemalloc
    tracemalloc.start()
    print("tracemalloc started.")

# --- Our (simplified) asyncio application with a bounded cache ---
# Bounded to 1000 items. For large objects, pass getsizeof=sys.getsizeof
# (or a custom size function) to bound by total measured bytes instead.
_bounded_cache = LRUCache(maxsize=1000)  # The fixed cache!

async def process_data_with_cache(data: dict):
    await asyncio.sleep(0.01)  # Simulate async work
    key = data["id"]
    if key in _bounded_cache:
        print(f"Cache hit for {key}")
        return _bounded_cache[key]  # Accessing updates LRU order

    # Simulate expensive operation (e.g., LLM call)
    result = {"status": "processed_new", "id": data["id"], "content": data["value"]}
    _bounded_cache[key] = result  # Add to cache; oldest entry evicted at maxsize
    print(f"Cache miss for {key}, added to cache. Current cache size: {len(_bounded_cache)}")
    return result

async def main_service_loop_fixed():
    counter = 0
    last_snapshot_time = time.time()
    while True:
        data = {"id": f"item_{counter % 1500}",  # Simulate some repeated keys
                "value": "some_large_string_for_processing" * 100}
        await process_data_with_cache(data)
        counter += 1

        if ENABLE_TRACEMALLOC and (time.time() - last_snapshot_time) > TRACEMALLOC_SNAPSHOT_INTERVAL_SECONDS:
            snapshot = tracemalloc.take_snapshot()
            top_stats = snapshot.statistics('lineno')
            print(f"\n--- tracemalloc Snapshot ({time.time()}) ---")
            for stat in top_stats[:10]:
                print(stat)
            print(f"Bounded cache size: {len(_bounded_cache)} items")
            print("------------------------------------------")
            last_snapshot_time = time.time()

        await asyncio.sleep(0.001)

if __name__ == "__main__":
    try:
        asyncio.run(main_service_loop_fixed())
    except KeyboardInterrupt:
        print("Service stopped.")
    finally:
        if ENABLE_TRACEMALLOC:
            snapshot = tracemalloc.take_snapshot()
            top_stats = snapshot.statistics('lineno')
            print(f"\n--- Final tracemalloc Snapshot ---")
            for stat in top_stats[:10]:
                print(stat)
            tracemalloc.stop()
After deploying this fix, the results were immediate and satisfying. The memory usage graph in Cloud Monitoring flatlined, settling into a stable pattern well within our 1GB limit. The OOMKills ceased entirely, and the service's overall stability and latency improved. This also had a positive impact on our Cloud Run costs, as fewer instances were being restarted and memory usage was optimized.
It's worth noting that when dealing with large objects in a cache, simply limiting by item count might not be enough. If individual items are very large, you might still hit memory limits with a small number of items. In such cases, cachetools lets you pass a getsizeof function (e.g., sys.getsizeof or a custom size function) so that maxsize bounds the sum of the measured sizes of stored items, which provides a more robust, memory-aware cache.
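To make the item-count vs. bytes trade-off concrete, here's a stdlib-only sketch of a byte-bounded LRU in the same spirit (SizeAwareLRU is a hypothetical class; note that sys.getsizeof only measures shallow size, so nested objects usually need a custom size function):

```python
import sys
from collections import OrderedDict

class SizeAwareLRU:
    """LRU cache bounded by total measured bytes rather than item count."""

    def __init__(self, max_bytes: int, sizeof=sys.getsizeof):
        self.max_bytes = max_bytes
        self.sizeof = sizeof
        self.cur_bytes = 0
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            self.cur_bytes -= self.sizeof(self._data.pop(key))
        self._data[key] = value
        self.cur_bytes += self.sizeof(value)
        # Evict least recently used entries until we fit under the budget.
        while self.cur_bytes > self.max_bytes and self._data:
            _, evicted = self._data.popitem(last=False)
            self.cur_bytes -= self.sizeof(evicted)

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return default

cache = SizeAwareLRU(max_bytes=2000)
cache.put("a", "x" * 1000)  # a 1000-char str measures a bit over 1000 bytes
cache.put("b", "y" * 1000)  # pushes the total over budget, so "a" is evicted
print(cache.get("a"))       # None
```

The same budget-based eviction is what you get from cachetools when you construct the cache with a getsizeof argument, without having to maintain the bookkeeping yourself.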
What I Learned / The Challenge
This debugging adventure reinforced several critical lessons for me:
- Proactive Monitoring is Key: While alerts helped identify the problem, a consistent review of resource metrics could have caught the slow memory creep much earlier, before it escalated to OOMKills.
- Don't Trust Implicit Garbage Collection: Relying solely on Python's garbage collector without understanding object lifecycles in complex, concurrent applications is a recipe for disaster. Explicitly managing resources and understanding where references are held is paramount.
- Profiling is Essential, Even in Production: While tracemalloc has a performance overhead, having a mechanism to enable it (or other profilers like memory_profiler) in a controlled manner in non-production environments is invaluable for debugging elusive issues.
- Cloud Run's Ephemeral Nature: Cloud Run instances are restarted frequently for various reasons. While this provides resilience, it can mask slow memory leaks if the leak's accumulation period is longer than the instance's typical lifespan. When OOMKills start occurring, it's often because the leak has become severe enough to exhaust memory faster than instances are recycled.
The biggest challenge was the "slow creep" nature of the leak. It wasn't a sudden, easily reproducible crash, but a gradual degradation that made pinpointing the exact moment of failure difficult without proper historical metrics and profiling tools. It also highlighted the importance of understanding the memory implications of common Python data structures, especially when used in long-running services.
Related Reading
If you're grappling with similar performance or resource management challenges in your Python services, I highly recommend checking out these related posts:
- Python Asyncio: Identifying and Fixing Production Coroutine Leaks: This post dives into another common source of resource exhaustion in asyncio applications – unmanaged coroutines. The debugging techniques share similarities, and understanding both memory and coroutine lifecycles is crucial for robust async services.
- Optimizing LLM Inference on Cloud Run: Dynamic Batching for Cost and Latency: While focused on LLM inference, this article covers general Cloud Run optimization strategies, including how resource limits (like memory) directly impact performance and cost. Efficient memory usage, as discussed in this post, is foundational for effective dynamic batching and overall cost reduction for LLM workloads.
Looking ahead, I'm exploring ways to integrate automated memory profiling into our CI/CD pipeline for critical services. The idea is to run load tests with memory profiling enabled, flagging any significant memory growth over baseline. This proactive approach should help us catch these insidious leaks before they ever make it to production. Furthermore, I'm evaluating more sophisticated anomaly detection on our Cloud Monitoring metrics to alert us to subtle, long-term trends in resource consumption, rather than just threshold breaches. The journey to perfectly optimized, leak-free services is continuous, but each challenge overcome makes our systems more resilient and our development process smarter.