Debugging Python Memory Leaks in Containerized FastAPI Apps

Python memory leaks in FastAPI applications are frequently caused by circular references in complex state objects and the glibc allocator's failure to release memory to the OS. These issues can be resolved by implementing weakref for back-references and using malloc_trim to force the system to reclaim unused memory from the heap.

Last Tuesday, I woke up to a string of PagerDuty alerts that every developer dreads: Cloud Run Instance Termination: Memory Limit Exceeded. My FastAPI service, which handles the orchestration for my AI agent workflows, was hitting its 4GB RAM ceiling and restarting every twenty minutes. In the world of serverless, this isn't just a stability issue; it’s a direct hit to the wallet. Because I had "min-instances" set to five to keep latency low, the constant cycling was triggering cold starts and driving my GCP bill up by roughly $45 a day.

The service in question is part of a system I’ve been documenting lately, specifically the logic behind building an interruptible AI workflow engine with Python generators. While that architecture solved my state management problems, it clearly introduced a silent killer in the form of a memory leak. I had assumed Python’s garbage collector would handle the cleanup of my generator states, but I was wrong. The Resident Set Size (RSS) of my containers was climbing linearly—about 15MB per request—and never returning to the baseline.

In this post, I’m going to walk through the exact process I used to isolate the leak, the tools that actually worked in a containerized environment, and the specific code changes that dropped my steady-state memory usage from 3.8GB back down to a crisp 240MB.

Why Standard Cloud Monitoring Fails to Detect Python Memory Leaks

Standard cloud metrics often show memory exhaustion but fail to identify the specific Python objects or C-extensions responsible for the leak. My first instinct was to look at the Cloud Monitoring dashboard. It showed the classic "sawtooth" pattern: memory goes up, hits the limit, the container crashes, and it starts over. However, GCP’s metrics don't tell you *what* is holding onto the memory. Is it the Python heap? Is it a C-extension? Is it the buffer cache?

I initially suspected my vector database client, thinking I might be over-allocating buffers during similarity searches. I had recently been optimizing vector database costs for production RAG, and I thought I might have broken something in the connection pooling logic. But after disabling the vector DB calls and mocking the responses, the leak persisted. This was a pure Python problem.

The difficulty with debugging memory in Python is that sys.getsizeof() is almost useless for complex objects. It only reports the size of the object itself, not the objects it references. In a containerized environment, you also have the overhead of the OS and the web server (Gunicorn/Uvicorn) workers. I needed to see inside the heap while the service was under load.
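To make that limitation concrete, here is a minimal sketch. The deep_sizeof helper is hypothetical (it is not part of the service) and only walks built-in containers, so its total is a rough lower bound rather than an exact measurement.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Roughly sum sys.getsizeof over an object and the built-in containers it holds."""
    seen = _seen if _seen is not None else set()
    if id(obj) in seen:  # skip shared objects and cycles we already counted
        return 0
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += deep_sizeof(item, seen)
    return total


# Stand-in for a state object holding a lot of text (illustrative data only)
payload = {"history": [str(i % 10) * 10_000 for i in range(100)]}
shallow = sys.getsizeof(payload)  # the dict object only
deep = deep_sizeof(payload)       # dict + list + every string it references
print(f"shallow={shallow} bytes, deep={deep} bytes")
```

The shallow number stays tiny no matter how much text the dictionary points to, which is why it says nothing useful about a leaking state object.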

How to Reproduce Python Memory Leaks in a Local Docker Environment

Reproducing a memory leak locally requires a containerized environment that mirrors production limits and a load-testing script to simulate high traffic. You can't debug what you can't measure. I created a docker-compose.yml that mirrored my production Cloud Run environment, limiting the memory to 1GB to speed up the failure. I then wrote a simple Locust script to hammer the /process endpoint.


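# Simple reproduction script snippet
import time

import requests

def simulate_load(total_requests=100):
    for i in range(total_requests):
        resp = requests.post(
            "http://localhost:8080/v1/agent/run",
            json={"task": "debug leak"},
            timeout=30,  # don't hang forever if the container is being OOM-killed
        )
        print(f"Request {i}: {resp.status_code}")
        time.sleep(0.1)  # brief pause between requests

if __name__ == "__main__":
    simulate_load()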

Within 50 requests, I saw the Docker stats showing memory usage climbing from 150MiB to 600MiB. It was reproducible. Now I needed to instrument the code.

Using Tracemalloc to Identify High-Allocation Code Paths

Python’s built-in tracemalloc module is the gold standard for pinpointing the lines of code that allocate the most memory. It lets you take snapshots of memory allocations and compare them. I added a temporary debugging endpoint to my FastAPI app that returns the 10 lines of code currently holding the most allocated memory.


import tracemalloc
from fastapi import APIRouter

router = APIRouter()
tracemalloc.start()

@router.get("/debug/memory")
async def get_memory_stats():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    
    result = []
    for stat in top_stats[:10]:
        result.append(str(stat))
    return {"top_allocations": result}

After running 20 requests, I hit that endpoint. The results were illuminating:


{
  "top_allocations": [
    "/app/internal/agents/executor.py:142: size=42.5 MiB, count=1240, average=35 KiB",
    "/usr/local/lib/python3.11/site-packages/pydantic/main.py:341: size=12.2 MiB, count=8500, average=1.4 KiB",
    ...
  ]
}

Line 142 in executor.py was the culprit. It was where I was instantiating my AgentState object within a generator loop. But why wasn't it being cleared? Python uses reference counting, and when a variable goes out of scope, it should be reaped.
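The endpoint above reports totals at a single point in time. To measure growth between two moments instead, snapshots can be diffed with compare_to. The following is a minimal sketch: top_growth is a hypothetical helper, and the list of byte strings only stands in for state that is retained between requests.

```python
import tracemalloc


def top_growth(before, after, limit=10):
    """Return the source lines whose allocated size grew between two snapshots."""
    stats = after.compare_to(before, "lineno")
    # compare_to orders results by absolute size difference; keep positive growth only
    return [str(stat) for stat in stats if stat.size_diff > 0][:limit]


if not tracemalloc.is_tracing():
    tracemalloc.start()

baseline = tracemalloc.take_snapshot()
retained = [bytes(1024) for _ in range(1000)]  # stand-in for leaked state
current = tracemalloc.take_snapshot()

for line in top_growth(baseline, current, limit=3):
    print(line)
```

In a service, the baseline would typically be taken once after warm-up and the comparison run after a batch of requests, so that one-time startup allocations drop out of the diff.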

How Circular References and AI State Objects Cause Memory Bloat

Circular references between parent and child objects prevent Python’s reference counter from reaching zero, delaying or preventing garbage collection. In my AI workflow engine, I have an AgentExecutor class that manages a list of Step objects. Each Step object, for convenience, had a back-reference to its parent Executor.


class AgentExecutor:
    def __init__(self):
        self.steps = []
        self.context = {}

    def add_step(self, step_func):
        step = Step(func=step_func, parent=self) # The Leak
        self.steps.append(step)

class Step:
    def __init__(self, func, parent):
        self.func = func
        self.parent = parent # Circular Reference

In standard Python, circular references are handled by the Garbage Collector (GC). However, I was using __del__ methods in some of my classes to log when an agent finished its task. I learned the hard way that in older versions of Python (pre-3.4), objects with __del__ methods involved in a circular reference were marked as "uncollectable." While modern Python (3.4+) can handle this via PEP 442, there are still edge cases where the GC takes a long time to trigger, or doesn't trigger at all if the allocation pressure isn't high enough in the specific generation.
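The effect of a cycle is easy to observe in isolation. This is a minimal sketch with simplified stand-in classes (Executor and Step here are illustrative, not the real ones), and the cyclic collector is disabled only to simulate the window in which it has not run yet.

```python
import gc
import weakref


class Executor:
    def __init__(self):
        self.steps = []

    def add_step(self):
        self.steps.append(Step(parent=self))


class Step:
    def __init__(self, parent):
        self.parent = parent  # strong back-reference, which creates the cycle


gc.disable()  # simulate "the cyclic GC has not kicked in yet"
try:
    executor = Executor()
    executor.add_step()
    probe = weakref.ref(executor)  # observe the object without keeping it alive

    del executor
    alive_after_del = probe() is not None      # the cycle keeps the refcount above zero

    gc.collect()
    alive_after_collect = probe() is not None  # the cyclic collector frees it
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)
```

On CPython this prints True False: dropping the last outside reference is not enough, and the memory comes back only after a collection pass.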

But the real kicker was how I was handling the Gemini API responses. I was storing the raw response objects inside the AgentState. These objects are backed by C extensions for protobuf handling. Combined with the circular reference in my executor, every cycle that lingered kept a large response payload alive with it.

Visualizing Object Reference Graphs with Objgraph to Spot Leaks

Objgraph allows developers to generate visual maps of object relationships, making it easy to spot uncollected reference cycles in the heap. To confirm the circular reference, I used objgraph. I added a call to generate a graph of the objects that were leaking.


import objgraph

# Inside the debug endpoint
objgraph.show_most_common_types(limit=10)
# To see what still refers to a surviving AgentExecutor:
leaked = objgraph.by_type("AgentExecutor")
if leaked:
    # Rendering to PNG requires Graphviz to be installed in the container
    objgraph.show_backrefs(leaked[0], max_depth=5, filename="chain.png")

The generated graph showed a massive web of AgentExecutor -> Step -> AgentExecutor. Even after the request finished and the local variable in the FastAPI route was gone, the objects stayed in memory because their reference count never hit zero, and the generational GC hadn't kicked in yet.
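If installing objgraph and Graphviz in the container is not an option, a rough standard-library approximation of the type counts is possible with gc.get_objects(). This is a sketch, not a replacement for the graph: both helper names are hypothetical, and only GC-tracked container objects are visible this way.

```python
import gc
from collections import Counter


def most_common_types(limit=10):
    """Count live GC-tracked objects by type name, largest first."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


def count_instances(type_name):
    """Number of live GC-tracked objects whose class is named type_name."""
    return sum(1 for obj in gc.get_objects() if type(obj).__name__ == type_name)


for name, count in most_common_types(5):
    print(f"{name:<30} {count}")
```

Calling count_instances("AgentExecutor") before and after a batch of requests gives a quick signal: a number that keeps climbing points at instances that are never released.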

How to Fix Python Memory Leaks Using Weakref and Malloc_trim

Replacing hard references with weakref allows objects to be collected while still being accessible, and malloc_trim forces the OS to reclaim unused memory. The fix was two-fold. First, I replaced the hard back-references with weakref. A weak reference allows you to access the parent object without increasing its reference count.


import weakref

class Step:
    def __init__(self, func, parent):
        self.func = func
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent_ref()
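To check that the weak back-reference really removes the cycle, the same kind of experiment can be run with the collector switched off. This is a minimal sketch with simplified stand-in classes; the real AgentExecutor carries more state.

```python
import gc
import weakref


class Executor:
    def __init__(self):
        self.steps = []

    def add_step(self):
        self.steps.append(Step(parent=self))


class Step:
    def __init__(self, parent):
        self._parent_ref = weakref.ref(parent)  # does not raise the parent's refcount

    @property
    def parent(self):
        return self._parent_ref()  # None once the parent has been freed


gc.disable()  # show that reference counting alone is now sufficient
try:
    executor = Executor()
    executor.add_step()
    step = executor.steps[0]
    probe = weakref.ref(executor)

    parent_reachable = step.parent is executor
    del executor
    freed_without_gc = probe() is None
    parent_after_free = step.parent
finally:
    gc.enable()

print(parent_reachable, freed_without_gc, parent_after_free)
```

On CPython this prints True True None. The last value is the trade-off: any code that reads step.parent has to cope with None if a step can outlive its executor.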

Second, I realized that Python’s garbage collector doesn't always play well with container memory limits. The GC triggers based on the number of allocations vs. deallocations, but it doesn't know that the Docker container is about to hit a hard 1GB limit. It might think it still has plenty of "system" memory when the cgroup is actually about to kill it.
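The mismatch is visible from inside the process. The sketch below prints the collector's allocation-count thresholds next to the container's memory limit. The two file paths are the usual cgroup v2 and v1 locations and are an assumption about how the container is set up; read_cgroup_memory_limit is a hypothetical helper.

```python
import gc
from pathlib import Path

# Assumed mount points: cgroup v2 first, then the legacy v1 location
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def read_cgroup_memory_limit():
    """Return the container memory limit in bytes, or None if it cannot be determined."""
    for path in CGROUP_LIMIT_FILES:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        if raw.isdigit():  # cgroup v2 reports the string "max" when unlimited
            return int(raw)
    return None


print("GC thresholds (gen0, gen1, gen2):", gc.get_threshold())
print("Pending counts per generation:   ", gc.get_count())
print("cgroup memory limit (bytes):     ", read_cgroup_memory_limit())
```

Nothing in the thresholds is expressed in bytes, so a few large objects trapped in cycles can approach the limit long before the counters trigger a full collection. On cgroup v1, an unlimited container reports a very large number rather than "max".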

I implemented a middleware to manually trigger a collection and, crucially, call malloc_trim. This is a trick I learned from a previous project where I used Go for an LLM API proxy. In that case, Go's GC was much more predictable, which led me to investigate why Python was holding onto memory even after gc.collect().

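In Linux environments (like Cloud Run), freeing Python objects does not necessarily shrink the process. CPython's small-object allocator (pymalloc) keeps an arena around while any block in it is still in use, and larger allocations go through glibc's malloc, which can hold on to freed memory instead of returning it to the OS. Calling ctypes.CDLL('libc.so.6').malloc_trim(0) asks the glibc allocator to release free heap memory back to the system.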


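import ctypes
import gc
import itertools

from fastapi import FastAPI, Request

app = FastAPI()
_request_counter = itertools.count(1)  # per-process counter; each worker keeps its own

try:
    # glibc only: this fails on musl-based images (Alpine) and on macOS
    _malloc_trim = ctypes.CDLL("libc.so.6").malloc_trim
except (OSError, AttributeError):
    _malloc_trim = None

@app.middleware("http")
async def gc_middleware(request: Request, call_next):
    response = await call_next(request)

    # Only run this periodically or based on a threshold to avoid CPU overhead
    # For debugging, we do it every 50 requests
    if next(_request_counter) % 50 == 0:
        gc.collect()
        if _malloc_trim is not None:
            _malloc_trim(0)

    return response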
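To tell whether the trim is worth its cost, it helps to compare what the OS sees with what Python has actually allocated. The following is a minimal sketch, assuming a Linux container where /proc/self/status is readable; read_rss_bytes and memory_report are hypothetical helper names.

```python
import tracemalloc


def read_rss_bytes():
    """Return this process's resident set size in bytes (Linux only), else None."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024  # the value is reported in kB
    except OSError:
        pass
    return None


def memory_report():
    """Compare what the OS sees (RSS) with what tracemalloc has traced."""
    traced_current, traced_peak = (
        tracemalloc.get_traced_memory() if tracemalloc.is_tracing() else (0, 0)
    )
    return {
        "rss_bytes": read_rss_bytes(),
        "traced_current_bytes": traced_current,
        "traced_peak_bytes": traced_peak,
    }


print(memory_report())
```

A large and growing gap between rss_bytes and traced_current_bytes suggests memory retained by the allocator or by C extensions rather than by live Python objects, which is the case where malloc_trim can help.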

Measuring the Impact of Memory Optimization on Cloud Run Performance

Implementing these memory management strategies reduced container RAM usage by over 90% and eliminated OOM (Out of Memory) restarts. After deploying these changes, I ran my Locust load test again. The results were night and day.

Metric                       Before Fix            After Fix
Baseline Memory (RSS)        180 MB                165 MB
Memory after 100 requests    840 MB                210 MB
Memory after 500 requests    3.2 GB (OOM Risk)     235 MB
Avg Request Latency          450 ms                420 ms (Less GC thrashing)

The memory usage still grows slightly (which is normal for the Python interpreter's internal caching), but it eventually plateaus. The "sawtooth" pattern was gone, replaced by a stable, flat line. My Cloud Run instances stopped restarting, and my daily cost dropped from $65 to around $20.

Best Practices for Managing Python Memory Leaks in Production

Managing memory in serverless environments requires a proactive approach involving instrumentation, weak references, and manual allocator tuning. Here are the key takeaways from this investigation:

  • Containers are not VMs: Python's GC doesn't respect cgroup limits naturally. You have to be more aggressive with memory management in high-throughput containerized apps.
  • Avoid Circular References: Even if you think the GC will handle them, they delay object destruction and increase the complexity of the GC's job. Use weakref for back-references.
  • Instrument Early: Adding a /debug/memory endpoint (secured behind an API key, obviously) saved me hours of guessing. tracemalloc is built-in and powerful.
  • malloc_trim is magic: If your RSS is high but your Python heap looks small, it’s likely the glibc allocator holding onto memory. malloc_trim is the "fix" for this on Linux.
  • Pydantic overhead: Large Pydantic models (especially v1) can be memory-intensive when instantiated thousands of times. Moving to Pydantic v2 helped reduce the baseline allocation per request.

Fixing this leak reminded me that as we move toward more complex AI agent architectures, the "plumbing" of our applications becomes even more critical. We’re passing around huge context windows and complex state objects, and Python’s ease of use can sometimes mask significant architectural flaws. My next step is to automate this memory tracking by pushing tracemalloc snapshots to Cloud Logging whenever memory usage exceeds 70% of the container limit, so I can catch these leaks in staging before they hit my credit card.