Technical Post-Mortem: Fixing a Cascading AI Pipeline Failure

A technical post-mortem is a structured process used to identify the root cause of a system failure and implement preventative measures to ensure it does not recur. This specific framework focuses on establishing a high-resolution timeline, performing a "Five Whys" analysis, and deploying architectural safeguards like circuit breakers to protect AI-powered applications.

At 02:14 AM last Tuesday, my phone vibrated off the nightstand. It wasn’t a wrong number or a telemarketer; it was PagerDuty informing me that my production API’s error rate had spiked from 0.01% to 84% in less than three minutes. By 02:30 AM, our Cloud Run instances were hitting 100% CPU utilization and then death-spiraling into Out-of-Memory (OOM) kills. By 04:00 AM, I had stabilized the system, but we had lost roughly $450 in wasted compute and burned through a significant portion of our Gemini API quota for the day.

The immediate fix was a "restart and pray" combined with a temporary rate limit increase, but as any senior engineer knows, the "fix" isn't done until the technical post-mortem is complete. I’ve seen too many teams treat post-mortems as a bureaucratic box-ticking exercise. In my experience, a poorly executed post-mortem is a guarantee that you will be woken up by the exact same bug three months from now. I’ve spent years refining a process that focuses on technical root causes rather than finger-pointing, and I want to walk you through exactly how I handled this specific AI pipeline failure to ensure it never happens again.

This incident was particularly nasty because it involved a cascading failure between a Python FastAPI backend, a Redis Streams-based task queue, and the Gemini 1.5 Pro model. If you are building AI-powered automation, you are likely sitting on a similar powder keg of latency-induced failures. Here is how I dissected the disaster and the framework I use to turn production failures into architectural hardening.

Why the AI Pipeline Failure Caused a Cascading System Crash

Cascading failures in AI pipelines often stem from unhandled external API latency that exhausts local system resources before timeouts can trigger. Before we look at the technical post-mortem document itself, we need to understand the technical failure. Our system uses a Python workflow engine to process long-running AI agent tasks. We recently moved to a more robust architecture, which I detailed in my previous post on building a resilient Python workflow engine with Redis Streams. However, even the best architecture can crumble if you don't account for external API degradation.

The failure sequence was a classic "slow death" scenario:

  1. The Gemini API started experiencing elevated p99 latency in the us-central1 region, jumping from 4 seconds to 45 seconds per request.
  2. Our Python worker threads, which were using a standard synchronous blocking call to the Gemini API, became tied up waiting for responses.
  3. Because the workers were blocked, the Redis Stream started backing up.
  4. As the backlog grew, our Cloud Run autoscaler saw the CPU usage spike (due to the overhead of managing thousands of blocked threads) and spun up more instances.
  5. The new instances immediately pulled the oldest, most "poisonous" tasks from the stream, ran into the same 45-second responses, and eventually crashed due to memory exhaustion from the accumulated state of thousands of pending requests.
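
The arithmetic behind steps 1 through 3 can be sketched with a toy model. The worker count and arrival rate below are made-up numbers, not measurements from the incident; only the 4-second and 45-second latencies come from the sequence above.

```python
def backlog_after(seconds: float, arrival_rate: float, workers: int, latency_s: float) -> float:
    """Approximate queue depth after `seconds`, assuming each worker handles one call at a time."""
    throughput = workers / latency_s              # tasks completed per second
    growth = max(0.0, arrival_rate - throughput)  # tasks per second piling up in the stream
    return growth * seconds


# Hypothetical load: 10 tasks/s arriving, 50 blocking workers.
healthy = backlog_after(180, arrival_rate=10, workers=50, latency_s=4)
degraded = backlog_after(180, arrival_rate=10, workers=50, latency_s=45)
print(f"backlog after 3 min at 4s latency:  {healthy:.0f}")
print(f"backlog after 3 min at 45s latency: {degraded:.0f}")
```

Under these assumed numbers the same worker pool keeps up comfortably at 4 seconds and falls behind by well over a thousand tasks within three minutes at 45 seconds, without a single request returning an error.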

I realized that while I had implemented basic error handling—as I discussed in Gemini error handling and idempotency—I hadn't accounted for a "brownout" where the API doesn't fail, but simply becomes impossibly slow.

How to Establish a Timeline for a Technical Post-Mortem

A high-resolution timeline built from Cloud Logging and Cloud Monitoring data is the foundation of any effective technical post-mortem. The first thing I do is establish a sequence of events that maps system metrics to specific timestamps. I don't rely on memory; I pull logs from Cloud Logging and metrics from Cloud Monitoring. For this incident, the timeline revealed a critical 12-minute gap between the first latency spike and our first automated alert.
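
As a rough illustration of the idea, the sketch below sorts a handful of events and computes the detection gap. The event list is hand-written for the example; in practice it would come from log and metric exports. The 02:02 timestamp is inferred from the 12-minute gap and the 02:14 page, so treat it as an assumption.

```python
from datetime import datetime

# Illustrative entries only. The 02:02 entry is an inferred, hypothetical timestamp.
events = [
    ("02:14", "PagerDuty page: error rate at 84%"),
    ("02:02", "Gemini p99 latency starts climbing (inferred)"),
    ("02:30", "Cloud Run instances at 100% CPU, OOM kills begin"),
    ("04:00", "System stabilized"),
]


def parse(hhmm: str) -> datetime:
    """Parse an HH:MM string; the date part is irrelevant for same-night events."""
    return datetime.strptime(hhmm, "%H:%M")


timeline = sorted(events, key=lambda e: parse(e[0]))
for ts, description in timeline:
    print(ts, description)

first_signal = parse(timeline[0][0])
first_alert = parse("02:14")
detection_gap_min = (first_alert - first_signal).total_seconds() / 60
print(f"detection gap: {detection_gap_min:.0f} minutes")
```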

The "Five Whys" Analysis

I use the "Five Whys" method to move past the surface-level symptom. For this specific incident, it looked like this:

  • Why did the service crash? Because the Cloud Run instances hit OOM limits.
  • Why did they hit OOM limits? Because the number of concurrent Python threads exceeded the memory capacity of the container.
  • Why were there so many threads? Because the Gemini API calls were taking 45+ seconds, causing a massive queue of "in-flight" requests.
  • Why didn't the system stop sending requests? Because we lacked a circuit breaker to trip when external latency exceeded a specific threshold.
  • Why was there no circuit breaker? Because we assumed the 30-second httpx timeout was sufficient to protect the system. (It wasn't).

Implementing Circuit Breakers to Prevent AI Pipeline Failures

Protecting system stability requires moving beyond simple timeouts to active concurrency management and circuit breaker patterns. In the technical post-mortem doc, I include the "Bad Code" next to "The Fix" so the rest of the team can learn from it. In this case, the culprit was how we were handling the httpx client within our FastAPI workers. We were using a global timeout, but nothing limited how many calls could be in flight at once while waiting on that timeout.

The Original Flawed Implementation

import httpx
import asyncio

async def process_ai_task(payload):
    # This looks okay, but it's a trap.
    # A 30s timeout is too long when 1000s of tasks are arriving.
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(GEMINI_ENDPOINT, json=payload)
        return response.json()

The problem here is that if Gemini slows down to 29 seconds, every request stays open for 29 seconds without ever tripping the timeout. If you are receiving 100 tasks per second, you suddenly have roughly 2,900 requests in flight (100 tasks/s × 29 s), each holding a connection and its task state in memory. This is what killed our memory.
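
The 2,900 figure is an application of Little's Law (average items in a system = arrival rate × time spent in the system). A tiny sketch, using the 100 tasks/s rate from the paragraph above and a few example latencies:

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s


for latency in (4, 10, 29):
    print(f"{latency:>2}s latency -> {in_flight(100, latency):,.0f} requests in flight")
```

The useful takeaway is that in-flight work grows linearly with latency, so a timeout just above the degraded latency offers almost no protection.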

The Hardened Implementation

The fix involved implementing a semaphore to limit local concurrency and a circuit breaker pattern. I also reduced the timeout significantly. If the AI can't respond in 10 seconds, I'd rather fail fast and let the Redis Stream retry logic handle it later, or move the task to a Dead Letter Queue (DLQ).

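import asyncio
import logging
from datetime import timedelta

import httpx
# Swapped in for pybreaker: its decorator wraps synchronous callables, so on an
# `async def` it would record a success as soon as the coroutine object is created.
# aiobreaker is an asyncio-aware fork with a very similar interface.
from aiobreaker import CircuitBreaker

logger = logging.getLogger(__name__)

# GEMINI_ENDPOINT is assumed to be defined elsewhere in the module.

# Open the circuit after 5 consecutive failures (timeouts included); retry after 60s
breaker = CircuitBreaker(fail_max=5, timeout_duration=timedelta(seconds=60))

# Limit total concurrent AI calls per instance to 50
sem = asyncio.Semaphore(50)

@breaker
async def hardened_ai_call(payload):
    async with sem:
        async with httpx.AsyncClient(timeout=10.0) as client:
            try:
                response = await client.post(GEMINI_ENDPOINT, json=payload)
                response.raise_for_status()
                return response.json()
            except httpx.TimeoutException:
                # Log the timeout for monitoring; re-raising lets the breaker count it
                logger.warning("Gemini timeout - circuit breaker incrementing")
                raise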

By adding the semaphore, I ensured that no matter how slow the external API got, a single Cloud Run instance would never have more than 50 Gemini calls in flight at once. That bounds the memory held by open connections and pending responses, which is what prevents the OOM death spiral. Google's Site Reliability Engineering book treats this kind of overload protection, alongside load shedding, as critical for maintaining system availability during partial failures.
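
To make the breaker's behavior concrete, here is a minimal, standard-library-only sketch of an asyncio circuit breaker. It is not the library used above, just an illustration of the state machine; `AsyncCircuitBreaker`, `CircuitOpenError`, and `flaky_upstream` are hypothetical names invented for this example.

```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class AsyncCircuitBreaker:
    def __init__(self, fail_max: int = 5, reset_timeout: float = 60.0):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp of when the breaker tripped

    async def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow calls again, but one more failure re-opens.
            self.opened_at = None
            self.failures = self.fail_max - 1
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result


async def flaky_upstream():
    """Stand-in for a timed-out Gemini call."""
    raise TimeoutError("simulated upstream timeout")


async def demo():
    breaker = AsyncCircuitBreaker(fail_max=3, reset_timeout=60.0)
    outcomes = []
    for _ in range(5):
        try:
            await breaker.call(flaky_upstream)
        except CircuitOpenError:
            outcomes.append("rejected")
        except TimeoutError:
            outcomes.append("timeout")
    return outcomes


outcomes = asyncio.run(demo())
print(outcomes)
```

The first three calls reach the upstream and time out; once the breaker trips, later calls are rejected without touching the upstream at all, which is the property that stops a brownout from consuming workers.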

What Action Items Should a Technical Post-Mortem Include?

Every technical post-mortem must result in concrete, categorized tasks for immediate mitigation, engineering debt, and improved observability. A post-mortem without action items is just a diary entry. I categorize my action items into three buckets: Immediate Mitigation, Engineering Debt, and Monitoring/Alerting.

1. Immediate Mitigation (Done)

  • Implemented a 10s hard timeout on all Gemini API calls.
  • Added a concurrency semaphore to the worker loop.
  • Flushed the "poison pill" tasks from the Redis Stream that were causing immediate crashes.

2. Engineering Debt (Next Sprint)

  • Implement a proper Dead Letter Queue (DLQ) for Redis Streams. Currently, we just retry until the stream's max-age is reached. We need to move failing tasks to a separate "investigation" stream after 3 failed attempts.
  • Refactor the worker to use a singleton httpx.AsyncClient to reuse connection pools, as creating a new client per request was adding 200ms of overhead.
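
The DLQ rule in the first item can be sketched without a Redis server. The lists below are in-memory stand-ins for the two streams, and `handle_failure` is a hypothetical helper. With real Redis Streams, the per-message delivery count reported by XPENDING could plausibly play the role of `attempts`, though that wiring is not shown here.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # threshold taken from the action item above

main_stream: list[dict] = []           # stand-in for the primary Redis Stream
investigation_stream: list[dict] = []  # stand-in for the dead letter stream
attempts: dict[str, int] = defaultdict(int)


def handle_failure(task: dict) -> str:
    """Requeue a failed task, or park it in the DLQ after MAX_ATTEMPTS failures."""
    attempts[task["id"]] += 1
    if attempts[task["id"]] >= MAX_ATTEMPTS:
        investigation_stream.append(task)
        return "dead-lettered"
    main_stream.append(task)
    return "requeued"


task = {"id": "task-42", "payload": "example"}
results = [handle_failure(task) for _ in range(3)]
print(results)
```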

3. Monitoring and Alerting (High Priority)

  • Create a Cloud Monitoring dashboard for "AI Latency p99". We were only monitoring 500 errors, but we missed the latency "brownout."
  • Set up an alert for "Redis Stream Consumer Lag." If the lag exceeds 500 messages, I want a warning before the autoscaler starts panicking.
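
The two checks above boil down to a percentile and a threshold. A minimal sketch, assuming latency samples are already collected in a list; the 500-message lag threshold comes from the item above, while the 10-second p99 threshold is an assumption borrowed from the new timeout value.

```python
import statistics


def p99(samples: list[float]) -> float:
    """99th percentile via statistics.quantiles (needs at least two samples)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[98]


def should_warn(consumer_lag: int, latency_samples: list[float],
                lag_threshold: int = 500, p99_threshold_s: float = 10.0) -> list[str]:
    """Return the list of reasons to warn; an empty list means all clear."""
    reasons = []
    if consumer_lag > lag_threshold:
        reasons.append(f"consumer lag {consumer_lag} > {lag_threshold}")
    if p99(latency_samples) > p99_threshold_s:
        reasons.append("Gemini p99 latency above threshold")
    return reasons


healthy_samples = [4.0] * 100
brownout_samples = [4.0] * 98 + [45.0] * 2
print(should_warn(100, healthy_samples))
print(should_warn(600, brownout_samples))
```

Note that the brownout case would not trigger an alert based on 500-class errors at all, which is the gap described above.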

A Standardized Technical Post-Mortem Template for Engineering Teams

Using a standardized markdown template ensures that all critical failure data is captured consistently across different incidents and teams. I keep a markdown template in our internal Wiki. If you're a solo dev or a lead engineer, I highly recommend standardizing this. Here is the structure I follow:

  • Title: [YYYY-MM-DD] [Service Name] [Brief Description]
  • Status: Draft / Reviewed / Completed
  • Authors: (Who was on the call?)
  • Summary: 2-3 sentences on what happened and the impact.
  • Impact: (e.g., "84% error rate for 90 minutes, 1,200 users affected, $450 compute overage").
  • Root Cause: The "Five Whys" output.
  • Trigger: What was the specific event that started the failure?
  • Resolution: How was it fixed?
  • Action Items: A table with [Task], [Owner], and [Due Date].
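
If the template lives in a repo rather than a wiki, a small generator keeps new documents consistent. This is only a sketch: the section names mirror the list above, while the service name, description, and date in the usage line are placeholders.

```python
from datetime import date

SECTIONS = ["Status", "Authors", "Summary", "Impact", "Root Cause",
            "Trigger", "Resolution", "Action Items"]


def new_post_mortem(service: str, description: str, day: date) -> str:
    """Return an empty post-mortem skeleton in Markdown."""
    lines = [f"# {day.isoformat()} {service}: {description}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", ""]
        if section == "Status":
            lines += ["Draft", ""]
        if section == "Action Items":
            lines += ["| Task | Owner | Due Date |", "| --- | --- | --- |", ""]
    return "\n".join(lines)


# Placeholder values for illustration only.
doc = new_post_mortem("ai-worker", "Gemini latency brownout", date(2024, 1, 2))
print(doc)
```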

Key Lessons Learned from Solving AI Infrastructure Outages

Managing AI infrastructure requires a shift in focus from handling explicit errors to managing silent latency brownouts that can destabilize the entire stack. Conducting this technical post-mortem taught me a few things that I had grown complacent about. Even as a senior engineer, it's easy to forget that "resilience" isn't a feature you build once; it's a continuous process of discovery.

1. Latency is often more dangerous than errors

A 500 error is clean. Your code knows what to do with it. A response that takes 45 seconds when you budgeted for 10 is a zombie request. It eats resources, holds locks, and clogs queues. Always design your systems to "fail fast." I’d much rather return an error to a user in 2 seconds than keep them waiting for 30 seconds only to crash the server anyway.
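
One way to enforce a hard overall deadline is to wrap the call in `asyncio.wait_for`. This matters because, as far as I understand httpx's timeout model, its timeouts apply to individual phases (connect, read, write, pool) rather than to the request as a whole, so a slowly trickling response can outlive the configured value. The sketch below uses a sleep as a stand-in for the upstream call, with scaled-down durations.

```python
import asyncio


async def slow_upstream(delay_s: float) -> str:
    """Stand-in for an external API call that takes `delay_s` to answer."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_deadline(delay_s: float, deadline_s: float) -> str:
    """Cancel the upstream call once the overall deadline passes."""
    try:
        return await asyncio.wait_for(slow_upstream(delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        return "failed fast"


fast = asyncio.run(call_with_deadline(delay_s=0.01, deadline_s=0.2))
slow = asyncio.run(call_with_deadline(delay_s=1.0, deadline_s=0.05))
print(fast, slow)
```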

2. Autoscaling is a double-edged sword

Cloud Run is fantastic, but its default behavior is to scale up to meet demand. If that "demand" is actually a bottleneck in a downstream service, Cloud Run will happily spin up 100 instances, each one failing and costing you money. I’ve since adjusted our scaling parameters. I detailed some of these cost-saving measures in my post on reducing Cloud Run costs by 40%, but the most important lesson is to cap your max instances during the experimental phase of any AI agent rollout.

3. Observability must include external dependencies

We had great metrics for our FastAPI code and our Redis instance. We had almost zero metrics for the Gemini API's performance as seen from *our* side. We were relying on the provider's status page, which (surprise) didn't show any issues during our outage. You must instrument your external API calls with OpenTelemetry or custom Prometheus metrics.
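
The shape of that instrumentation is simple: time every outbound call and count its outcome. The sketch below records into in-memory dictionaries as a stand-in for a real metrics backend; `track_dependency` is a hypothetical helper, not part of OpenTelemetry or any Prometheus client.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-ins for a latency histogram and an outcome counter.
latencies_s: dict[str, list[float]] = defaultdict(list)
call_counts: dict[str, int] = defaultdict(int)


@contextmanager
def track_dependency(name: str):
    """Record wall-clock latency and ok/error counts for one outbound call."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        call_counts[f"{name}.error"] += 1
        raise
    else:
        call_counts[f"{name}.ok"] += 1
    finally:
        latencies_s[name].append(time.perf_counter() - start)


with track_dependency("gemini"):
    time.sleep(0.01)  # placeholder for the real API call

try:
    with track_dependency("gemini"):
        raise TimeoutError("simulated timeout")
except TimeoutError:
    pass

print(dict(call_counts), len(latencies_s["gemini"]))
```

Because latency is recorded in the `finally` branch, slow failures show up in the same series as slow successes, which is what a brownout dashboard needs.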


Moving forward, I’m integrating automated "Chaos Engineering" into our staging environment. Next week, I plan to write a script that artificially injects 60-second latencies into our Gemini API mock to see if our new circuit breakers and semaphores actually hold up. It’s better to break things on purpose on a Wednesday afternoon than to have them break on their own at 2 AM on a Tuesday. I'll keep documenting these failures here—because if we aren't honest about our outages, we aren't really learning anything.
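
A latency-injection test along those lines might start as small as the sketch below. It is an assumption-laden stand-in: `mock_gemini` is a fake, the injected delay is scaled down from 60 seconds to 50 milliseconds so the script finishes quickly, and only the semaphore (not the circuit breaker) is exercised.

```python
import asyncio

MAX_CONCURRENCY = 50       # matches the semaphore size used in the fix
INJECTED_LATENCY_S = 0.05  # scaled-down stand-in for a 60-second stall

in_flight = 0
peak_in_flight = 0


async def mock_gemini(payload: dict) -> dict:
    """Fake upstream that stalls and tracks how many calls overlap."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    try:
        await asyncio.sleep(INJECTED_LATENCY_S)
        return {"echo": payload}
    finally:
        in_flight -= 1


async def guarded_call(sem: asyncio.Semaphore, payload: dict) -> dict:
    async with sem:
        return await mock_gemini(payload)


async def run_chaos(num_tasks: int = 200) -> int:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    results = await asyncio.gather(
        *(guarded_call(sem, {"n": i}) for i in range(num_tasks))
    )
    return len(results)


completed = asyncio.run(run_chaos())
print(f"completed={completed}, peak in flight={peak_in_flight}")
```

If the peak ever exceeds the semaphore size, some code path is reaching the upstream without going through the guard.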
