Fixing Intermittent Python Cloud Run Connection Resets

Intermittent connection resets in Python services on Cloud Run are typically caused by a mismatch between the application's keep-alive settings and the idle timeouts of Google's network infrastructure, such as the Google Front End (GFE). To resolve this, set the httpx keepalive_expiry explicitly to 10-20 seconds and implement retry logic for RemoteProtocolError. This ensures stale connections are discarded before they cause 502 Bad Gateway errors.

I woke up at 2:14 AM last Tuesday to a PagerDuty alert that I’ve grown to loathe: a spike in 502 Bad Gateway errors on our primary AI orchestration service. For context, this service is a FastAPI application running on Google Cloud Run, responsible for chaining multiple calls to the Gemini API and our internal vector databases. It’s the backbone of the system I described in my previous post on Building a Scalable Event-Driven AI Automation System with Python. When it fails, our entire automated pipeline grinds to a halt.

The dashboard showed a 5.2% error rate. It wasn’t a total blackout, which in many ways is worse. Total blackouts are usually obvious—a bad config, a crashed container, or a DNS failure. Intermittent failures, however, are ghosts. They suggest a race condition, a resource leak, or a subtle mismatch between the application code and the cloud infrastructure it inhabits. I spent the next 14 hours digging through traces, packet captures, and library source code to find the culprit. It wasn’t a single bug, but a combination of how Python’s httpx library handles connection pooling and how Google Cloud’s infrastructure manages idle TCP connections.

Why Does Cloud Run Return Peer Closed Connection Errors?

Intermittent 502 errors often stem from the Google Front End (GFE) closing idle TCP connections before the application realizes they are dead. The logs were frustratingly vague. Most of the failed requests were throwing httpx.RemoteProtocolError: peer closed connection without sending complete message, or the underlying httpcore.RemoteProtocolError. On the surface, this looks like the server (in this case, the Gemini API or our internal Go-based microservices) is just hanging up on us. My first instinct was to blame the upstream services. I checked the Gemini API status page: all green. I checked our Go service metrics: zero 5xx errors on their end. The Go services never even saw the requests.

This narrowed the problem down to the egress path from my Python application. I realized that the "peer" closing the connection wasn't necessarily the final destination; it could be any intermediary, including the Google Front End (GFE) or the Cloud Run infrastructure itself. I noticed that these errors happened almost exclusively on requests that had been idle for more than a few seconds. If the service was under heavy, constant load, the error rate dropped. If the traffic was bursty, the error rate spiked.
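One way to test the "idle first, then fail" pattern is to bucket requests by how long the service sat idle before each one. The sketch below is a minimal illustration using only the standard library; the RequestRecord type, the 5-second threshold, and the sample data are all hypothetical, not taken from the real logs.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float  # seconds since an arbitrary reference point
    failed: bool      # True if the request ended in a connection error


def error_rate_by_idle_gap(records, idle_threshold=5.0):
    """Split requests by how long the service sat idle before them.

    Returns (error_rate_after_idle, error_rate_when_busy) as fractions.
    """
    ordered = sorted(records, key=lambda r: r.timestamp)
    idle = {"total": 0, "failed": 0}
    busy = {"total": 0, "failed": 0}
    # The first request has no predecessor, so pairs start at the second one.
    for previous, current in zip(ordered, ordered[1:]):
        gap = current.timestamp - previous.timestamp
        bucket = idle if gap > idle_threshold else busy
        bucket["total"] += 1
        bucket["failed"] += int(current.failed)

    def rate(bucket):
        return bucket["failed"] / bucket["total"] if bucket["total"] else 0.0

    return rate(idle), rate(busy)


# Hypothetical sample: a burst, a long pause, then another burst.
sample = [
    RequestRecord(0.0, False),
    RequestRecord(0.5, False),
    RequestRecord(1.0, False),
    RequestRecord(40.0, True),   # first request after a 39 s pause fails
    RequestRecord(40.4, False),
    RequestRecord(90.0, True),   # fails again after another long pause
]
idle_rate, busy_rate = error_rate_by_idle_gap(sample)
print(f"after idle: {idle_rate:.0%}, while busy: {busy_rate:.0%}")
```

If the error rate after idle gaps is far higher than the rate during sustained traffic, stale pooled connections become the prime suspect.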

How to Reproduce Network Failures in Containerized Environments

I couldn't reproduce this locally because local environments lack the specific networking constraints of a containerized environment in GCP. To get closer to the truth, I deployed a "debug" version of the container with enhanced logging for httpcore and httpx. I needed to see the lifecycle of every TCP connection. I used the following configuration to force more verbosity in my FastAPI logs:

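import logging

from fastapi import FastAPI

# Attach a handler to the root logger; without one, DEBUG records are never emitted
logging.basicConfig(level=logging.INFO)

# Increase verbosity for the underlying network libraries
logging.getLogger("httpx").setLevel(logging.DEBUG)
logging.getLogger("httpcore").setLevel(logging.DEBUG)

# Standard FastAPI setup
app = FastAPI()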

With this enabled, I saw a pattern. The application was trying to reuse a connection from its internal pool that had already been silently closed by the GCP load balancer. In Cloud Run, idle connections are eventually reaped. If your application thinks a connection is still "warm" and tries to send a request over it, but the infrastructure has already dropped that entry from its state table, you get a ConnectionResetError.
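The failure mode can be captured in a toy model. This is only a sketch of the reasoning, not how httpx or GCP decide anything internally; the function name reuse_outcome and the 30-second infrastructure timeout used in the examples are assumptions for illustration.

```python
def reuse_outcome(idle_seconds, client_keepalive_expiry, infra_idle_timeout):
    """Classify what happens when a request arrives after `idle_seconds`."""
    if idle_seconds >= client_keepalive_expiry:
        # The client has already expired the pooled connection itself.
        return "new connection"
    if idle_seconds >= infra_idle_timeout:
        # The client still trusts a connection the infrastructure has dropped.
        return "stale reuse (reset)"
    return "healthy reuse"


# Assumed infrastructure idle timeout of 30 s, request arriving after 40 s idle.
print(reuse_outcome(40, client_keepalive_expiry=60, infra_idle_timeout=30))
print(reuse_outcome(40, client_keepalive_expiry=10, infra_idle_timeout=30))
```

The only dangerous region is the window where the client's expiry is longer than the infrastructure's timeout. Keeping the client's expiry shorter closes that window entirely.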

How Connection Pooling Mismatches Cause Intermittent 502 Errors

A mismatch occurs when the Python HTTP client's keep-alive duration exceeds the infrastructure's idle timeout threshold. Most modern Python HTTP clients, like requests (through a Session) or httpx (through a Client), pool connections by default. This is generally a good thing; it saves the overhead of the TCP handshake and TLS negotiation for every request. However, httpx.AsyncClient (which I use for its superior performance in async environments) has a default keepalive_expiry of 5 seconds. Meanwhile, the infrastructure intermediaries in GCP often have their own idle timeouts, which can vary wildly depending on the specific path (Cloud NAT, VPC Egress, etc.).

I had recently optimized our memory usage, as I detailed in Debugging Python Memory Leaks in Containerized FastAPI Apps, but in doing so, I had inadvertently reduced the number of active workers. This meant each worker was handling more idle time between bursts of AI agent activity, making the connection pool "stale" more often. To resolve Python Cloud Run connection resets, you must align library settings with GCP timeouts.

Why Increasing HTTP Timeouts Often Worsens Connection Resets

Increasing the keep-alive limits without addressing the underlying expiration mismatch keeps "dead" connections in the pool for longer. My first attempt at a fix was to increase the timeouts. I thought if I gave the connections more time, the errors would vanish. I changed my client initialization to this:

import httpx

# My naive first attempt at a fix
client = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0, connect=10.0),
    limits=httpx.Limits(max_keepalive_connections=20, max_connections=100)
)

The result? The error rate actually increased. The application was confidently grabbing a connection that had been dead for 20 seconds, trying to write to it, and failing immediately. I was fighting the infrastructure instead of working with it.

How to Implement Aggressive Keep-Alive and Retries in FastAPI

The most effective way to eliminate connection resets is to reduce the keep-alive expiry to under 20 seconds and wrap requests in a retry decorator. The real solution required a two-pronged approach: making the client more aware of connection health and implementing a robust retry strategy that understands the difference between a "failed request" and a "failed connection."

Optimizing httpx AsyncClient for Google Cloud Infrastructure

I had to set keepalive_expiry explicitly and keep it significantly shorter than the infrastructure's idle timeout. In GCP, 30 seconds is a common threshold for various load balancers, but I found that setting my client to 20 seconds or even 10 seconds virtually eliminated the "stale connection" problem. I also switched to using a single global client instance to properly manage the pool across the FastAPI app lifecycle.

from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

# Optimized client configuration
HTTP_CLIENT_CONFIG = {
    "timeout": httpx.Timeout(15.0, connect=5.0),
    "limits": httpx.Limits(
        max_keepalive_connections=10, 
        max_connections=50,
        keepalive_expiry=10.0 # Shorter expiry to prevent stale connections
    ),
}

class GlobalClient:
    client: httpx.AsyncClient = None

    @classmethod
    async def get_client(cls):
        if cls.client is None:
            cls.client = httpx.AsyncClient(**HTTP_CLIENT_CONFIG)
        return cls.client

    @classmethod
    async def close_client(cls):
        if cls.client:
            await cls.client.aclose()
            cls.client = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Initialize the client
    await GlobalClient.get_client()
    yield
    # Shutdown: Clean up
    await GlobalClient.close_client()

app = FastAPI(lifespan=lifespan)

Using Tenacity for Intelligent Network-Level Retries

Even with perfect pool management, network blips happen, and you must design for failure in cloud-native environments. I implemented a retry decorator using the tenacity library, specifically targeting the RemoteProtocolError. This ensures that if we do hit a dead connection, we transparently retry the request on a fresh connection.

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import httpx

# Define which exceptions are "safe" to retry
# We only retry on network-level failures, not 4xx or 5xx responses
retry_on_network_error = retry(
    retry=retry_if_exception_type((httpx.ConnectError, httpx.RemoteProtocolError, httpx.ReadTimeout)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True
)

@retry_on_network_error
async def call_external_api(url: str, data: dict):
    client = await GlobalClient.get_client()
    response = await client.post(url, json=data)
    response.raise_for_status()
    return response.json()

This combination was the magic bullet. The keepalive_expiry=10.0 ensured that 99% of the connections in the pool were actually alive, and the tenacity retry logic caught the remaining 1% of edge cases where a connection died exactly as it was being pulled from the pool.
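To make the "retry connections, not requests" idea concrete without any third-party dependency, here is a hand-rolled sketch of the same pattern. It is not tenacity and not the production code; the decorator name, the built-in exception types standing in for httpx errors, and the two demo functions are all hypothetical.

```python
import asyncio
import functools

# Connection-level failures only; application errors are never retried.
RETRYABLE = (ConnectionResetError, ConnectionAbortedError, BrokenPipeError)


def retry_on_connection_error(attempts=3, base_delay=0.01):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except RETRYABLE:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    # Exponential backoff: base_delay, then 2x, then 4x, ...
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"flaky": 0, "bad_request": 0}


@retry_on_connection_error(attempts=3)
async def flaky_call():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionResetError("stale pooled connection")
    return "ok"


@retry_on_connection_error(attempts=3)
async def bad_request():
    calls["bad_request"] += 1
    raise ValueError("simulated 400 response")


async def demo():
    result = await flaky_call()
    try:
        await bad_request()
    except ValueError:
        pass  # not a connection error, so it was raised on the first attempt
    return result


result = asyncio.run(demo())
print(result, calls)
```

The flaky call succeeds on its third attempt, while the simulated bad request is attempted exactly once. That asymmetry is the whole point: a dead socket deserves another try, a rejected payload does not.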

How to Prevent Cloud NAT Port Exhaustion in Cloud Run

Cloud NAT port exhaustion occurs when a service exceeds its allocated source ports, leading to connection timeouts during traffic spikes. While the Python code changes solved the majority of the issues, I noticed a secondary problem during my investigation: occasional ConnectTimeout errors during peak traffic. This wasn't a connection reset; it was an inability to establish a connection at all. Debugging Python Cloud Run connection resets requires looking at both the application and the NAT gateway.

When a Cloud Run service communicates with the outside world (like the Gemini API), it often goes through a Cloud NAT gateway if you have VPC egress configured. Each NAT gateway has a limited number of ports available per VM instance. If your application opens too many concurrent connections, you hit "Port Exhaustion." This is well-documented in the Google Cloud NAT troubleshooting guide, but it’s easy to overlook until it bites you.

I had to adjust the "Minimum ports per VM instance" in our Cloud NAT settings. By default, it was set to 64, which is far too low for a high-concurrency Python AI agent that might be making 20-30 parallel API calls per request. I bumped this to 512, which provided enough headroom for our scaling needs without wasting IP addresses.
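A back-of-the-envelope calculation shows why 64 ports runs out quickly. The sketch assumes the worst case, where every parallel call holds its own connection to the same destination and nothing is shared through a pool. The figure of 10 in-flight requests is hypothetical, and the function names are for illustration only.

```python
def ports_needed_per_instance(concurrent_requests, parallel_calls_per_request):
    """Worst case: every outbound call holds its own NAT source port."""
    return concurrent_requests * parallel_calls_per_request


def has_headroom(min_ports_per_vm, concurrent_requests, parallel_calls_per_request):
    needed = ports_needed_per_instance(concurrent_requests, parallel_calls_per_request)
    return needed <= min_ports_per_vm


# Hypothetical load: 10 in-flight requests, each fanning out to 30 API calls.
needed = ports_needed_per_instance(10, 30)
print(needed, has_headroom(64, 10, 30), has_headroom(512, 10, 30))
```

A bounded connection pool lowers the real number, which is one more reason to cap max_connections rather than let the fan-out grow unchecked.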

Performance Results: Comparing Metrics Before and After Optimization

Optimizing connection pools reduced the error rate from 5.2% to 0.01% while improving P99 latency by 25%. After deploying these changes, I monitored the service for 48 hours. The results were stark:

| Metric | Before Optimization | After Optimization |
| --- | --- | --- |
| Error Rate (502/Connection Reset) | 5.2% | 0.01% |
| P99 Latency | 2.4s | 1.8s |
| Max Concurrent Connections | 142 | 48 |
| CPU Utilization (Avg) | 22% | 18% |

The latency improvement was an unexpected bonus. By properly managing the connection pool, we were spending less time in the "Connect" phase of the HTTP lifecycle. The CPU utilization drop was likely due to the reduction in exception handling and retry logic overhead that was previously clogging the event loop.
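The relative changes behind those numbers are easy to double-check. This is only a sanity check on the reported figures; the helper name percent_reduction is illustrative.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


reductions = {
    "error_rate": round(percent_reduction(5.2, 0.01), 1),
    "p99_latency": round(percent_reduction(2.4, 1.8), 1),
    "max_connections": round(percent_reduction(142, 48), 1),
    "cpu_utilization": round(percent_reduction(22, 18), 1),
}
print(reductions)
```

Note that the CPU figure here is a relative reduction; in absolute terms the drop is 4 percentage points.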

The Final Verdict: Key Takeaways for Cloud-Native Python

  • Default library settings are often wrong for Cloud Run: Libraries like httpx are designed with general-purpose defaults. In highly managed environments like Cloud Run, you must tune your keep-alive and pool settings to match the infrastructure's aggressive idle timeouts.
  • The "5-second rule": For GCP services, I now start with a 10-20 second keepalive_expiry. Anything higher is a gamble.
  • Distinguish between Request and Connection errors: Your retry logic should be surgical. Retrying a 400 Bad Request is useless, but retrying a RemoteProtocolError is essential for resilience.
  • Observability is everything: Without debug-level logs on httpcore, I would still be guessing about which side was closing the connection. If you can't see the TCP handshake, you're flying blind.
  • Check your NAT: If you use VPC egress, monitor your NAT port usage. Port exhaustion looks exactly like a slow API, but it's actually a networking bottleneck.

Correcting Python Cloud Run connection resets is a critical step for any developer building high-availability AI services on Google Cloud.

Solving this felt like a rite of passage. In the world of cloud-native Python, you aren't just writing code; you're managing a complex dance between the language runtime, the asynchronous event loop, and the opaque networking layers of your cloud provider. My next challenge is looking into how the new Gemini 1.5 Flash models handle long-lived streaming connections, as I suspect the keep-alive issues will be even more pronounced there. I'll be tracking the socket stability over the coming weeks to see if a similar tuning is required for WebSockets.
