Cloud Run Debugging: Fixing Intermittent 504 Timeouts

To fix intermittent Cloud Run 504 timeouts, you must enable "CPU is always allocated" for background tasks and strictly limit database connection pools per instance. Implementing application-level timeouts that are shorter than the platform limit ensures observability and prevents silent request failures during scaling events.

Last Tuesday at 3:14 AM, my PagerDuty went off. Again. For the third time in a week, my primary AI agent orchestration service—a Go-based backend running on Google Cloud Run—was throwing intermittent 504 Gateway Timeouts and 503 Service Unavailable errors. The frustrating part? It wasn't a total collapse. My monitoring dashboards showed a jagged "sawtooth" pattern: 99.9% availability for hours, followed by a sudden spike to 15% error rates that lasted exactly four minutes before vanishing.

When you're building AI-powered systems, these "ghost" outages are a nightmare. I initially blamed the Gemini API, assuming I was hitting rate limits or experiencing upstream latency. But the logs told a different story. My Gemini latency was stable, yet the Cloud Run instances were terminating connections before the response could be sent back to the client. I was losing state, burning tokens, and—most importantly—losing my sleep. Over the next 72 hours, I tore apart my infrastructure to find the root cause. It wasn't a single bug, but a combination of how Cloud Run handles CPU allocation and a fundamental misunderstanding of how Go's database/sql pool interacts with serverless scaling.

In this post, I’m documenting the exact steps I took to diagnose and resolve these outages. If you’re running high-concurrency workloads on Cloud Run or any serverless platform and seeing "upstream request timeout" errors that don't make sense, this technical analysis is for you.

Why Cloud Run Instances Throw 504 Gateway Timeout Errors

Intermittent 504 errors in serverless environments often stem from a combination of CPU throttling and database connection exhaustion during rapid scaling events. My service is an orchestration layer. It takes a user request, fetches context from a PostgreSQL database (Cloud SQL), sends a prompt to Gemini 1.5 Pro, and then updates the database with the agent's state. I had already implemented robust state management, which I wrote about in AI Agent State Management: Recovering Workflows Without Token Waste, but that only helps you recover from a crash—it doesn't prevent the crash itself.

The logs showed a terrifying sequence. First, a few requests would take longer than 10 seconds. Then, the Cloud Run instance would hit its concurrency limit. Because the existing requests weren't finishing, Cloud Run would spin up a new instance. That new instance would attempt to connect to the database, fail, and then the whole service would death-spiral. I saw hundreds of these errors in Cloud Logging:


{
  "textPayload": "The request failed because the instance could not be started.",
  "resource": { "type": "cloud_run_revision", "labels": { "service_name": "agent-orchestrator" } },
  "severity": "ERROR"
}

Initially, I thought I had a goroutine leak. I suspected that my Gemini streaming implementation was leaving open connections. I spent four hours profiling the heap and checking runtime.NumGoroutine(), but everything looked normal. The leak wasn't in my code's logic; it was in the infrastructure's assumptions.

How CPU Throttling Causes Intermittent 504 Timeouts in Cloud Run

Cloud Run throttles CPU to near-zero immediately after a response is sent, which freezes background goroutines and causes stale database connections. The first breakthrough came when I looked at the "CPU Allocation" setting in the Cloud Run console. By default, Cloud Run only allocates CPU during request processing. This is great for saving money, but it is catastrophic for any background task or asynchronous cleanup. I realized that my service was using a background goroutine to flush telemetry data and close database transactions after the HTTP response was sent.

When the response was sent, Cloud Run immediately throttled the CPU to near-zero. My background goroutine, which was supposed to return the database connection to the pool, was essentially frozen in time. When the next request arrived, the CPU was "thawed," but by then, the database driver often thought the connection had timed out or become stale. This led to a "Connection reset by peer" error on the *next* request, not the current one. This specific behavior is a common cause of Cloud Run 504 timeouts in high-concurrency Go applications.

I had to switch my CPU allocation to "CPU is always allocated." While this increased my baseline cost by about 18%, it eliminated the weird 504s that happened immediately after a quiet period. If you are doing anything with background threads, even just closing a logger, you cannot rely on the default CPU throttling behavior. You can read more about this in the official Google Cloud Run CPU allocation docs.
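The pattern described above can be reduced to a few lines. What follows is a minimal sketch, not the service's actual code: `flushTelemetry`, `backgroundHandler`, and `inlineHandler` are hypothetical names. It contrasts cleanup launched in a goroutine after the response is written (which can be frozen when CPU is only allocated during requests) with cleanup that finishes before the handler returns, while the request is still in flight.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// flushed counts completed cleanups; it stands in for real telemetry or DB work.
var flushed int64

// flushTelemetry is a hypothetical placeholder for post-request cleanup.
func flushTelemetry() { atomic.AddInt64(&flushed, 1) }

// backgroundHandler writes the response and leaves cleanup to a goroutine.
// With request-based CPU allocation, that goroutine may be throttled as soon
// as the handler returns, so the cleanup can stall until the next request.
func backgroundHandler(wg *sync.WaitGroup) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
		wg.Add(1)
		go func() {
			defer wg.Done()
			flushTelemetry()
		}()
	}
}

// inlineHandler finishes cleanup before returning, so the work happens while
// the request is still active.
func inlineHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "ok")
	flushTelemetry()
}

// runInline serves one request through inlineHandler and reports how many
// cleanups had completed by the time the handler returned.
func runInline() int64 {
	atomic.StoreInt64(&flushed, 0)
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	inlineHandler(httptest.NewRecorder(), req)
	return atomic.LoadInt64(&flushed)
}

// runBackground serves one request through backgroundHandler, then waits for
// the goroutine explicitly. On Cloud Run nothing guarantees it gets that time.
func runBackground() int64 {
	atomic.StoreInt64(&flushed, 0)
	var wg sync.WaitGroup
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	backgroundHandler(&wg)(httptest.NewRecorder(), req)
	wg.Wait()
	return atomic.LoadInt64(&flushed)
}

func main() {
	fmt.Println("inline cleanups:", runInline())
	fmt.Println("background cleanups after waiting:", runBackground())
}
```

Doing the work inline trades a slightly longer response for predictable cleanup, so it is only an option when the cleanup is short. For anything heavier, the always-allocated CPU setting remains the more direct fix.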

Managing Database Connection Pool Exhaustion During Cloud Run Scaling

Serverless environments require aggressive database connection limits because each instance manages its own independent pool, which can quickly overwhelm a central database. Once the CPU issue was fixed, the frequency of outages dropped, but they didn't disappear. I was still seeing spikes in latency during "cold starts," when Cloud Run spins up a new instance to handle a surge in traffic. I noticed that every time a new instance started, it would try to open 25 new connections to my Cloud SQL instance, and nothing stopped it from opening more. With an uncapped pool, a concurrency limit of 80, and a max instance count of 10, the fleet could demand up to 800 connections (10 × 80) in the worst case, so I was suddenly hitting the 500-connection limit on my database.
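The arithmetic behind that limit can be sketched as follows. This is an illustrative estimate, not a measurement: `worstCaseConns` is a hypothetical helper, and it assumes each in-flight request holds at most one database connection at a time.

```go
package main

import "fmt"

// worstCaseConns estimates the peak number of database connections a fleet
// can demand. poolCap <= 0 means the pool is uncapped, in which case each
// instance can open up to one connection per concurrent request.
func worstCaseConns(maxInstances, concurrency, poolCap int) int {
	perInstance := concurrency
	if poolCap > 0 && poolCap < concurrency {
		perInstance = poolCap
	}
	return maxInstances * perInstance
}

func main() {
	const dbLimit = 500 // connection limit quoted in the text
	for _, poolCap := range []int{0, 25, 8} {
		demand := worstCaseConns(10, 80, poolCap)
		fmt.Printf("poolCap=%d demand=%d overLimit=%t\n", poolCap, demand, demand > dbLimit)
	}
}
```

Under these assumptions only the uncapped case exceeds the 500-connection limit, which is why an explicit per-instance cap matters more than any other pool setting.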

My initial database configuration was naive. I was using the default database/sql pool settings with the pgx driver, which leave the number of open connections unlimited and don't account for the ephemeral nature of serverless instances. Here is the "bad" code I started with:


// Bad: Default settings that kill serverless DBs
db, err := sql.Open("pgx", connStr)
if err != nil {
    log.Fatal(err)
}
// I didn't set MaxOpenConns or MaxIdleConns, so database/sql used its defaults:
// unlimited open connections and 2 idle connections per instance.

In a serverless environment, you have to be extremely aggressive with your connection pooling. You aren't just managing one pool; you're managing N pools across N instances. I had to recalibrate my math. If I have 10 instances, and my DB supports 100 connections, each instance *must* be capped at 10 connections. I also needed to ensure that idle connections were closed quickly so that a dying instance didn't "hang onto" a connection slot that a new instance needed.
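The per-instance budget can be computed rather than guessed. The helper below is a minimal sketch with a hypothetical name, `maxOpenConnsPerInstance`; the `headroom` parameter is an assumption added here to reserve a few slots per instance for things like admin sessions.

```go
package main

import "fmt"

// maxOpenConnsPerInstance splits a database connection limit evenly across
// the maximum number of instances, then subtracts per-instance headroom.
// It never returns less than 1, because database/sql treats a value of 0
// or below as "no limit".
func maxOpenConnsPerInstance(dbLimit, maxInstances, headroom int) int {
	if maxInstances < 1 {
		maxInstances = 1
	}
	n := dbLimit/maxInstances - headroom
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	// The example from the text: 100 connections shared by 10 instances.
	fmt.Println(maxOpenConnsPerInstance(100, 10, 0))
	// A 500-connection database with one spare slot per instance.
	fmt.Println(maxOpenConnsPerInstance(500, 10, 1))
}
```

The result is meant to be passed to `db.SetMaxOpenConns`. Choosing a value well below the computed ceiling leaves room for other clients of the same database.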

Here is the optimized configuration I moved to:


import (
    "database/sql"
    "log"
    "os"
    "time"

    _ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver (assumes pgx v5)
)

func initDB() *sql.DB {
    // sql.Open only validates its arguments; connections are opened lazily.
    db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatalf("Unable to open database handle: %v", err)
    }

    // Crucial for Cloud Run:
    // Limit the number of open connections per instance
    db.SetMaxOpenConns(8)

    // Maintain a few idle connections to speed up new requests
    db.SetMaxIdleConns(2)

    // Retire every connection 5 minutes after it was opened, even if it is busy
    db.SetConnMaxLifetime(5 * time.Minute)

    // Close connections that sit idle for 1 minute to free up slots for other instances
    db.SetConnMaxIdleTime(1 * time.Minute)

    return db
}

After applying this change, the 503 errors during scaling events disappeared. The instances were now playing nice with the shared database resource. I also realized that I needed to integrate this more tightly with my Gemini calls, which I had previously documented in Building a Data Extraction Pipeline with Gemini Function Calling. The database needs to be ready to receive the extracted data immediately, or the whole pipeline stalls.
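One way to confirm that a capped pool is behaving is to watch the counters that database/sql already exposes through `db.Stats()`. The sketch below is an assumption-laden illustration, not part of the original service: `poolPressure` and `sampleStats` are hypothetical names, and the snapshot is made up so the code runs without a database.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// poolPressure summarizes a sql.DBStats snapshot (as returned by db.Stats()).
// It reports whether every allowed connection is in use and the average time
// callers have spent waiting for a connection.
func poolPressure(s sql.DBStats) (saturated bool, avgWait time.Duration) {
	saturated = s.MaxOpenConnections > 0 && s.InUse >= s.MaxOpenConnections
	if s.WaitCount > 0 {
		avgWait = s.WaitDuration / time.Duration(s.WaitCount)
	}
	return saturated, avgWait
}

// sampleStats builds a made-up snapshot so the sketch runs without a database.
func sampleStats() sql.DBStats {
	return sql.DBStats{
		MaxOpenConnections: 8,
		OpenConnections:    8,
		InUse:              8,
		WaitCount:          4,
		WaitDuration:       2 * time.Second,
	}
}

func main() {
	saturated, avgWait := poolPressure(sampleStats())
	fmt.Printf("saturated=%t avgWait=%s\n", saturated, avgWait)
}
```

In a real service the snapshot would come from the `*sql.DB` returned by `initDB`, logged periodically or exported as a metric. A pool that is saturated with a growing average wait suggests the cap is too tight for the instance's concurrency.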

Using Context Deadlines to Prevent Silent Cloud Run Failures

Application-level timeouts must be configured to be shorter than the infrastructure timeout to allow for proper logging and resource cleanup before the connection is severed. Even with the CPU and DB issues resolved, I still had a lingering problem: 504 timeouts on very long AI generations. Gemini 1.5 Pro can take 30-60 seconds to reason through a complex prompt. If my Cloud Run request timeout was set to 60 seconds and the Gemini API took 59 seconds, the overhead of the Go runtime and the DB update would push me over the limit.

The "Invisible Killer" here was Go's context.Context. I was passing the request context directly to the Gemini SDK and the database driver. This sounds like a best practice, but in a serverless environment with a hard platform timeout, it creates a race between your code and the infrastructure. The request context carries no deadline of its own, so if the infrastructure in front of your container gives up at 60 seconds and your code doesn't explicitly enforce a shorter limit *before* that, the client gets a generic 504 and you have no idea why.

I realized I needed to "fail fast" inside my application code so I could log exactly what happened before the infrastructure cut me off. I implemented a hard application-level timeout that was 2 seconds shorter than the Cloud Run timeout.


func handleAgentRequest(w http.ResponseWriter, r *http.Request) {
    // Cloud Run timeout is 60s, so we use 58s to allow for cleanup/logging
    ctx, cancel := context.WithTimeout(r.Context(), 58*time.Second)
    defer cancel()

    result, err := callGeminiWithRetry(ctx, r.Body)
    if err != nil {
        if errors.Is(err, context.DeadlineExceeded) {
            log.Printf("Internal Timeout: Gemini took too long for request %s", r.Header.Get("X-Request-ID"))
            http.Error(w, "Agent processing timeout", http.StatusGatewayTimeout)
            return
        }
        // Handle other errors...
        return // never fall through to saveToDB with an empty result
    }

    // Update DB with results using a fresh context, so the write is not tied to
    // the nearly expired request deadline. Note that 58s + 5s can exceed the 60s
    // platform limit, so keep this budget small.
    // (Running past the response is only safe if you have 'CPU Always Allocated' enabled!)
    updateCtx, updateCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer updateCancel()
    saveToDB(updateCtx, result)
}

This change was a game-changer for observability. Instead of seeing generic Cloud Run 504 timeouts in the load balancer logs, I started seeing specific "Agent processing timeout" messages in my application logs. This allowed me to identify which specific prompts were causing Gemini to hang and optimize them accordingly.
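The idea of deriving the application deadline from the platform limit can be tried in isolation. The sketch below is a scaled-down illustration under stated assumptions: `slowCall` is a hypothetical stand-in for the upstream LLM request, and the millisecond durations are placeholders for the 60-second and 2-second values discussed above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for an upstream call such as an LLM
// request. It returns early with ctx.Err() if the context ends first.
func slowCall(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithMargin runs slowCall under a deadline of platformTimeout minus
// margin and reports what happened as a short label.
func callWithMargin(platformTimeout, margin, work time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), platformTimeout-margin)
	defer cancel()

	err := slowCall(ctx, work)
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "app-timeout"
	default:
		return "error"
	}
}

// demo exercises one call that fits inside the deadline and one that does not.
func demo() (fast, slow string) {
	fast = callWithMargin(100*time.Millisecond, 20*time.Millisecond, time.Millisecond)
	slow = callWithMargin(100*time.Millisecond, 20*time.Millisecond, 500*time.Millisecond)
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Println("fast call:", fast)
	fmt.Println("slow call:", slow)
}
```

Because the application deadline fires before the platform limit, the slow call surfaces as a labeled timeout that the code can log, instead of a connection cut from outside.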

Load Test Results: Reducing Cloud Run Latency and Error Rates

Optimizing CPU and connection settings improved the request success rate from 84.2% to 99.95% under heavy load. To prove these changes actually worked, I ran a load test using k6. I simulated a spike from 0 to 200 concurrent users over 2 minutes. This is a classic "stress test" for serverless architectures because it forces rapid scaling and heavy DB connection churn.

Performance Metrics Before Optimization:

  • Success Rate: 84.2%
  • P99 Latency: 14.8s
  • Max DB Connections: 500 (Hard limit hit)
  • Error Pattern: Frequent 503s during instance scale-up.

Performance Metrics After Optimization:

  • Success Rate: 99.95%
  • P99 Latency: 8.2s
  • Max DB Connections: 84
  • Error Pattern: Only 2 timeouts recorded, both caught by application logic.

The reduction in P99 latency was the most surprising. By capping the database connections and keeping the CPU "warm," I eliminated the massive tail latency caused by connection-wait times and cold-start CPU throttling. The system felt snappy even under load.

Best Practices for Cloud Run Debugging and Stability

Successful Cloud Run debugging requires a deep understanding of how the platform manages compute resources and external connections differently than traditional servers. Debugging this taught me that serverless isn't just "someone else's server." It’s a specific execution environment with constraints that traditional VPS or Kubernetes deployments don't have. Here are the core lessons I'm carrying forward into my next project:

1. CPU Throttling is a Silent Killer

If your service performs any work after sending a response (logging, telemetry, DB cleanup, or async AI processing), you must enable "CPU is always allocated." In my case the baseline cost went up by about 18%, which is small compared to the engineering hours lost to debugging intermittent Cloud Run 504 timeouts.

2. Math Your Database Connections

Never leave your DB connection pool to the defaults in a serverless environment. Calculate your MaxOpenConns using the formula: (Total DB Limit / Max Cloud Run Instances) - 1. The subtracted connection leaves one spare slot per instance as headroom. This keeps the fleet's worst-case demand below the database's capacity, so peak scale alone should not trigger a 503 from connection exhaustion.

3. Application Timeouts Must Be Less Than Infrastructure Timeouts

Always set your code's internal context timeouts to be slightly shorter (1-2 seconds) than your platform's timeout. This allows your application to log the error, clean up resources, and return a meaningful error message to the client before the load balancer kills the connection.

4. Observability Requires Context

Generic 504 errors are useless. By wrapping my Gemini calls and DB transactions in specific error handlers that check for context.DeadlineExceeded, I transformed a "system is down" alert into a "this specific prompt is too slow" optimization task.

Further Resources for Cloud Run Optimization

Fixing these outages wasn't about writing more code; it was about understanding the invisible hand of the cloud provider. My backend is now stable, but the journey doesn't end here. My next challenge is tackling the cold-start latency of the Gemini SDK itself, which seems to have a heavy initialization cost in Go. I'm currently experimenting with lazy-loading the client to see if I can shave another 400ms off the initial request. I'll be documenting those benchmarks in my next post, so stay tuned.
