Go HTTP Client Leaks on Cloud Run: Preventing Resource Exhaustion with LLMs
I remember the exact moment I knew something was fundamentally wrong. It was a Monday morning, and my usual routine of checking our Cloud Run service dashboards revealed a disturbing trend. Over the weekend, the memory usage on one of our core LLM-orchestration services had slowly but steadily crept upwards. What started as a minor anomaly soon turned into a full-blown performance regression: increased latency for API calls, higher CPU utilization, and, most alarmingly, a significant spike in the number of active Cloud Run instances. Our monthly cloud bill projection was suddenly looking much, much uglier.
This particular service is responsible for routing requests to various LLM providers, both external APIs and our own self-hosted inference endpoints. It's a critical component, handling a high volume of concurrent requests. My initial hypothesis revolved around LLM payload sizes or perhaps an inefficient prompt engineering strategy, but logs showed nothing out of the ordinary in terms of request volume or complexity. The problem felt more systemic, a slow bleed rather than a sudden rupture.
The Slow Burn: Unpacking the Symptoms
The first clue was the memory profile. Using Cloud Monitoring, I observed a sawtooth pattern on memory usage, but with each "sawtooth," the baseline was higher than the last. This indicated a leak, something accumulating over time that wasn't being properly released. Paired with this, CPU usage was also elevated, even during periods of low traffic, suggesting background work or resource contention. Latency, while not catastrophic, had increased by an average of 15-20% across the board.
The most telling metric was the Cloud Run instance count. For a service usually humming along with 2-3 instances during peak hours, we were now seeing 5-7 instances, even off-peak. Cloud Run's autoscaler was desperately trying to keep up with the perceived demand and resource exhaustion, spinning up new instances to handle the load, only for those new instances to eventually succumb to the same memory creep. This was a clear sign of resource inefficiency, directly translating to increased costs.
Digging Deeper: Identifying the Culprit
My suspicion quickly turned to network interactions. Given the service's role in communicating with multiple external LLM APIs and internal inference services, inefficient handling of HTTP connections seemed plausible. In Go, the standard library's net/http package is incredibly powerful, but its default behaviors, while sensible for many use cases, can lead to subtle issues in high-concurrency, long-running server environments like Cloud Run if not understood.
I began reviewing the Go code for how we were making HTTP requests. And there it was, a pattern I'd seen before in less experienced codebases, now staring me in the face in our own production service:
```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// This is a simplified example of the problematic pattern.
// In a real service, this would be inside a handler or business logic.
func makeLLMRequestBad() ([]byte, error) {
	// PROBLEM: a new http.Client with its own http.Transport is created
	// for every request, so each request starts a fresh connection pool.
	// (A client with a nil Transport would at least fall back to the
	// shared http.DefaultTransport.)
	client := &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{}, // fresh pool, discarded after this call
	}

	req, err := http.NewRequest("GET", "http://some-llm-api.com/generate", nil)
	if err != nil {
		return nil, err
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close() // Important, but not enough to fix the leak.

	body, err := io.ReadAll(resp.Body) // io.ReadAll replaces the deprecated ioutil.ReadAll
	if err != nil {
		return nil, err
	}
	return body, nil
}

func main() {
	// Imagine this being called thousands of times concurrently
	// within a Cloud Run service's request handler.
	for i := 0; i < 100; i++ {
		_, err := makeLLMRequestBad()
		if err != nil {
			log.Printf("Error making request: %v", err)
		} else {
			log.Println("Request successful (simulated)")
		}
		time.Sleep(10 * time.Millisecond) // Simulate some work
	}
	log.Println("Simulated requests finished.")
}
```
The core issue is the per-request http.Transport. The transport owns the HTTP connection pool: it keeps connections alive via HTTP keep-alive and reuses them for subsequent requests to the same host. Build a fresh transport for every request and you discard that pool after a single use, leaving each request to dial anew while the abandoned idle connections linger until they time out. (One nuance worth knowing: a client created with a nil Transport falls back to the shared http.DefaultTransport, so a bare &http.Client{} still pools connections; the leak pattern is constructing a Transport, explicitly or via copied client configuration, per request.)
This leads to several problems:
- Socket exhaustion: every connection that is opened and then abandoned eventually enters the TIME_WAIT state at the operating-system level, consuming resources. These sockets do clean up on their own, but a high rate of new connections can exhaust the available ephemeral ports, producing "address already in use" errors.
- Memory pressure: this is not a traditional leak of unreachable memory, but each discarded transport keeps its idle connections and their buffers alive until they time out or the garbage collector reclaims them. That accumulation adds significant GC pressure and produces exactly the "slow creep" I observed.
- Increased latency: establishing a new TCP connection (and, for HTTPS, performing a TLS handshake) for every request adds significant overhead. HTTP keep-alive exists precisely to avoid this, and a per-request transport nullifies it.
- Higher CPU usage: all of the above cost CPU cycles, between connection setup, churn of short-lived resources, and increased GC activity.
The Fix: Reusing and Configuring http.Client
The solution, thankfully, is well-documented within the Go community: reuse your http.Client instances. A single http.Client (and thus its underlying http.Transport) should ideally be shared across multiple goroutines and requests, especially in a long-running service.
Here's the corrected pattern:
```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// A single package-level http.Client, reused for every HTTP request the
// service makes. http.Client is safe for concurrent use by multiple goroutines.
var httpClient *http.Client

func init() {
	// Initialize the client once, typically in an init() function or as
	// part of your application's startup. For high-concurrency scenarios
	// it is also good practice to customize the Transport.
	tr := &http.Transport{
		MaxIdleConns:        100,              // max idle connections across all hosts
		MaxIdleConnsPerHost: 10,               // max idle connections per target host (stdlib default is 2)
		IdleConnTimeout:     90 * time.Second, // how long an idle connection is kept alive
		// Other settings like TLSClientConfig, DialContext, etc., can go here.
	}
	httpClient = &http.Client{
		Timeout:   30 * time.Second, // overall request timeout
		Transport: tr,
	}
	log.Println("HTTP client initialized with custom transport.")
}

// This is a simplified example of the correct pattern.
func makeLLMRequestGood() ([]byte, error) {
	// GOOD: reuse the shared httpClient and its connection pool.
	req, err := http.NewRequest("GET", "http://some-llm-api.com/generate", nil)
	if err != nil {
		return nil, err
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body) // io.ReadAll replaces the deprecated ioutil.ReadAll
	if err != nil {
		return nil, err
	}
	return body, nil
}

func main() {
	// Imagine this being called thousands of times concurrently
	// within a Cloud Run service's request handler.
	for i := 0; i < 100; i++ {
		_, err := makeLLMRequestGood()
		if err != nil {
			log.Printf("Error making request: %v", err)
		} else {
			log.Println("Request successful (simulated)")
		}
		time.Sleep(10 * time.Millisecond) // Simulate some work
	}
	log.Println("Simulated requests finished.")
}
```
By sharing a single http.Client instance, all requests benefit from the underlying http.Transport's connection pooling. This means that after the initial connection setup, subsequent requests to the same host can reuse an existing idle connection, drastically reducing latency and resource overhead.
Fine-Tuning the http.Transport
While reusing the client is the primary fix, optimizing the http.Transport's parameters is crucial for high-performance services interacting with various APIs, especially those involving large data transfers or long-lived connections, like LLM inference.
- MaxIdleConns: the maximum number of idle (keep-alive) connections across all hosts. 100 is a reasonable starting point, but you may need to raise it if your service connects to many different external endpoints.
- MaxIdleConnsPerHost: often the more critical setting, and its stdlib default is only 2. It caps how many idle connections are kept open per target host. If your service talks primarily to a few specific LLM endpoints, setting this appropriately (e.g., 10-20) ensures a pool of warm connections for those frequent targets; set it too low and you still pay connection-setup overhead even with a shared client.
- IdleConnTimeout: how long an idle (keep-alive) connection stays open before the client closes it; the default is 90 seconds. If the server you connect to uses a shorter keep-alive timeout, consider matching it so the client never tries to reuse a connection the server has already closed, which surfaces as "connection reset by peer" errors.
- DisableKeepAlives: setting this to true disables HTTP keep-alive entirely. It might look like a way to prevent leaks, but it forces a new TCP connection for every single request, working directly against performance and resource efficiency. I strongly advise against it unless you have a very specific and unusual use case.
For our LLM services, particularly those interacting with diverse external APIs, I found that tuning MaxIdleConnsPerHost was particularly impactful. Some LLM providers might have slightly different server-side keep-alive behaviors, and having a robust pool per host ensured we weren't constantly re-establishing connections. This directly impacts the cost of LLM inference, as reducing latency can mean fewer concurrent requests and thus fewer instances needed. You can read more about how latency impacts LLM serving costs in my previous post: Optimizing LLM Inference on Cloud Run: Dynamic Batching for Cost and Latency.
Validating the Fix with Metrics
Once the code change was deployed, the difference was almost immediate and highly satisfying.
- Memory Usage: The sawtooth pattern quickly stabilized. The baseline memory usage dropped significantly (by about 30-40% on average) and remained flat, regardless of traffic spikes. This was the most direct confirmation of the leak being plugged.
- CPU Utilization: CPU usage returned to expected levels, only spiking during actual request processing and remaining low during idle periods.
- Latency: Average request latency decreased by about 10-15%, especially for consecutive requests to the same endpoint, thanks to connection reuse.
- Cloud Run Instances: The Cloud Run autoscaler breathed a sigh of relief. The number of active instances during peak hours returned to our historical baseline of 2-3, occasionally spiking to 4, but never approaching the alarming 5-7 instances we saw during the problem period. This directly translated to substantial cost savings.
The official Go documentation on the net/http package provides comprehensive details on these configurations, and it's always my first stop when debugging such issues. It's a goldmine for understanding the nuances of HTTP client behavior in Go.
What I Learned / The Challenge
This experience was a powerful reminder that even in modern serverless environments like Cloud Run, fundamental networking principles still apply. Debugging subtle resource leaks in an ephemeral, autoscaling environment can be tricky. You don't always have direct access to tools like netstat or lsof to examine open connections directly on a running instance. Instead, you rely heavily on aggregated metrics (memory, CPU, instance count, latency) and careful code review.
The challenge lies in the fact that Go's defaults are often "good enough," leading developers to overlook the underlying mechanisms. It's easy to forget that every http.Transport is a fully capable connection manager whose pool is only useful if it lives as long as the service does. While pprof can be invaluable for identifying goroutine and memory leaks, the symptoms of connection exhaustion can be masked as general performance degradation before they manifest as outright errors. My focus on cost metrics was key to catching this problem early, before it became a full-blown outage.
Related Reading
If you're wrestling with optimizing your LLM services on Cloud Run, these posts from my devlog might also be helpful:
- Optimizing LLM Inference on Cloud Run: Dynamic Batching for Cost and Latency: This deep dive explains how we tackled latency and cost by intelligently batching requests, a crucial aspect of efficient LLM serving that complements connection pooling.
- Optimizing Open-Source LLM Serving Costs on Cloud Run with Quantization and Speculative Decoding: This article explores techniques like quantization to reduce the memory footprint of self-hosted LLMs, another angle on resource optimization directly impacting Cloud Run costs.
Looking Ahead
This particular fix has brought significant stability and cost savings to our LLM orchestration services. However, the journey of optimization is never truly over. My next steps involve exploring more granular monitoring of HTTP client connection pools, potentially integrating OpenTelemetry to capture metrics like active connections, idle connections, and connection reuse rates directly from our application code. This would provide even earlier warning signs of potential resource contention. I'm also keen to evaluate the benefits of HTTP/2 multiplexing for our internal service-to-service communication, as it could further reduce overhead by allowing multiple requests over a single TCP connection, pushing our efficiency on Cloud Run even further.