Building a High-Performance LLM API Gateway with Go and Cloud Run

An LLM API Gateway centralizes access to multiple AI providers, enabling real-time token counting, budget enforcement, and unified authentication. By using Go and Cloud Run, developers can implement a high-performance proxy that prevents cost overruns and provides granular observability across all internal AI services.

Last month, I woke up to a PagerDuty alert at 2:00 AM that had nothing to do with server uptime and everything to do with my credit card. A junior developer on our team had accidentally pushed a test script with an unbounded loop that was hitting the OpenAI gpt-4o endpoint. By the time I killed the process, we had burned $432 in less than thirty minutes. It was a classic "shadow AI" disaster. We had no centralized visibility, no per-key quotas, and no way to kill a rogue session without rotating a global API key that would have broken production for everyone.

I realized then that letting every microservice manage its own LLM API keys was a recipe for bankruptcy. I needed a centralized LLM API Gateway. It had to be fast (sub-millisecond overhead), it had to support multiple providers (OpenAI, Anthropic, and Gemini), and it had to run on Google Cloud Platform (GCP) with minimal operational overhead. Most importantly, it needed to enforce hard budgets at the user and service level.

I decided to build this in Go. Why? Because Go’s net/http/httputil package provides the most robust primitives for building reverse proxies, and its concurrency model is perfect for handling thousands of streaming LLM connections without blowing up the memory footprint. In this post, I’ll walk you through the architecture, the code for streaming token counting, and how I optimized the costs on Cloud Run.

Why Building a Custom LLM API Gateway is Essential for Cost Control

Centralizing LLM access through a custom gateway prevents "shadow AI" and allows for unified budget enforcement across multiple providers. I looked at a few open-source LLM gateways before starting. Some were too heavy, requiring a sidecar for every pod. Others were built in Python, which is great for data science but lacks the raw performance and low-memory profile I wanted for a high-traffic gateway. I wanted a single binary that I could deploy to Cloud Run, scale to zero when not in use, and use to unify our internal API consumption.

My requirements for the LLM API Gateway were specific:

  • Provider Agnostic: A single interface for /v1/chat/completions that routes to OpenAI, Anthropic, or Vertex AI based on headers.
  • Identity-Aware: Use GCP Service Accounts and IAM to authenticate internal services instead of static API keys.
  • Real-time Budgeting: Count tokens in the stream and cut the connection if a service exceeds its daily quota.
  • Observability: Export every request's metadata (model, tokens, latency, cost) to BigQuery for analysis.

By centralizing this, we could also implement techniques like strategic prompt chaining at the gateway level, injecting system prompts or caching common queries before they even hit the provider.

How to Implement a High-Performance Reverse Proxy Using Go

Go's standard library provides the httputil.ReverseProxy primitive, which is the most efficient way to build a low-latency LLM API Gateway. The core of the gateway handles the heavy lifting of forwarding requests, but the magic happens in how we modify the request and response to handle authentication and logging.

Here is the foundational structure of my proxy handler:

func (g *Gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    provider := r.Header.Get("X-LLM-Provider")
    targetURL, err := g.getProviderURL(provider)
    if err != nil {
        http.Error(w, "Invalid provider", http.StatusBadRequest)
        return
    }

    proxy := httputil.NewSingleHostReverseProxy(targetURL)
    proxy.Transport = g.transport // shared http.Transport with pooled connections

    // Customize the director to inject API keys and rewrite paths
    originalDirector := proxy.Director
    proxy.Director = func(req *http.Request) {
        originalDirector(req)
        req.Host = targetURL.Host
        req.Header.Set("Authorization", "Bearer "+g.getSecret(provider))

        // Remove internal headers before sending to the provider
        req.Header.Del("X-Internal-Service-ID")
    }

    proxy.ServeHTTP(w, r)
}
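The getProviderURL lookup can be as simple as a map from header value to upstream base URL. Here is a minimal sketch as a free function; the provider names and URLs are assumptions about the routing table, not the exact production config:

```go
package main

import (
	"fmt"
	"net/url"
)

// Hypothetical routing table: X-LLM-Provider header value -> upstream base URL.
var providerURLs = map[string]string{
	"openai":    "https://api.openai.com",
	"anthropic": "https://api.anthropic.com",
	"vertex":    "https://us-central1-aiplatform.googleapis.com",
}

// getProviderURL resolves a provider name to its upstream URL, rejecting
// anything not in the allowlist.
func getProviderURL(provider string) (*url.URL, error) {
	base, ok := providerURLs[provider]
	if !ok {
		return nil, fmt.Errorf("unknown provider %q", provider)
	}
	return url.Parse(base)
}

func main() {
	u, err := getProviderURL("openai")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Host) // api.openai.com
}
```

Treating the map as an allowlist matters: an unrecognized header returns an error instead of proxying traffic to an arbitrary host.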

This looks simple, but the real challenge is the response. LLM responses are almost always streamed as Server-Sent Events (SSE). If you want to count tokens and calculate costs, you can't just wait for the response to finish; you have to intercept the stream as it passes through the proxy.

How to Intercept Streaming Responses for Real-Time Token Counting

Real-time token counting requires wrapping the response writer to intercept Server-Sent Events (SSE) and closing connections immediately when quotas are exceeded. To count tokens without adding latency, I used a custom ResponseWriter that captures chunks as they pass through the proxy. For OpenAI-compatible providers, I used the tiktoken-go library to estimate token counts on the fly.

The problem with streaming is that if you wait until the end to check the budget, the damage is already done. I implemented a "drain" check. Every 100 tokens, the gateway checks a Redis-backed counter. If the service has exceeded its daily limit, the gateway closes the TCP connection immediately.

type streamingUpdater struct {
    http.ResponseWriter
    serviceID  string
    tokenCount int
}

func (s *streamingUpdater) Write(b []byte) (int, error) {
    // Estimate the tokens in the 'content' blocks of this SSE chunk
    tokens := estimateTokens(b)
    s.tokenCount += tokens

    if s.tokenCount > getQuota(s.serviceID) {
        // Returning an error aborts ReverseProxy's copy loop, which
        // closes both the upstream and downstream connections
        return 0, errors.New("quota exceeded")
    }

    return s.ResponseWriter.Write(b)
}

// Flush must be forwarded, or the wrapper hides the underlying
// http.Flusher and SSE chunks get buffered instead of streamed
func (s *streamingUpdater) Flush() {
    if f, ok := s.ResponseWriter.(http.Flusher); ok {
        f.Flush()
    }
}

This approach ensured that even if a developer wrote an infinite loop, the gateway would kill the stream after a few cents of usage rather than hundreds of dollars. This is much more effective than the cost-optimization strategies I discussed in my previous post on function calling and tool use optimization, as it provides a hard safety net.
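The estimateTokens and getQuota helpers are elided above. Here is a rough, self-contained sketch of both: production uses tiktoken-go and a Redis counter, while this version substitutes a chars/4 heuristic and an in-memory daily counter so the logic is visible. All names and the four-chars-per-token ratio are approximations, not the real implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// estimateTokens is a crude stand-in for tiktoken: roughly one token per
// four bytes of streamed content. Imprecise, but fine for a kill switch.
func estimateTokens(b []byte) int {
	return len(b)/4 + 1
}

// quotaStore is an in-memory stand-in for the Redis-backed daily counter.
type quotaStore struct {
	mu     sync.Mutex
	used   map[string]int
	limits map[string]int
}

// Add records usage and reports whether the service is still under quota.
func (q *quotaStore) Add(serviceID string, tokens int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used[serviceID] += tokens
	return q.used[serviceID] <= q.limits[serviceID]
}

func main() {
	q := &quotaStore{
		used:   map[string]int{},
		limits: map[string]int{"svc-a": 100},
	}
	chunk := make([]byte, 200) // ~51 estimated tokens
	fmt.Println(q.Add("svc-a", estimateTokens(chunk))) // true
	fmt.Println(q.Add("svc-a", estimateTokens(chunk))) // false
}
```

In the Redis version, the increment-and-compare becomes a single INCRBY against a key that expires at midnight, so all gateway instances share one counter.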

How to Secure LLM API Keys Using GCP Secret Manager and IAM

Integrating with GCP Secret Manager and using OIDC tokens eliminates the need for static API keys and simplifies internal service authentication. Hardcoding API keys is a cardinal sin. Even putting them in environment variables is risky because they can show up in logs or UI consoles. I integrated the gateway directly with GCP Secret Manager. The gateway fetches the keys at startup and caches them in memory, refreshing them every hour.

I also moved away from internal API keys. Instead, I used the Authorization: Bearer $(gcloud auth print-identity-token) pattern. The gateway validates the Google OIDC token, extracts the service account email, and uses that as the serviceID for quota management. This means I don't have to manage a separate database of internal "Gateway Keys": any service with a valid GCP service account in our project can hit the gateway.

How to Reduce LLM API Gateway Latency to Sub-Millisecond Levels

Optimizing connection pooling and using fast JSON parsers can reduce LLM API Gateway overhead to negligible levels. When you sit in the middle of every LLM call, latency is your biggest enemy. I initially saw a 40ms overhead per request, which was unacceptable. After profiling the Go application using pprof, I found two bottlenecks:

  1. TLS Handshaking: The proxy was creating a new connection to OpenAI for every request. I fixed this by using a custom http.Transport with MaxIdleConnsPerHost set to 100.
  2. JSON Parsing: I was parsing the entire request body to log the prompt. Switching to buger/jsonparser allowed me to extract the "model" and "messages" fields without full unmarshaling, shaving off 12ms.

I also deployed the gateway on Cloud Run with "min-instances" set to 1. While scaling to zero is great for costs, the cold start of a Go binary plus the time to fetch secrets from Secret Manager resulted in a 3-second delay for the first request. Keeping one instance warm costs about $15/month but provides a much better experience for our engineers.
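For reference, the relevant Cloud Run flags look like this (the service name, project, region, and resource sizes are placeholders, not my exact deployment):

```shell
gcloud run deploy llm-gateway \
  --image gcr.io/my-project/llm-gateway:latest \
  --region us-central1 \
  --min-instances 1 \
  --max-instances 20 \
  --cpu 1 --memory 512Mi
```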

For more on how I handle scaling and batching in these environments, check out my guide on Optimizing LLM Inference on Cloud Run.

How to Implement Automatic Provider Failover for High Availability

A centralized LLM API Gateway provides a layer of abstraction that allows for seamless switching between models during provider outages. One unexpected benefit of the gateway was resilience. During a recent OpenAI outage, I was able to update a config map in our gateway to redirect all gpt-4o traffic to claude-3-5-sonnet (via a shim that converted the API format). The client applications didn't even know they were talking to a different provider. They just saw the same OpenAI-compatible response format. This kind of abstraction is invaluable for production reliability.
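The failover itself is just a config-driven alias map consulted before routing. A minimal sketch; the failoverMap name is illustrative, and the real version also pairs this with the shim that translates between the OpenAI and Anthropic wire formats:

```go
package main

import "fmt"

// failoverMap redirects traffic for a model to a substitute during an
// outage. In production this is loaded from a config map so it can be
// updated without a redeploy.
var failoverMap = map[string]string{
	"gpt-4o": "claude-3-5-sonnet",
}

// resolveModel returns the model that should actually serve the request.
func resolveModel(requested string) string {
	if sub, ok := failoverMap[requested]; ok {
		return sub
	}
	return requested
}

func main() {
	fmt.Println(resolveModel("gpt-4o"))      // claude-3-5-sonnet
	fmt.Println(resolveModel("gpt-4o-mini")) // gpt-4o-mini
}
```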

Key Takeaways for Building Scalable AI Infrastructure

Building a robust LLM API Gateway requires a focus on performance and security to ensure that AI adoption remains cost-effective and secure.

  • Standard Library First: Go's net/http and httputil are incredibly powerful. Don't reach for a complex framework (like Gin or Echo) for a proxy until you actually need the routing complexity. The overhead is rarely worth it for a gateway.
  • Streaming is Tricky: Intercepting SSE (Server-Sent Events) requires careful buffer management. If you're not careful, you can accidentally buffer the whole response in memory, defeating the purpose of streaming and potentially crashing your container.
  • Cost Visibility is the Best Feature: The most popular feature of the gateway wasn't the API, but the Grafana dashboard I built on top of the BigQuery logs. Seeing "Service A spent $50 today" changed how our teams approached prompt engineering.
  • IAM is Better than Keys: Using GCP's native identity for authentication removed a massive amount of secret-rotation toil.

Final Thoughts and Next Steps

Building this LLM API Gateway was a turning point for our AI infrastructure. We went from a "wild west" of API keys and unpredictable bills to a governed, observable system. My next step is to implement semantic caching at the gateway level using a vector database, which should allow us to skip the LLM entirely for 15-20% of redundant internal queries. I'll be documenting that process as soon as the benchmarks are in.
