Optimizing LLM Inference on Cloud Run: Dynamic Batching for Cost and Latency

It was a Monday morning, and my coffee hadn't quite kicked in when I saw the alert. Our LLM inference service, running on Google Cloud Run, had just breached its monthly cost threshold – and it was only the second week of the month. A quick glance at the dashboards confirmed my fears: not only were costs escalating rapidly, but P99 latency had also started to creep up, occasionally spiking to unacceptable levels. This wasn't just a financial hit; it was directly impacting user experience, particularly for our more complex RAG queries. I knew right then I had a significant problem on my hands, one that pointed squarely at how we were handling LLM inference at scale.

The Unexpected Cost Spike and Performance Degradation

Our initial setup for serving a fine-tuned open-source LLM on Cloud Run was straightforward. We containerized a simple FastAPI application that loaded our model and exposed an inference endpoint. Cloud Run's automatic scaling and serverless nature seemed like a perfect fit, allowing us to pay only for what we used. For a while, it worked beautifully. As our user base grew, however, and the number of concurrent requests increased, the cracks started to show. We were seeing a rapid increase in the number of Cloud Run instances, each consuming CPU and memory, often sitting idle for significant portions of time while waiting for a single, small request to process.

My first thought was a memory leak or an inefficient model loading process. I spent days poring over logs, profiling the application, and even trying different base images. While I found minor optimizations, nothing explained the dramatic cost increase and the concurrent latency issues. The CPU utilization on individual instances, when they were active, looked fine, but the *total* CPU utilization across all instances was astronomical, and many instances were frequently in a "cold" state, leading to those frustrating latency spikes.

Understanding the LLM Inference Bottleneck: The Need for Batching

The core problem, I realized, wasn't necessarily a bug in my code, but a fundamental mismatch between how LLMs are most efficiently served and how Cloud Run was handling our request patterns. LLMs, especially larger ones, are computationally intensive. Their true power and efficiency come when you can process multiple requests (or parts of requests) in parallel on the GPU or specialized hardware. This is known as **batching**. Processing a single token for one request often takes almost as much time and resource allocation as processing a batch of tokens for multiple requests, up to a certain batch size. Without effective batching, each request, no matter how small, essentially monopolizes an entire LLM inference pipeline, leading to:

  • **Underutilized hardware:** A powerful GPU or many CPU cores sit idle while processing a single prompt.
  • **Increased cold starts:** Cloud Run scales by spinning up new instances for concurrency. If each instance only handles one request at a time, you need many more instances, leading to frequent cold starts as demand fluctuates.
  • **Higher costs:** More instances, more CPU/memory consumption, more billing.
  • **Higher latency:** Each request waits for its turn, or for a new instance to spin up.

Our initial setup was essentially "batch size 1" inference. Every single incoming request was treated as an independent task, leading to the exact symptoms I was observing.
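To make batching concrete: decoder-only models can process several prompts in one forward pass if the tokenized inputs are padded to a common length (typically left-padded, with an attention mask marking which positions are real tokens). A minimal, library-free sketch of that padding step — the function name and `pad_id` value are illustrative, not part of our actual service:

```python
def pad_batch(token_ids_batch, pad_id=0):
    """Left-pad a batch of token-id lists to a common length.

    Returns the padded batch plus an attention mask (1 = real token,
    0 = padding), the rectangular shape a batched generate() call expects.
    """
    max_len = max(len(seq) for seq in token_ids_batch)
    padded, mask = [], []
    for seq in token_ids_batch:
        n_pad = max_len - len(seq)
        padded.append([pad_id] * n_pad + list(seq))
        mask.append([0] * n_pad + [1] * len(seq))
    return padded, mask

# Two prompts of different lengths become one rectangular batch
batch, attention_mask = pad_batch([[5, 6, 7], [9]])
```

Once inputs share a shape like this, the model runs one forward pass per generation step for the whole batch instead of one per request — which is exactly the efficiency we were leaving on the table.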

Experimenting with Batching Strategies on Cloud Run

My goal became clear: I needed to implement an effective batching strategy that could coalesce multiple incoming requests into a single, larger inference call to the LLM, without introducing unacceptable delays or complexities. This is where the real experimentation began.

1. The Naive Approach: No Batching (Our Starting Point)

As mentioned, this was our initial state. Each HTTP request directly triggered an LLM inference call.


# Simplified FastAPI endpoint (initial state)
# app, tokenizer, and model are created once at module load
@app.post("/generate")
async def generate_text(request: GenerationRequest):
    input_ids = tokenizer.encode(request.prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=request.max_new_tokens)
    # generate() returns a batch dimension, so decode the first sequence
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"text": generated_text}

This led to high instance counts, low per-instance utilization for actual inference, and the cost/latency issues.

2. Static Batching (Failed Attempt)

My first thought was to implement a simple queue. Requests would come in, get added to a queue, and a separate thread would periodically pick `N` requests from the queue and send them to the model.


# Pseudo-code for static batching
import asyncio
import time

request_queue = asyncio.Queue()
batch_size = 4       # fixed batch size
batch_timeout = 0.1  # seconds to wait for the batch to fill

async def inference_worker():
    while True:
        batch = []
        start_time = time.time()
        while len(batch) < batch_size:
            remaining = batch_timeout - (time.time() - start_time)
            if remaining <= 0:
                break
            try:
                request = await asyncio.wait_for(request_queue.get(), timeout=remaining)
                batch.append(request)
            except asyncio.TimeoutError:
                break

        if batch:
            # Run a single model.generate() over the whole batch, then
            # distribute results back to the original requestors
            # (e.g. by resolving a per-request asyncio.Future)
            ...

This approach quickly showed its flaws. A fixed `batch_size` and `batch_timeout` were difficult to tune. If traffic was low, requests would sit in the queue for the `batch_timeout`, increasing latency. If traffic was high, we'd still be limited by the `batch_size`, potentially creating a backlog if the model couldn't keep up. It was a step in the right direction but lacked the dynamism needed for fluctuating serverless workloads.
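Part of what the static version lacks is any notion of batch *cost*: four short prompts and four long ones are very different loads. A dynamic scheduler instead packs requests greedily against a token budget, closing a batch when the next request would exceed it. A simplified sketch of that packing logic — names and the budget value are illustrative, and real servers do this continuously rather than up front:

```python
def pack_by_token_budget(requests, max_batch_total_tokens):
    """Greedily group (request_id, token_count) pairs into batches
    whose total token count stays within the budget."""
    batches, current, current_tokens = [], [], 0
    for req_id, n_tokens in requests:
        # Close the current batch if this request would blow the budget
        if current and current_tokens + n_tokens > max_batch_total_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req_id)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# Short requests pack together; the 900-token request ends up on its own
batches = pack_by_token_budget([("a", 100), ("b", 200), ("c", 900), ("d", 200)], 1024)
```

Batch sizes now adapt to what's actually queued, which is the behavior I ultimately got for free from a specialized server rather than building myself.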

3. Dynamic Batching with a Specialized LLM Server

This is where I found the true solution. Instead of reinventing the wheel, I looked for existing solutions designed for efficient LLM serving. The Hugging Face `text-generation-inference` (TGI) service emerged as a powerful option. TGI is specifically built to handle LLM inference with features like dynamic batching, continuous batching, quantization, and optimized token streaming. It acts as a robust inference server that you can run in a container.

The key for me was TGI's dynamic batching capabilities. It intelligently groups incoming requests based on configurable parameters like `max_batch_total_tokens`, `max_batch_prefill_tokens`, and `waiting_served_ratio`. This means it tries to fill up batches as much as possible within those limits, prioritizing throughput while keeping an eye on latency for individual requests.
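With TGI fronting the model, clients stop calling the model code directly and instead POST to TGI's `/generate` route; the batching happens behind that single endpoint. A sketch of the request shape — the service URL in the comment is a placeholder, not our real endpoint:

```python
import json

def build_generate_payload(prompt, max_new_tokens=128):
    """Build the JSON body for TGI's POST /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }

payload = build_generate_payload("Summarize: ...", max_new_tokens=64)
body = json.dumps(payload)
# e.g. requests.post("https://my-llm-service.example.run.app/generate",
#                    json=payload,
#                    headers={"Authorization": "Bearer <identity-token>"})
```

Nothing on the client side hints at batching at all — each caller sends one prompt and gets one completion, and TGI coalesces whatever is in flight.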

My new `Dockerfile` for Cloud Run looked something like this (simplified):


# Dockerfile for TGI on Cloud Run
# Pin a specific TGI release (or use a later version)
FROM ghcr.io/huggingface/text-generation-inference:2.0.0

# Set environment variables for the model and batching
ENV MODEL_ID=my/fine-tuned-llama-model
ENV HUGGING_FACE_HUB_TOKEN=hf_YOUR_TOKEN
# Max tokens across all requests in a single batch
ENV MAX_BATCH_TOTAL_TOKENS=4096
# Max tokens for initial prompt (prefill) processing
ENV MAX_BATCH_PREFILL_TOKENS=2048
# Cache model weights between restarts
ENV HUGGINGFACE_HUB_CACHE=/data

# Expose the TGI port
EXPOSE 80

# The TGI launcher entrypoint handles everything
ENTRYPOINT ["text-generation-launcher"]

To deploy this on Cloud Run, I used `gcloud` commands, paying close attention to resource allocation and concurrency settings:


# gcloud command for deploying TGI
gcloud run deploy my-llm-service \
    --image gcr.io/my-project-id/my-llm-tgi-image:latest \
    --platform managed \
    --region us-central1 \
    --memory 16Gi \
    --cpu 8 \
    --timeout 300 \
    --port 80 \
    --no-allow-unauthenticated \
    --min-instances 1 \
    --max-instances 5 \
    --concurrency 10 \
    --set-env-vars MODEL_ID=my/fine-tuned-llama-model,MAX_BATCH_TOTAL_TOKENS=4096,MAX_BATCH_PREFILL_TOKENS=2048

Notice the `concurrency 10` setting. This is crucial. Cloud Run's concurrency setting dictates how many requests a single instance can handle simultaneously. With TGI's dynamic batching, a single TGI instance can efficiently handle multiple concurrent HTTP requests by batching them internally. I found that a concurrency of 8-12 worked well for my specific model and traffic patterns, allowing the TGI server to build robust batches without individual requests waiting too long. Setting `min-instances 1` ensures we always have a warm instance ready, mitigating cold starts for the very first request after idle periods.
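One thing that made the 8-12 sweet spot cheap to find: concurrency can be changed on a live service without rebuilding or redeploying the image. For example, to try a concurrency of 8 on the service deployed above:

```shell
# Adjust per-instance concurrency in place, without a rebuild
gcloud run services update my-llm-service \
    --region us-central1 \
    --concurrency 8
```

Each change rolls out as a new revision, so you can watch the latency and instance-count dashboards for a while and then dial it up or down again.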

Results: A Dramatic Improvement

The impact was immediate and significant:

  • Cost Reduction: Our Cloud Run instance count dropped by over 60% during peak hours. This translated directly into a 45% reduction in overall monthly serving costs. We were finally utilizing our allocated CPU and memory much more efficiently.
  • Latency Improvement: P99 latency for inference fell by approximately 30-40%. Requests were no longer waiting for new instances to spin up; instead, they were quickly batched and processed by existing, warm TGI instances. Even P50 latency saw a noticeable improvement.
  • Throughput Increase: A single Cloud Run instance running TGI could now handle significantly more requests per second, maximizing the utility of our allocated resources.
  • Predictability: The service became much more stable and predictable under varying load, thanks to TGI's intelligent batching algorithms.

This experience really hammered home the importance of specialized tools for specialized tasks. While Cloud Run provides an excellent serverless platform, the nuances of LLM inference demand an intelligent server like TGI to truly unlock efficiency.

For more details on how we further optimized costs, you might be interested in my previous post: Optimizing Open-Source LLM Serving Costs on Cloud Run with Quantization and Speculative Decoding. This delves into techniques like quantization, which can complement dynamic batching by making the individual inference calls even faster and less resource-intensive, and speculative decoding, which can further accelerate token generation.

What I Learned / The Challenge

My biggest takeaway from this whole ordeal is that scaling LLMs isn't just about throwing more hardware at the problem or relying solely on the auto-scaling features of serverless platforms. It's about understanding the unique computational characteristics of these models and choosing the right tools and strategies to serve them efficiently. Dynamic batching is not a "nice-to-have"; it's a fundamental requirement for cost-effective and low-latency LLM inference, especially when dealing with variable and unpredictable request patterns typical of user-facing applications.

The challenge lies in finding the right balance for your specific workload. Tuning `MAX_BATCH_TOTAL_TOKENS`, `MAX_BATCH_PREFILL_TOKENS`, and Cloud Run's `concurrency` setting requires careful monitoring and iterative adjustments. Too aggressive batching can lead to increased latency for individual requests, while too conservative settings negate the benefits. It's a continuous optimization process, and what works for one model or traffic pattern might not work for another.
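When iterating on those knobs, I found it essential to compare like with like: the same percentile, computed the same way, before and after each change. A small helper for that — nearest-rank percentile over a list of latency samples; the names and sample values are illustrative:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of the data is less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Latency samples in milliseconds; note how one outlier dominates P99
latencies_ms = [120, 95, 110, 480, 105, 130, 99, 101, 115, 2500]
p50 = percentile(latencies_ms, 50)  # median
p99 = percentile(latencies_ms, 99)  # tail latency
```

P50 barely moves under most tuning changes; it's P99 that tells you when batching has become too aggressive for individual requests.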

Another crucial learning was the value of specialized inference servers. While I could have tried to implement dynamic batching logic myself, leveraging a battle-tested solution like TGI saved me immense development time and provided a much more robust and performant outcome than I could have achieved independently. It allowed me to focus on the higher-level application logic rather than the intricate details of LLM runtime optimization.

Looking ahead, my next steps involve exploring more advanced continuous batching techniques and potentially integrating dynamic prompt caching to further reduce redundant computations. The world of LLM inference optimization is constantly evolving, and staying on top of these advancements is key to maintaining a cost-effective and high-performing service. I'm also keen to experiment with different hardware accelerators available on Cloud Run and other serverless platforms, always with an eye on the cost-performance trade-off. There's always more to learn and more to optimize!
