Preventing LLM API Rate Limiting: Concurrency Control for Production Workloads

I still remember the day the alerts started screaming. It was a Tuesday morning, and our LLM-powered content generation service, a core component of our project, was suddenly flooded with 429 Too Many Requests errors. Simultaneously, our cloud provider's billing dashboard began to tell a grim story: an unprecedented spike in API egress costs. It was a classic production failure, a performance anomaly, and a cost spike all rolled into one chaotic mess. My heart sank. What had happened?

For context, our service leverages large language models (LLMs) to generate various content pieces. Each user request might involve several sequential or parallel calls to different LLM providers, sometimes even to the same provider for different sub-tasks (e.g., summarization, keyword extraction, full article generation). We had designed it to be highly scalable, using serverless functions to handle incoming traffic. The problem, as I quickly discovered, wasn't a lack of scaling on our part, but rather an uncontrolled enthusiasm for making API calls to our downstream LLM providers.

The Genesis of a Costly Problem: Unfettered Concurrency

Our initial architecture for processing a user's content generation request was deceptively simple. An incoming HTTP request would trigger a Go-based serverless function. Inside this function, a fan-out pattern was used to parallelize several LLM calls. For instance, generating a blog post might involve:

  1. Generating an outline (1 LLM call).
  2. For each section in the outline, generating content (N LLM calls, potentially in parallel).
  3. Summarizing the entire post (1 LLM call).
  4. Extracting keywords (1 LLM call).

In a naive implementation, steps 2, 3, and 4 might all fire off concurrently if their inputs were ready. This worked beautifully in development and with low user traffic. The issue arose when we experienced a surge in user activity. Suddenly, hundreds, then thousands, of these serverless function instances were spinning up, each making multiple concurrent LLM API calls. The cumulative effect was a massive, uncoordinated hammering of the LLM providers' endpoints.
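To make the failure mode concrete, here's a minimal sketch of that naive fan-out pattern. This is illustrative, not our production code: `callLLM` is a hypothetical stand-in for the real provider client, and the point is simply that nothing bounds how many calls are in flight at once.

```go
package main

import (
	"fmt"
	"sync"
)

// callLLM is a stand-in for a real provider call; in production each of
// these was an HTTP request to an LLM API.
func callLLM(prompt string) string {
	return "response to: " + prompt
}

// generatePost fires off content generation for every section at once --
// one goroutine per section, with no limit on concurrent in-flight calls.
func generatePost(sections []string) []string {
	results := make([]string, len(sections))
	var wg sync.WaitGroup
	for i, s := range sections {
		wg.Add(1)
		go func(i int, s string) { // unbounded: N sections -> N parallel calls
			defer wg.Done()
			results[i] = callLLM(s)
		}(i, s)
	}
	wg.Wait()
	return results
}

func main() {
	out := generatePost([]string{"intro", "body", "conclusion"})
	fmt.Println(len(out), out[0])
}
```

Multiply this per-invocation fan-out by hundreds of concurrent serverless instances and the aggregate call rate becomes completely uncoordinated.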

Symptoms: 429s, Latency, and Bill Shock

The symptoms were textbook:

  • 429 Too Many Requests Errors: Our logs were full of these. LLM providers, like OpenAI and Anthropic, implement rate limits to protect their infrastructure and ensure fair usage. These limits are typically measured in requests per minute (RPM) and tokens per minute (TPM). When our functions collectively exceeded these, the API would respond with a 429 HTTP status code.
  • Increased Latency: While our serverless functions scaled up quickly, the upstream LLM API calls were being throttled. This meant our functions were spending a lot of time waiting for retries, leading to significantly increased end-to-end latency for users.
  • Cost Spikes: This was the most alarming. Many of our LLM API calls were wrapped in retry logic with exponential backoff. While essential for resilience, uncontrolled retries against a rate-limited API meant we were burning through compute cycles in our serverless functions just waiting and retrying, incurring unnecessary billing for execution time. Furthermore, some LLM providers charge for both input and output tokens. If a request failed early due to a rate limit but still consumed input tokens before the error, we were paying for incomplete work. This problem is particularly acute with LLM APIs where token counts, not just request counts, drive costs.

I quickly realized that while our serverless environment provided automatic scaling for *our* functions, it offered no inherent global coordination for calls to *external* APIs. Each function instance was an island, blissfully unaware of the API call storm its peers were generating. This is a common challenge in serverless architectures when interacting with external, rate-limited services.

Debugging the Deluge: Metrics and Logs

My first step was to dive into the observability stack. I focused on:

  1. LLM API Gateway Logs: We use an internal API gateway to route LLM requests, which logs response codes and latency. This immediately showed the surge in 429s and elevated upstream latencies.
  2. Serverless Function Metrics: I checked invocation counts, execution duration, and error rates for our content generation functions. Invocation counts were high, error rates were spiking, and average execution duration had crept up from ~5 seconds to over 30 seconds for successful requests (due to retries) and much higher for failed ones.
  3. Billing Dashboards: Confirming the cost spike was critical. I correlated the time of the 429 errors with the increased spend, validating the hypothesis that retries and prolonged function execution were contributing to the unexpected bill.

The evidence pointed to a clear culprit: the sheer volume of concurrent requests hitting the LLM providers. We needed a way to control this.

Implementing Concurrency Control: A Multi-Layered Approach

Solving this required a multi-layered approach, addressing both per-invocation and service-wide concurrency.

1. Per-Invocation Concurrency with Go Semaphores

Within a single serverless function instance, if multiple LLM calls can be made in parallel, we need to limit that parallelism. Go's concurrency primitives make this relatively straightforward. I opted for a counting semaphore using a buffered channel to limit the number of active goroutines making LLM API calls.

Here's a simplified example of how I refactored our LLM call logic within a Go serverless function:


package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"sync"
	"time"

	"golang.org/x/sync/semaphore" // Using the weighted semaphore for better context integration
)

// LLMClient represents a simplified LLM API client
type LLMClient struct {
	Name string
}

// Call simulates an LLM API call with a potential delay and error
func (c *LLMClient) Call(ctx context.Context, prompt string) (string, error) {
	select {
	case <-ctx.Done():
		return "", ctx.Err() // Context canceled or timed out
	default:
	}

	// Simulate network latency and processing time
	time.Sleep(time.Duration(500+rand.Intn(1000)) * time.Millisecond) // 0.5s to 1.5s latency

	// Simulate occasional rate limit errors
	if rand.Intn(10) == 0 { // 10% chance of a 429 error
		return "", fmt.Errorf("LLM_API_ERROR: 429 Too Many Requests from %s", c.Name)
	}

	return fmt.Sprintf("Response for '%s' from %s", prompt, c.Name), nil
}

// processTask simulates a task that requires multiple concurrent LLM calls
func processTask(ctx context.Context, client *LLMClient, prompts []string) ([]string, error) {
	// Define a semaphore to limit concurrent LLM calls within this function instance
	// Let's say we allow up to 5 concurrent calls to any LLM API from a single function instance
	const maxConcurrentLLMCalls = 5
	sem := semaphore.NewWeighted(maxConcurrentLLMCalls)

	var wg sync.WaitGroup
	results := make(chan struct {
		result string
		err    error
	}, len(prompts))

	for i, prompt := range prompts {
		select {
		case <-ctx.Done():
			log.Printf("Context canceled during prompt processing: %v", ctx.Err())
			return nil, ctx.Err()
		default:
		}

		wg.Add(1)
		go func(p string, idx int) {
			defer wg.Done()

			// Acquire a semaphore slot
			if err := sem.Acquire(ctx, 1); err != nil {
				log.Printf("Failed to acquire semaphore for prompt %d: %v", idx, err)
				results <- struct {
					result string
					err    error
				}{"", err}
				return
			}
			defer sem.Release(1)

			log.Printf("Making LLM call for prompt %d: '%s'", idx, p)
			res, err := client.Call(ctx, p)
			if err != nil {
				log.Printf("LLM call for prompt %d failed: %v", idx, err)
			} else {
				log.Printf("LLM call for prompt %d successful", idx)
			}
			results <- struct {
				result string
				err    error
			}{res, err}
		}(prompt, i)
	}

	wg.Wait()
	close(results)

	var finalResponses []string
	var firstErr error
	for r := range results {
		if r.err != nil && firstErr == nil {
			firstErr = r.err // Capture the first error
		}
		if r.result != "" {
			finalResponses = append(finalResponses, r.result)
		}
	}

	return finalResponses, firstErr
}

// main function simulating a serverless function invocation
func main() {
	// A context with a timeout for the entire function execution
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second) // Overall function timeout
	defer cancel()

	llmClient := &LLMClient{Name: "OpenAI-GPT4"}
	taskPrompts := []string{
		"Write a headline about AI.",
		"Generate a short paragraph on serverless functions.",
		"List three benefits of cloud computing.",
		"Draft an email subject line for a product launch.",
		"Summarize a 500-word article about blockchain.",
		"Create a compelling social media post.",
		"Suggest 5 keywords for an article on quantum physics.",
		"Explain the concept of 'cold start' in serverless.",
		"Write a short poem about technology.",
		"Brainstorm 3 blog post ideas for a SaaS company.",
	}

	log.Println("Starting content generation task...")
	responses, err := processTask(ctx, llmClient, taskPrompts)
	if err != nil {
		log.Fatalf("Task completed with error: %v", err)
	}

	log.Println("\n--- All LLM Calls Completed ---")
	for i, res := range responses {
		log.Printf("Response %d: %s", i+1, res)
	}
}

In this code, semaphore.NewWeighted(maxConcurrentLLMCalls) creates a semaphore that allows a maximum of 5 concurrent goroutines to proceed. Each goroutine calls sem.Acquire(ctx, 1) before making its LLM API call and sem.Release(1) afterward. If 5 goroutines are already active, the 6th will block until one of the active ones finishes and releases its "weight" (or context is canceled). This ensures that a single serverless function instance never overwhelms the LLM API with too many concurrent requests from its own execution.

2. Service-Wide Concurrency with a Distributed Queue

While per-invocation limits are crucial, they don't solve the problem of multiple serverless instances collectively exceeding global API rate limits. If 100 function instances each allow 5 concurrent LLM calls, that's still 500 potential concurrent calls hitting the upstream API. To truly tame the beast, we needed a service-wide throttling mechanism.

My solution involved introducing an asynchronous processing queue, specifically Amazon SQS (or Google Cloud Pub/Sub, Azure Service Bus, etc., depending on your cloud provider). Instead of directly invoking the content generation logic, the initial serverless function now acts as an entry point, validating the request and then pushing a message onto an SQS queue. A separate set of worker functions (or the same function, configured differently) would then consume from this queue.

Architecture diagram showing API Gateway -> Lambda (Enqueue) -> SQS Queue -> Lambda (Worker) -> LLM API
Asynchronous LLM Processing with SQS for Distributed Concurrency Control

The key here is how the SQS consumer Lambda functions are configured:

  • Limited Concurrency: I set a specific "Reserved Concurrency" limit on the Lambda function that consumes from the SQS queue. The arithmetic matters here: if the LLM provider allows 6,000 RPM (100 requests per second) and each of our worker functions takes ~1 second to process a single LLM call, then roughly 100 concurrent workers would saturate that budget, so I set the reserved concurrency to 80 to leave headroom. This ensures that even if there's a huge backlog in the SQS queue, we never spin up more than the allowed number of concurrent workers, thus respecting the upstream API's global rate limits.
  • Batch Processing: SQS consumer functions can process messages in batches. By configuring a small BatchSize (e.g., 1-5 messages) and a longer BatchWindow (e.g., up to 30 seconds), I could fine-tune the consumption rate. This helps smooth out spikes and prevents overwhelming the LLM API with bursts of requests.
  • Dead-Letter Queues (DLQs): For messages that repeatedly fail (e.g., due to persistent 429 errors from the LLM API or malformed data), I configured a DLQ. This allows me to inspect failed messages, understand root causes, and reprocess them if needed, preventing them from blocking the main queue.

This asynchronous pattern decouples the ingestion of user requests from the actual LLM processing, providing a crucial buffer and control point. If LLM APIs are slow or rate-limit us, user requests are simply queued, rather than failing immediately or causing our compute costs to skyrocket with retries. This strategy is highly effective for handling unpredictable traffic and is a common pattern for throttling external API calls in serverless environments.

Configuration Snippet (AWS Lambda Example)

Here's a conceptual snippet of how you might configure such a Lambda function (e.g., using Terraform):


resource "aws_lambda_function" "llm_worker" {
  function_name = "llm-content-worker"
  handler       = "bootstrap"
  runtime       = "provided.al2023" # Go now runs on the provided runtime; go1.x is deprecated
  memory_size   = 256
  timeout       = 60 # Max execution time per message, including retries within the function
  role          = aws_iam_role.llm_worker_role.arn

  # ... other function configuration ...

  # Crucial for global rate limiting: Reserved Concurrency
  # This limits the maximum number of concurrent instances of this Lambda function
  # Adjust based on upstream LLM API rate limits (e.g., RPM)
  reserved_concurrent_executions = 80 
}

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.llm_task_queue.arn
  function_name    = aws_lambda_function.llm_worker.arn

  # Batch size and window to control message consumption rate
  batch_size        = 5  # Process up to 5 messages per invocation
  maximum_batching_window_in_seconds = 10 # Wait up to 10 seconds to fill a batch
  
  # Alternatively, SQS event sources support a scaling_config block with
  # maximum_concurrency, which caps how many concurrent function instances
  # this event source invokes -- without reserving account-level concurrency
  # the way reserved_concurrent_executions does.
}

resource "aws_sqs_queue" "llm_task_queue" {
  name                       = "llm-content-tasks"
  message_retention_seconds  = 86400 # 1 day
  delay_seconds              = 0
  max_message_size           = 262144 # 256 KB
  receive_wait_time_seconds  = 0

  # Dead-Letter Queue configuration
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.llm_dlq.arn
    maxReceiveCount     = 5 # Max retries before sending to DLQ
  })
}

resource "aws_sqs_queue" "llm_dlq" {
  name = "llm-content-tasks-dlq"
}

By carefully tuning reserved_concurrent_executions on the Lambda function and the batch_size/maximum_batching_window_in_seconds on the event source mapping, I gained granular control over the rate at which requests were sent to the LLM APIs.
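Choosing these numbers doesn't have to be guesswork. By Little's law, sustainable concurrency is roughly the allowed request rate times the average call latency. A tiny helper makes the back-of-envelope calculation explicit (the figures below are illustrative assumptions, not any provider's actual limits):

```go
package main

import "fmt"

// maxSafeConcurrency estimates worker concurrency from a provider's
// requests-per-minute budget and the average seconds per call, per
// Little's law: concurrency = throughput x latency. headroom < 1.0
// keeps a safety margin below the hard limit.
func maxSafeConcurrency(rpmLimit int, avgCallSeconds float64, headroom float64) int {
	rps := float64(rpmLimit) / 60.0
	return int(rps * avgCallSeconds * headroom)
}

func main() {
	// 6,000 RPM budget, ~1s per call, 20% safety margin -> ~80 workers
	fmt.Println(maxSafeConcurrency(6000, 1.0, 0.8))
}
```

Re-run the estimate whenever average call latency shifts (e.g., after switching models), since the safe concurrency scales with it.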

Results: Stability, Reduced Errors, and Predictable Costs

The impact was immediate and significant:

  • Drastic Reduction in 429 Errors: The error rate plummeted. We still saw occasional 429s during peak bursts, but they were quickly mitigated by the queue and internal retries, rather than causing cascading failures.
  • Stable Latency: While individual LLM calls still had their inherent latency, the overall system became much more predictable. Users might experience a slight delay if the queue was backed up, but they wouldn't encounter outright failures or extremely long waits due to constant retries.
  • Cost Optimization: This was a huge win. By preventing unnecessary retries and ensuring our functions weren't idling for extended periods waiting on throttled APIs, our compute costs for the worker functions normalized. More importantly, we stopped paying for failed LLM API calls due to rate limits, leading to a significant reduction in our overall LLM API spend. This directly complements our ongoing efforts in Reducing LLM API Egress Costs: Data Transfer Optimization Strategies and Dynamic LLM Model Routing for API Cost Optimization.

What I Learned / The Challenge

The biggest lesson for me was that scaling serverless compute doesn't automatically mean scaling external API integrations. While serverless platforms handle *your* function's concurrency and scaling beautifully, they don't inherently manage the rate at which your *aggregate* functions interact with external, rate-limited services. This coordination is *your* responsibility as the architect and developer.

I also learned that LLM API rate limits are not just about requests per minute (RPM), but often also tokens per minute (TPM). A single, long prompt can hit TPM limits even if RPM is low. My initial focus was primarily on RPM, but I quickly adjusted to monitor and account for TPM as well, which sometimes required more nuanced throttling (e.g., using a weighted semaphore or a more sophisticated distributed rate limiter that tracks tokens).

The challenge was finding the right balance. Too aggressive with throttling, and I'd starve our LLM usage, leading to slower user experiences. Too lenient, and we'd be back to square one with 429s and high costs. It required continuous monitoring and iterative adjustments to the concurrency limits based on real-world usage patterns and provider-specific rate limits.

Related Reading

This journey into taming LLM API costs and performance regressions led me to explore several related optimization strategies:

  • Dynamic LLM Model Routing for API Cost Optimization: This post discusses how we dynamically route requests to different LLM models based on cost, latency, and capability. It's highly relevant because once you've controlled concurrency, the next step is to ensure you're using the most cost-effective model for each task, especially when dealing with varying token costs and rate limits across models.
  • Optimizing LLM Orchestration Costs with Serverless Functions: This piece dives deeper into general cost-saving strategies for serverless LLM workloads. The concurrency control discussed here is a foundational element for achieving those cost optimizations, as uncontrolled API calls will quickly negate any other efficiency gains.

Looking Forward

While the current solution has brought stability and predictability, I'm already eyeing the next set of improvements. I want to explore more dynamic, adaptive rate limiting. Instead of fixed concurrency limits, I envision a system that can intelligently adjust its throttling based on real-time feedback from LLM API responses (e.g., parsing Retry-After headers or observing actual latency/error rates). This would allow us to maximize throughput without manually tweaking configurations.

Another area of interest is incorporating a more sophisticated distributed rate limiter, perhaps using Redis or another distributed store, to apply token-aware rate limits across multiple services or even different API keys within our organization. This would provide even finer-grained control and prevent a single heavy user from impacting others, a critical feature as our platform grows. The goal is to build a self-optimizing system that gracefully handles external dependencies while keeping our costs in check and our users happy.
