Building a Low-Latency, Cost-Efficient RAG System for Production

It started, as many things do in AI development, with an exciting proof-of-concept. My initial RAG (Retrieval-Augmented Generation) system was a marvel – a few hundred lines of Python, a vector database, and an LLM API call. It worked. It answered questions. It felt like magic. Then came the day I pushed it to a pre-production environment, and the magic quickly turned into a horror show of spiraling costs and abysmal performance.

I distinctly remember the first load test. Even with a modest 5 QPS, my P99 latency shot past 10 seconds. The LLM API bills for just that short test run were already making my eyes water. It became painfully clear: a naive RAG implementation, while great for demonstration, is a production nightmare waiting to happen. My goal shifted from "make it work" to "make it fast and cheap." This is the story of how I tackled those challenges head-on, transforming a sluggish, expensive prototype into a production-ready RAG system.

The Initial Bottlenecks: A Tale of Two Latencies (and Costs)

My first RAG implementation was straightforward, almost embarrassingly so. For each incoming query:

  1. Generate embeddings for the user's query.
  2. Query the vector database with these embeddings to retrieve relevant documents.
  3. Construct a prompt with the retrieved documents and send it to an LLM.
  4. Return the LLM's response.
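Concretely, the naive pipeline looked something like this (a sketch: `embed_fn`, `vector_db`, and `llm` are stand-ins for the real clients, not my actual code):

```python
from typing import Dict, List


def answer_query_naive(query: str, embed_fn, vector_db, llm) -> str:
    """Fully sequential RAG pipeline: every step blocks the next."""
    # 1. Embed the query (one blocking API call)
    query_embedding: List[float] = embed_fn(query)
    # 2. Retrieve similar documents (blocking vector DB query)
    docs: List[Dict] = vector_db.similarity_search(query_embedding, top_k=5)
    # 3. Stuff every retrieved document into one prompt
    context = "\n\n".join(d["text"] for d in docs)
    prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    # 4. Blocking LLM call
    return llm.complete(prompt)
```
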

Each step was synchronous, blocking the next. It was a sequential pipeline, and every single millisecond added up. The two biggest culprits for both latency and cost were immediately apparent:

  1. Embedding Generation: Each query meant an API call to an embedding model. These calls, while individually fast, added up, especially with network overhead. More critically, they incurred per-token costs.
  2. LLM Inference: The final LLM call was the heaviest hitter. It had the highest latency, the most variable response time, and by far the highest per-token cost, particularly as the context window grew with retrieved documents.

My initial P99 latency looked something like this:

  • Embedding Generation: ~300-500ms (including network roundtrip)
  • Vector DB Query: ~100-200ms
  • LLM Inference: ~1500-3000ms (highly variable)
  • Total: ~1.9s - 3.7s (best case)

This was for a single query. When multiple requests hit the service concurrently, the queuing and resource contention pushed these numbers into unacceptable territory.

Optimizing Embedding Generation: Batching and Caching to the Rescue

The first low-hanging fruit was embedding generation. Making a separate API call for every single query was inefficient. Most embedding models support batching, allowing you to send multiple text inputs in a single API request and receive their embeddings in return. This dramatically reduces network overhead and often benefits from internal optimizations on the model provider's side.

Here's a simplified example of how I refactored my embedding call:


import os
import time
from typing import List

# Assume this is your actual embedding client, e.g., from OpenAI or Cohere
class EmbeddingClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        # Initialize your actual client here

    def get_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
        # Simulate an API call
        print(f"Making batched embedding call for {len(texts)} texts...")
        time.sleep(0.2 + len(texts) * 0.01) # Simulate network + processing
        return [[i * 0.01 for i in range(1536)] for _ in texts] # Dummy embeddings

    def get_embedding(self, text: str) -> List[float]:
        # Unwrap the single result from the batched call
        return self.get_embeddings_batch([text])[0]

embedding_client = EmbeddingClient(os.environ.get("EMBEDDING_API_KEY", ""))

# --- BEFORE: Synchronous, single-call embedding ---
def get_single_query_embedding(query: str):
    return embedding_client.get_embedding(query)

# --- AFTER: Batching embeddings for multiple concurrent queries ---
# In a real-world scenario, you'd have a queue and a background worker
# or an async framework to manage these batches.
# For demonstration, let's simulate a batch for multiple incoming requests.

def get_embeddings_for_queries_batched(queries: List[str]) -> List[List[float]]:
    # In a real system, you'd collect queries over a short window (e.g., 50ms)
    # or use a fixed-size batch. Here, we're just showing the API call.
    return embedding_client.get_embeddings_batch(queries)

# Example usage:
# queries_to_process = ["What is the capital of France?", "Who won the 2022 World Cup?"]
# batched_embeddings = get_embeddings_for_queries_batched(queries_to_process)
# print(f"Batched embeddings for {len(queries_to_process)} queries: {len(batched_embeddings)} results")
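The "collect queries over a short window" idea from the comment above can be sketched with asyncio futures. This is a simplified illustration, not my exact production code; `embed_batch_fn` stands in for an async batched embedding call:

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class MicroBatcher:
    """Collects embedding requests for a short window, then sends one batched call."""

    def __init__(self, embed_batch_fn: Callable[[List[str]], Awaitable[List[List[float]]]],
                 window_ms: float = 50.0, max_batch: int = 32):
        self.embed_batch_fn = embed_batch_fn
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self._pending: list = []  # (text, future) pairs waiting for the next batch
        self._flush_task: Optional[asyncio.Task] = None

    async def embed(self, text: str) -> List[float]:
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((text, fut))
        if len(self._pending) >= self.max_batch:
            await self._flush()  # batch is full: send immediately
        elif self._flush_task is None:
            self._flush_task = asyncio.create_task(self._delayed_flush())
        return await fut

    async def _delayed_flush(self) -> None:
        await asyncio.sleep(self.window)  # wait out the batching window
        self._flush_task = None
        await self._flush()

    async def _flush(self) -> None:
        if self._flush_task is not None:
            self._flush_task.cancel()  # batch went out early; stop the timer
            self._flush_task = None
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        embeddings = await self.embed_batch_fn([t for t, _ in batch])
        for (_, fut), emb in zip(batch, embeddings):
            fut.set_result(emb)
```

Each caller awaits its own future, so concurrent requests transparently share one upstream API call.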

Beyond batching, I implemented a simple in-memory cache for embedding results. Many user queries are repetitive, especially "head" queries. By hashing the query text and storing its embedding, I could avoid API calls entirely for frequently asked questions. This yielded significant cost savings and latency reduction for cached hits.
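A minimal version of that cache looks like this (an LRU sketch; a production version would also want a TTL and, across multiple instances, a shared store such as Redis):

```python
import hashlib
from collections import OrderedDict
from typing import List, Optional


class EmbeddingCache:
    """In-memory LRU cache keyed on a hash of the normalized query text."""

    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()

    @staticmethod
    def _key(text: str) -> str:
        # Light normalization so trivial variations hit the same entry
        return hashlib.sha256(text.strip().lower().encode()).hexdigest()

    def get(self, text: str) -> Optional[List[float]]:
        key = self._key(text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, text: str, embedding: List[float]) -> None:
        key = self._key(text)
        self._store[key] = embedding
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used


def get_embedding_cached(query: str, cache: EmbeddingCache, client) -> List[float]:
    cached = cache.get(query)
    if cached is not None:
        return cached  # cache hit: no API call, no cost
    embedding = client.get_embedding(query)
    cache.put(query, embedding)
    return embedding
```
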

For a deeper dive into controlling LLM API costs, which also applies directly to embedding costs, I highly recommend checking out my previous post: LLM API Cost Optimization: Reducing Context Window Expenses.

Optimizing Vector Database Retrieval: Indexing, Parallelization, and Metadata Filtering

The vector database query step also offered room for improvement. Initially, I used a basic `similarity_search` without much thought. As my document corpus grew, this became a bottleneck. The key optimizations here were:

  1. Proper Indexing: Ensuring my vector database (I'm using Pinecone, but this applies to others like Weaviate, Milvus, or Faiss) was using an efficient index type like HNSW (Hierarchical Navigable Small World) was crucial. This significantly speeds up approximate nearest neighbor (ANN) search.
  2. Parallelizing Embedding and Retrieval: For a single request, the embedding must finish before the vector DB query can begin, so those two steps are inherently sequential. But across concurrent requests the I/O can overlap: while one request waits on the embedding API, another can already be querying the vector database. And within a single request that involves multiple retrieval steps (e.g., hybrid search, or searching several document types), those lookups can run in parallel.
  3. Metadata Filtering: Often, you don't need to search the entire corpus. By adding metadata to my vectors (e.g., document source, creation date, topic), I could pre-filter the search space before the expensive vector similarity calculation. This drastically reduces the number of vectors the index needs to consider.
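As a sketch of the metadata-filtering idea (this uses Pinecone-style filter syntax; the helper and the `source`/`year` fields are illustrative, not my exact schema):

```python
from typing import Dict, List


def retrieve_with_filter(index, query_embedding: List[float],
                         source: str, min_year: int, top_k: int = 5) -> List[Dict]:
    """Restrict the ANN search to a metadata slice before vectors are scored."""
    response = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter={
            "source": {"$eq": source},   # only one document source
            "year": {"$gte": min_year},  # only sufficiently recent documents
        },
        include_metadata=True,
    )
    return [{"id": m["id"], "score": m["score"], **m.get("metadata", {})}
            for m in response["matches"]]
```
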

Here's a conceptual snippet illustrating parallelization using `asyncio` (assuming your vector DB client supports async operations):


import asyncio
import time
from typing import Dict, List

# Assume vector_db_client is an async client for your vector database
# and embedding_client.get_embedding_async is an async version.

async def retrieve_documents_async(query_embedding: List[float], top_k: int = 5) -> List[Dict]:
    # Simulate async vector DB query
    print("Async: Querying vector database...")
    await asyncio.sleep(0.08) # Simulate 80ms DB lookup
    return [{"id": f"doc_{i}", "text": f"Retrieved document {i} for query."} for i in range(top_k)]

async def process_rag_query_async(query: str):
    start_time = time.perf_counter()

    # Step 1: Generate embedding (async)
    embedding_task = asyncio.create_task(embedding_client.get_embedding_async(query))

    # While embedding is generating, maybe do other async setup or pre-processing
    # ...

    query_embedding = await embedding_task
    embedding_time = time.perf_counter()

    # Step 2: Retrieve documents (async)
    retrieved_docs = await retrieve_documents_async(query_embedding)
    retrieval_time = time.perf_counter()

    # Step 3: Construct prompt and call LLM (async)
    # (LLM call would also be async)
    # For now, simulate LLM call
    print("Async: Calling LLM with retrieved documents...")
    await asyncio.sleep(1.5) # Simulate 1.5s LLM call
    llm_response = f"LLM responded to '{query}' with docs: {[d['id'] for d in retrieved_docs]}"
    llm_time = time.perf_counter()

    print(f"Query: '{query}'")
    print(f"Embedding took: {embedding_time - start_time:.3f}s")
    print(f"Retrieval took: {retrieval_time - embedding_time:.3f}s")
    print(f"LLM took: {llm_time - retrieval_time:.3f}s")
    print(f"Total async RAG time: {llm_time - start_time:.3f}s")
    return llm_response

# To run this example:
# class AsyncEmbeddingClient(EmbeddingClient):
#    async def get_embedding_async(self, text: str) -> List[float]:
#        print("Async: Generating embedding...")
#        await asyncio.sleep(0.2) # Simulate async embedding call
#        return self.get_embedding(text) # Use sync for actual result for simplicity
#
# async_embedding_client = AsyncEmbeddingClient(os.environ.get("EMBEDDING_API_KEY"))
# embedding_client = async_embedding_client # Use this for the example
#
# if __name__ == "__main__":
#     asyncio.run(process_rag_query_async("Tell me about recent AI advancements."))

By leveraging `asyncio`, I could ensure that while one network request was in flight (e.g., getting embeddings), the Python interpreter wasn't just sitting idle. This allowed for more efficient use of resources and better throughput, especially in a concurrent environment like Cloud Run.

The LLM Inference Challenge: Prompt Engineering and Context Reduction

The LLM inference step remained the biggest latency and cost component. My initial approach was to simply dump all retrieved documents into the prompt. This often led to:

  1. Excessive Token Usage: Sending large contexts to the LLM meant higher costs.
  2. Suboptimal Responses: The LLM might get overwhelmed or distracted by irrelevant information, even if it was "retrieved."
  3. Latency Spikes: Longer prompts take longer to process, especially with larger models.

The solution wasn't to retrieve fewer documents initially, but to be smarter about what made it into the final LLM prompt. I implemented a few strategies:

  1. Re-ranking: After retrieving, say, 10-20 documents, I'd use a smaller, faster re-ranking model (or even a heuristic based on keyword overlap or document recency) to select the top 3-5 most relevant ones. This significantly reduced the context window for the main LLM.
  2. Summarization (Conditional): For very long retrieved documents, if the query was broad, I'd conditionally summarize the documents before passing them to the main LLM. This is a trade-off: it adds another LLM call (for summarization), but if the original document is extremely long, the cost and latency savings on the *main* LLM call can be substantial. I found this most useful for documents over ~1000 tokens.
  3. Prompt Compression: Techniques like LLM-based prompt compression, where a smaller LLM condenses the retrieved context, can also be effective, though this adds another LLM call.
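The keyword-overlap heuristic from point 1 can be as simple as the following (a sketch; a proper cross-encoder re-ranking model will do better, but this costs nothing):

```python
from typing import Dict, List


def rerank_by_keyword_overlap(query: str, docs: List[Dict], top_n: int = 3) -> List[Dict]:
    """Score each doc by the fraction of query terms it contains; keep the top_n."""
    query_terms = set(query.lower().split())

    def overlap(doc: Dict) -> float:
        doc_terms = set(doc["text"].lower().split())
        return len(query_terms & doc_terms) / max(len(query_terms), 1)

    return sorted(docs, key=overlap, reverse=True)[:top_n]
```

Retrieve 10-20 candidates from the vector DB, then pass only the survivors of this step into the main LLM prompt.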

My focus here was less on the retrieval itself and more on the *quality and conciseness* of the context passed to the final LLM. This is where my post on LLM API Cost Optimization: Reducing Context Window Expenses became incredibly relevant, as managing context is directly tied to both cost and latency.

Infrastructure and Concurrency: Scaling the RAG Service

Running this RAG service efficiently required a robust infrastructure. I chose Google Cloud Run for its serverless nature and auto-scaling capabilities. However, simply deploying the code wasn't enough; I needed to fine-tune its behavior for optimal performance and cost.

My initial Cloud Run configuration had a low concurrency setting, meaning each instance handled only a few requests at a time. This led to frequent cold starts and over-provisioning of instances, driving up costs. By increasing the concurrency (e.g., to 80-100 requests per instance) and ensuring my application code was truly asynchronous and non-blocking, I could serve many more requests with fewer instances. This significantly reduced my infrastructure bill.

Monitoring the P50, P90, and P99 latencies, along with instance count and CPU utilization, was critical. I used Cloud Monitoring to track these metrics rigorously. When I saw CPU utilization consistently low despite high instance counts, it was a clear sign that I wasn't utilizing my instances efficiently, often due to blocking I/O or low concurrency settings.

For an in-depth look at how I optimized my Cloud Run setup to save costs, you can read about my experience here: How I Halved My Cloud Run Bill: Auto-Scaling, Concurrency, and Request Optimization.

After implementing these changes, my P99 latency for the RAG service dropped from over 10 seconds to consistently under 2 seconds, even under higher load. My LLM API costs per query also saw a reduction of about 40-50% due to smarter context management and embedding batching.

What I Learned / The Challenge

The core challenge in building a production-ready RAG system isn't just making it "work," but making it work *efficiently* and *affordably*. It's a dance between latency and cost, often with conflicting requirements. I learned that:

  1. Naive implementations don't scale: The quickest path to a PoC is often the slowest and most expensive path to production.
  2. Every millisecond and token counts: Especially when integrating with external APIs, network latency and token counts are direct drivers of performance and cost.
  3. Asynchronous programming is non-negotiable: For I/O-bound tasks like API calls, `asyncio` (or similar patterns in other languages) is essential for maximizing throughput and reducing idle time.
  4. Context management is king: For LLM-based systems, intelligently managing the context window is the single biggest lever for cost and latency optimization.
  5. Iterative optimization is key: I didn't get it right the first time. It was a process of identifying bottlenecks, implementing a solution, measuring, and repeating.

Related Reading

Here are a couple of my previous posts that delve deeper into specific optimization strategies mentioned in this article:

  • LLM API Cost Optimization: Reducing Context Window Expenses
  • How I Halved My Cloud Run Bill: Auto-Scaling, Concurrency, and Request Optimization

Building this RAG system taught me invaluable lessons about the realities of deploying AI-powered applications. It’s not just about the fancy models; it's about the meticulous engineering required to make them practical, performant, and economical in the real world. My next focus is exploring more advanced retrieval strategies, like multi-stage retrieval and agentic workflows, but always with an eye on maintaining the hard-won gains in latency and cost efficiency. The journey to a truly optimized AI system is continuous, and I'm excited for the next set of challenges.
