My Journey Building a Production-Ready RAG System with Open-Source LLMs

I'm sharing my detailed journey on how I built a robust, production-ready Retrieval-Augmented Generation (RAG) system for the AutoBlogger project. I faced significant challenges with generic LLM outputs and hallucinations, which led me to integrate open-source Large Language Models with a custom knowledge base. This post covers my architectural choices and implementation details, including data chunking, embedding, vector database selection, and prompt engineering, all while being transparent about the difficulties and lessons learned. I provide code snippets and discuss how I balanced performance, cost, and maintainability to deliver contextually accurate blog posts for AutoBlogger.


When I was building the posting service for AutoBlogger, my blog automation bot, I hit a wall with the initial content generation attempts. My goal was to produce high-quality, technically accurate, and contextually relevant blog posts. Initially, I leaned on a few general-purpose commercial LLM APIs, thinking their vast training data would be enough. While they could certainly churn out text, the results were often... well, generic. They lacked the specific nuances of the AutoBlogger project, frequently hallucinated details about our internal tools or specific architectural choices, and required extensive manual editing to align with our brand voice and technical accuracy. This wasn't automation; it was glorified copy-pasting and editing, which defeated the entire purpose of AutoBlogger.

I needed a way to ground these powerful language models in our specific knowledge base – AutoBlogger's documentation, previous DevLogs, internal wikis, and even my own design documents. This led me down the path of Retrieval-Augmented Generation (RAG), and crucially, doing it with open-source LLMs to maintain control, manage costs, and ensure data privacy.

Why RAG was the Only Way Forward for AutoBlogger

The problem was clear: LLMs, by default, are generalists. They've seen a colossal amount of internet data, but they don't inherently know the specifics of my proprietary project. Asking an LLM, "How does AutoBlogger handle asynchronous task queuing?" would likely result in a generic explanation of task queues, or worse, a confident but incorrect answer about a system we don't even use. RAG directly addresses this by providing the LLM with relevant, up-to-date information at inference time.

My core motivation for RAG was:

  • Accuracy: Eliminate hallucinations about AutoBlogger's internal workings.
  • Relevance: Ensure generated content is specific to our project's context.
  • Freshness: Allow the LLM to access information beyond its training cutoff date.
  • Control: Have direct influence over the knowledge base the LLM uses.

Choosing Open-Source: Cost, Control, and Customization

The decision to go open-source for the LLM itself was multifaceted. While commercial APIs offer convenience, they come with recurring costs that can quickly scale with usage. For a project like AutoBlogger, generating potentially hundreds of articles, these costs would become prohibitive. Beyond cost, I valued:

  • Data Privacy: Keeping our proprietary documentation away from third-party APIs.
  • Fine-tuning Potential: The long-term vision includes fine-tuning an LLM on AutoBlogger's specific writing style and technical jargon. This is far more feasible with self-hosted open-source models.
  • Architectural Flexibility: Integrating the LLM directly into my MLOps platform (which I've written about previously in How I Built a Low-Code MLOps Platform for My Small Team's AutoBlogger AI Projects) gave me complete control over the inference stack.

I experimented with a few models. Initially, I looked at Llama 2 7B and 13B for their strong community support and performance. However, for the initial production deployment, I settled on Mistral 7B Instruct v0.2 and later explored Mixtral 8x7B Instruct. Mistral 7B offers an excellent balance of quality and inference speed on consumer-grade GPUs, making it ideal for my initial deployment budget. Mixtral provided a significant boost in quality for more complex tasks, though it naturally required more substantial compute resources.

The AutoBlogger RAG Architecture: A Deep Dive

My RAG pipeline for AutoBlogger consists of several key components, orchestrated to provide the LLM with the most relevant context:

1. Data Ingestion and Indexing

This is where AutoBlogger's brain truly starts to form. I gathered all pertinent project information: markdown files from our documentation, internal blog posts (like these DevLogs!), code comments, and even structured data exported from our project management tools. This raw data needed to be processed into a searchable format.

The first step was document loading. I used libraries like LlamaIndex and LangChain for their robust document loaders, handling various formats. Once loaded, the documents needed to be broken down into smaller, manageable chunks. This is a critical step because feeding an entire multi-page document to the LLM isn't practical due to context window limitations and can dilute the relevance of specific information.

My chunking strategy evolved. Initially, I simply split by paragraphs or fixed character counts. This often led to fragmented context. I found more success with a recursive character text splitter, prioritizing splitting by larger delimiters (like \n\n) first, then falling back to smaller ones (\n, ., ,). I also experimented with overlap between chunks to ensure continuity, typically around 10-15% of the chunk size. For AutoBlogger, a chunk size of 500 characters with 50 characters overlap proved to be a good starting point for technical documentation.


from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

# Example document content (imagine this is loaded from a markdown file)
document_content = """
# AutoBlogger Architecture Overview

AutoBlogger is designed with a microservices approach. The core components include the Posting Service, Content Generation Engine, and the MLOps Platform.

## Posting Service
The Posting Service is responsible for scheduling, publishing, and managing blog posts across various platforms. It integrates with external APIs for social media distribution and SEO analysis. Asynchronous task queuing is handled by Redis Queue (RQ) for reliability and scalability.

### Redis Queue (RQ) Implementation
RQ is used to manage background tasks such as image optimization, post scheduling, and external API calls. Workers poll the RQ instance for new jobs, processing them in parallel.
"""

chunks = text_splitter.create_documents([document_content])
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk.page_content)
    print("\n")
    

Once chunked, each chunk needed to be converted into a numerical representation called an embedding. These embeddings capture the semantic meaning of the text, allowing us to perform similarity searches. I initially used `all-MiniLM-L6-v2` for its speed and efficiency, but later upgraded to `BAAI/bge-small-en-v1.5` for improved semantic search quality, which directly impacts retrieval accuracy. These are sentence-transformer models, easily loaded and used.


from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

def get_embedding(text):
    return model.encode(text, normalize_embeddings=True).tolist()

# Example: Get embedding for a chunk
example_chunk_text = chunks[0].page_content
embedding = get_embedding(example_chunk_text)
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 dimensions of embedding: {embedding[:5]}")
    

2. Vector Database: The Knowledge Store

The embeddings, along with their original text content, are stored in a vector database. This specialized database is optimized for storing and querying high-dimensional vectors, enabling fast similarity searches. For AutoBlogger, after evaluating several options, I chose Qdrant. My reasons were:

  • Open-source and self-hostable: Aligned with my open-source LLM strategy.
  • Performance: Excellent for similarity search at scale.
  • Filtering capabilities: Qdrant allows for payload filtering alongside vector search, which is crucial for more advanced RAG scenarios (e.g., filtering documents by author, date, or specific tags).
  • Ease of deployment: Docker images made it straightforward to integrate into my MLOps platform.

Here's a simplified look at how I'd add documents and query Qdrant:


from qdrant_client import QdrantClient, models
import uuid

# Initialize Qdrant client (assuming it's running locally or at a specific URL)
client = QdrantClient(host="localhost", port=6333) # Or your specific Qdrant endpoint

collection_name = "autoblogger_docs"
vector_size = model.get_sentence_embedding_dimension() # Get dimension from embedding model

# Create the collection if it doesn't already exist
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=vector_size, distance=models.Distance.COSINE),
    )

# Add chunks to Qdrant
points = []
for i, chunk in enumerate(chunks):
    points.append(
        models.PointStruct(
            id=str(uuid.uuid4()), # Unique ID for each point
            vector=get_embedding(chunk.page_content),
            payload={"text": chunk.page_content, "source": "autoblogger_docs_v1"} # Store original text and metadata
        )
    )

if points:
    client.upsert(
        collection_name=collection_name,
        wait=True,
        points=points
    )
    print(f"Upserted {len(points)} points to Qdrant.")

# Example query
query_text = "How does AutoBlogger handle background jobs?"
query_embedding = get_embedding(query_text)

search_result = client.search(
    collection_name=collection_name,
    query_vector=query_embedding,
    limit=3 # Retrieve top 3 most relevant chunks
)

print("\n--- Qdrant Search Results ---")
for hit in search_result:
    print(f"Score: {hit.score:.2f}")
    print(f"Content: {hit.payload['text']}\n")
    

3. Retrieval and Generation

When a request comes in to generate a blog post or answer a specific question for AutoBlogger, the process unfolds:

  1. The user query (e.g., "Write a DevLog about our Redis Queue implementation") is embedded using the same `BAAI/bge-small-en-v1.5` model.
  2. This query embedding is sent to Qdrant to retrieve the top k most semantically similar document chunks.
  3. These retrieved chunks (the "context") are then combined with the original query into a carefully crafted prompt for the open-source LLM (Mistral 7B Instruct or Mixtral 8x7B Instruct).
  4. The LLM generates the final response, grounded by the provided context.

My prompt engineering here was crucial. I needed to instruct the LLM to use only the provided context and to refuse to answer if the context didn't contain enough information. This significantly reduced hallucination.


def generate_response_with_rag(query, retrieved_context_chunks, llm_api_endpoint):
    context_str = "\n\n".join([chunk.payload['text'] for chunk in retrieved_context_chunks])

    # This is a simplified example. In reality, I use an API call to my self-hosted LLM.
    # For local testing, you might use the transformers library.

    system_prompt = """You are an expert technical writer for the AutoBlogger project.
    Your task is to answer the user's question or generate content based ONLY on the provided context.
    Do NOT use any outside knowledge. If the context does not contain enough information to answer,
    state that you cannot answer based on the provided information.
    Be concise, technical, and accurate.
    """

    full_prompt = f"{system_prompt}\n\nContext:\n{context_str}\n\nUser Question: {query}\n\nResponse:"

    # Simulate LLM call (in production, this would be an actual API call)
    # Using a placeholder for actual LLM inference for brevity and focus
    print("\n--- Sending to LLM with Context ---")
    print(f"Prompt sent to LLM (truncated):\n{full_prompt[:1000]}...")
    
    # In a real scenario, you'd call your LLM endpoint:
    # response = requests.post(llm_api_endpoint, json={"prompt": full_prompt}).json()
    # generated_text = response['generated_text']
    
    generated_text = "This is a simulated response from the LLM, incorporating the retrieved context about AutoBlogger's Redis Queue implementation."
    
    return generated_text

# Simulate an LLM API endpoint (replace with your actual self-hosted LLM)
llm_endpoint = "http://localhost:8000/v1/completions" # e.g., an Ollama endpoint or vLLM

# Generate response
final_response = generate_response_with_rag(query_text, search_result, llm_endpoint)
print(f"\n--- Final Generated Response ---")
print(final_response)
    

What I Learned: The Gritty Challenges of Production RAG

Building this wasn't just a straightforward plug-and-play. I hit several roadblocks that forced me to iterate and optimize:

Chunking Strategy is Everything

I cannot overstate this: the quality of your retrieved chunks directly impacts the quality of your LLM's output. Too small, and context is lost. Too large, and irrelevant information dilutes the signal, or you hit context window limits. Finding the "goldilocks zone" for chunk size and overlap was an iterative nightmare. For technical documentation, I found that focusing on logical sections (headings, paragraphs) rather than arbitrary character counts yielded far better results. I also experimented with "parent document retrieval," where smaller chunks are retrieved, but then a larger parent document is provided to the LLM for richer context, though I didn't fully implement this for the initial AutoBlogger RAG release due to complexity.
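To make "splitting by logical sections" concrete, here's a minimal, dependency-free sketch of heading-aware chunking for markdown. It's an illustration of the idea rather than my production splitter (in practice you'd likely reach for something like LangChain's MarkdownHeaderTextSplitter), but it shows the key property: each chunk carries its heading as metadata, so retrieved context maps back to a logical section.

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document into chunks at heading boundaries.

    Each chunk keeps its heading as metadata so retrieval results
    can be traced back to a logical section of the source document.
    """
    chunks = []
    current_heading = ""
    current_lines = []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Flush the previous section before starting a new one
            if current_lines:
                chunks.append({"heading": current_heading,
                               "text": "\n".join(current_lines).strip()})
            current_heading = line.lstrip("#").strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({"heading": current_heading,
                       "text": "\n".join(current_lines).strip()})
    # Drop sections that contain a heading but no body text
    return [c for c in chunks if c["text"]]

doc = "# Overview\nIntro text.\n## Posting Service\nHandles publishing.\n"
for c in split_by_headings(doc):
    print(c["heading"], "->", c["text"])
```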

Embedding Model Selection Matters

Switching from `all-MiniLM-L6-v2` to `BAAI/bge-small-en-v1.5` for embeddings was a noticeable upgrade. While MiniLM is fast, BGE-small offered superior semantic understanding, leading to more accurate retrievals. This directly translated to fewer instances of the LLM receiving irrelevant context, which in turn reduced "garbage in, garbage out" scenarios. It's a trade-off; better models often mean slightly larger embeddings or slower inference, but the quality gain for AutoBlogger was worth it.

Vector Database Performance and Scaling

Initially, I underestimated the performance requirements of the vector database. As AutoBlogger's documentation grew, the indexing time and query latency became concerns. Qdrant performed admirably, but I still had to optimize. This involved ensuring proper indexing settings (e.g., HNSW parameters), choosing the right hardware for the Qdrant instance, and batching upserts. Monitoring the Qdrant instance closely was key to identifying bottlenecks. For more on the deployment aspects, my post on How I Built a Low-Code MLOps Platform for My Small Team's AutoBlogger AI Projects dives into the infrastructure challenges I faced.

Cost vs. Performance with Open-Source LLMs

While open-source LLMs save on API costs, they introduce infrastructure costs. Running Mistral 7B on a GPU instance (e.g., a T4 on Google Cloud or an A10 on AWS) is manageable for development and moderate production loads. However, scaling to Mixtral 8x7B or larger models requires more substantial GPUs (like A100s) which dramatically increase hosting expenses. I had to carefully balance the quality of the LLM output with the cost of inference. For AutoBlogger, I implemented a tiered system: simpler content generation tasks use Mistral 7B, while more complex, high-value articles might leverage a Mixtral 8x7B instance, spun up on demand. This dynamic scaling is critical for cost efficiency.
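The tiered routing itself is mechanically simple. Here's a sketch of the idea; the endpoint URLs and the complexity heuristic are hypothetical placeholders, not AutoBlogger's actual routing logic:

```python
# Hypothetical endpoints for the two self-hosted models
ENDPOINTS = {
    "mistral-7b": "http://localhost:8000/v1/completions",
    "mixtral-8x7b": "http://localhost:8001/v1/completions",
}

def pick_model(task_type: str, target_word_count: int) -> str:
    """Route cheap tasks to the small model, high-value ones to the big model.

    The thresholds here are illustrative; in practice they'd be tuned
    against output-quality evaluations and GPU cost.
    """
    if task_type in {"summary", "social_snippet"}:
        return "mistral-7b"
    if task_type == "deep_dive" or target_word_count > 1500:
        return "mixtral-8x7b"
    return "mistral-7b"

print(pick_model("social_snippet", 100))
print(pick_model("deep_dive", 2500))
```

The important part is operational rather than algorithmic: the Mixtral endpoint only exists while a job that needs it is running, which is where the on-demand GPU provisioning pays off.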

Hallucination Mitigation is an Ongoing Battle

Even with RAG, LLMs can still "hallucinate" or generate plausible-sounding but incorrect information. My strict system prompt ("ONLY on the provided context") helps, but it's not foolproof. I found that adding a "confidence score" or "source citation" mechanism (where the LLM attempts to cite which retrieved chunk it used) can help in debugging and validating outputs. This is a feature I'm still actively refining for AutoBlogger.
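One lightweight way to get the citation behavior described above is to number the retrieved chunks in the prompt and instruct the model to cite them. This sketch covers only the prompt-building side; validating the citations the model actually emits is the part still under refinement, and the function name is my own, not from a library:

```python
def build_cited_prompt(query: str, context_chunks: list[str]) -> str:
    """Number each retrieved chunk so the LLM can cite sources as [1], [2], ...

    The citation instruction makes post-hoc validation possible: any claim
    the model makes without a bracketed citation is a candidate hallucination.
    """
    numbered = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer ONLY from the numbered context below. After each factual claim, "
        "cite the chunk it came from as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\n\nAnswer:"
    )

prompt = build_cited_prompt(
    "How are background jobs handled?",
    ["RQ manages background tasks.", "Workers poll RQ for new jobs."],
)
print(prompt)
```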

Related Reading

If you're interested in the broader context of AI and how these components fit into the larger picture, I recommend checking out my post: Artificial Intelligence Explained: Concepts, Applications, Future. It provides a good foundation for understanding the underlying principles that make RAG systems so powerful.

For those curious about the nuts and bolts of deploying and managing the infrastructure behind these AI projects, my DevLog: How I Built a Low-Code MLOps Platform for My Small Team's AutoBlogger AI Projects, delves into the specific challenges and solutions I implemented for orchestrating GPU instances, containerization, and automated deployments – all essential for getting this RAG system into production.

My Takeaway and Next Steps

Building a production-ready RAG system with open-source LLMs for AutoBlogger has been a challenging but incredibly rewarding experience. It's transformed AutoBlogger from a generic content generator into a truly knowledgeable assistant, capable of producing highly accurate and relevant technical content. My key takeaway is that context is king. No matter how powerful the LLM, grounding it in specific, trusted information is paramount for real-world applications.

Next, I plan to integrate more sophisticated evaluation metrics for the RAG pipeline, focusing on retrieval precision and recall, as well as the factual consistency of the LLM's output. I also want to explore multi-modal RAG, allowing AutoBlogger to reference images and diagrams in its knowledge base. Finally, I'm looking into dynamic chunking strategies that adapt to document structure more intelligently, potentially using LLMs themselves to determine optimal chunk boundaries. The journey to truly autonomous and intelligent content creation for AutoBlogger is far from over, but we've made a significant leap forward.

--- 📝 **Editor's Note:** Parts of this content were assisted by AI tools as part of the **AutoBlogger** automation experiment. However, the experiences and code shared are based on real development challenges.
