How I Slashed My LLM Embedding API Bill with Local Inference

It was a Monday morning, and my coffee hadn't quite kicked in when I opened our monthly cloud bill. My eyes widened. Our LLM API costs had surged by nearly 180% in the last quarter. This wasn't just growth; this was an anomaly. We'd been diligently optimizing our LLM interactions, from prompt compression to dynamic batching, and even carefully comparing different providers' pricing. So, what was going on?

A deep dive into the usage metrics quickly pointed to the culprit: embedding generation. While the per-token cost of embeddings is often lower than generation, the sheer volume of text we were processing for search, recommendation, and RAG (Retrieval Augmented Generation) contexts meant the cumulative cost had become astronomical. Each document, each user query, each chunk of text for vector search – all of it was hitting an external API, token by token, and the bill was reflecting it.

My goal became clear: drastically reduce the cost of generating embeddings without compromising the quality of our vector search and retrieval systems. The obvious solution, which I'll detail in this post, was to offload this workload from expensive, external LLM APIs to a local, self-hosted embedding model.

The Hidden Cost of External Embeddings

For context, our primary use case for embeddings in AutoBlogger involves:

  • Indexing thousands of articles for semantic search.
  • Generating embeddings for user queries to find relevant content.
  • Creating vectors for content similarity recommendations.
  • Pre-processing documents for RAG pipelines feeding into our LLM generation tasks.

Initially, we relied on a popular cloud provider's embedding API. It was convenient, scalable, and integrated seamlessly with our existing LLM calls. The API offered robust models that produced high-quality embeddings, and frankly, it was a "set it and forget it" solution for a long time. The problem was, as our content library grew and user engagement scaled, the "forget it" part became a liability.

Looking at the breakdown, approximately 45% of our total LLM-related API costs were attributed solely to embedding generation. This was a shocking revelation, especially considering the relatively static nature of embeddings compared to the dynamic, context-dependent nature of LLM generation. While we had made strides in optimizing fine-tuning costs for our custom models, this embedding issue was a different beast altogether.

Here's a simplified view of our cost before the change (hypothetical numbers, but representative of the proportions):


Monthly LLM API Bill Breakdown (Before Optimization):

Total Bill: $1,200

- LLM Generation (Prompting, Completion): $660 (55%)
- LLM Embeddings (Text-to-Vector): $540 (45%)

This $540 for embeddings alone was a significant chunk, and it was growing. My mission was to shrink that $540 to near zero.
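The bill figures above are real proportions, but a quick back-of-envelope model makes it obvious why embedding volume dominates. The per-token rate and chunk sizes below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical back-of-envelope embedding cost model.
# The rate and volumes below are illustrative, not real pricing.

PRICE_PER_1K_TOKENS = 0.0001   # $ per 1K tokens (assumed embedding rate)
AVG_TOKENS_PER_CHUNK = 400     # typical document chunk after splitting

def monthly_embedding_cost(chunks_per_month: int) -> float:
    """Estimated monthly spend for embedding `chunks_per_month` text chunks."""
    total_tokens = chunks_per_month * AVG_TOKENS_PER_CHUNK
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# A seemingly tiny per-token price adds up fast at scale:
print(f"${monthly_embedding_cost(100_000):.2f}")     # modest indexing volume
print(f"${monthly_embedding_cost(13_500_000):.2f}")  # volume that lands near $540
```

The point of the exercise: the per-request price never looks scary; only the multiplication by monthly volume does.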

Choosing the Right Local Embedding Model

The first step was to identify suitable open-source embedding models that could run efficiently on our infrastructure. My criteria were:

  1. Performance: Must generate embeddings quickly to avoid latency bottlenecks.
  2. Quality: Embeddings needed to be semantically rich and perform well in our existing vector search benchmarks.
  3. Size: A smaller model footprint was preferable for easier deployment and lower memory consumption.
  4. Licensing: Open-source and permissive licensing was a must for our project.

After some research and experimentation, I narrowed down the choices to a few strong contenders from the Sentence-Transformers library, specifically focusing on models fine-tuned for general-purpose sentence embeddings. Models like all-MiniLM-L6-v2, all-MiniLM-L12-v2, and various E5 models (e.g., intfloat/e5-large-v2) stood out. While E5 models generally offer superior performance, their larger size (up to 1.1GB) and higher computational demands were a concern for our initial deployment strategy, which aimed for minimal resource overhead on existing VMs.

Ultimately, I chose all-MiniLM-L6-v2. It strikes an excellent balance: it's incredibly compact (around 90MB), fast, and provides surprisingly good semantic similarity performance for its size. For our initial proof-of-concept and immediate cost reduction, it was the perfect candidate. We could always upgrade later if performance metrics demanded it.
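"Semantic similarity performance" here boils down to cosine similarity between the vectors the model produces. For readers new to embeddings, here is the comparison at the heart of vector search, written out dependency-free; the toy 3-dimensional vectors are stand-ins for the model's actual 384-dimensional output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output:
query_vec = [0.1, 0.3, 0.5]
doc_vec   = [0.2, 0.6, 1.0]   # same direction as the query
other_vec = [1.0, 0.0, 0.0]   # mostly orthogonal to the query

print(round(cosine_similarity(query_vec, doc_vec), 4))    # 1.0
print(round(cosine_similarity(query_vec, other_vec), 4))  # 0.169
```

Whatever model generates the vectors, this is the score your vector database ranks by, which is why the benchmark comparison between models reduces to "how differently do they rank the same documents."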

Setting Up the Local Embedding Service

To integrate the local model seamlessly, I decided to wrap it in a lightweight FastAPI service. This allowed us to maintain an API-like interaction from our main application, minimizing code changes and providing a clean separation of concerns. The service would run on a dedicated, but modest, virtual machine instance (4 vCPUs, 16GB RAM) that was already part of our infrastructure, meaning no additional infrastructure costs initially.

The FastAPI Service (embedding_service.py):


# embedding_service.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import uvicorn
import logging
import os

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Local Embedding Service",
    description="Provides fast, local text embeddings using Sentence-Transformers.",
    version="1.0.0"
)

class EmbeddingRequest(BaseModel):
    texts: list[str]

class EmbeddingResponse(BaseModel):
    embeddings: list[list[float]]
    model: str
    dimension: int

# Global variable for the model
model = None
MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "all-MiniLM-L6-v2")

@app.on_event("startup")  # (FastAPI's lifespan API is the modern alternative)
async def startup_event():
    global model
    try:
        logger.info(f"Loading Sentence-Transformer model: {MODEL_NAME}...")
        model = SentenceTransformer(MODEL_NAME)
        logger.info(f"Model {MODEL_NAME} loaded successfully.")
    except Exception as e:
        logger.error(f"Failed to load model {MODEL_NAME}: {e}")
        # HTTPException only makes sense inside a request handler;
        # raising here aborts startup, which is what we want.
        raise RuntimeError(f"Failed to load embedding model: {e}") from e

@app.post("/embed", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Embedding model not loaded.")

    if not request.texts:
        return EmbeddingResponse(embeddings=[], model=MODEL_NAME, dimension=model.get_sentence_embedding_dimension())

    try:
        # encode() returns a NumPy array by default; tolist() converts it
        # to JSON-serializable nested lists of floats.
        sentence_embeddings = model.encode(request.texts, convert_to_numpy=True)
        embeddings_list = sentence_embeddings.tolist()
        
        return EmbeddingResponse(
            embeddings=embeddings_list,
            model=MODEL_NAME,
            dimension=model.get_sentence_embedding_dimension()
        )
    except Exception as e:
        logger.error(f"Error generating embeddings: {e}")
        raise HTTPException(status_code=500, detail=f"Error generating embeddings: {e}")

if __name__ == "__main__":
    # You can run this with: uvicorn embedding_service:app --host 0.0.0.0 --port 8001
    # For production, consider Gunicorn + Uvicorn workers.
    uvicorn.run(app, host="0.0.0.0", port=8001)

I containerized this service using Docker for easy deployment and management. The Dockerfile was straightforward:


# Dockerfile
FROM python:3.10-slim-bookworm

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY embedding_service.py .

ENV EMBEDDING_MODEL_NAME="all-MiniLM-L6-v2"

# Pre-download the model at build time so the container starts
# without a network fetch (optional but recommended):
RUN python -c "import os; from sentence_transformers import SentenceTransformer; SentenceTransformer(os.environ['EMBEDDING_MODEL_NAME'])"

EXPOSE 8001

CMD ["uvicorn", "embedding_service:app", "--host", "0.0.0.0", "--port", "8001"]

And the requirements.txt:


fastapi~=0.110.0
uvicorn[standard]~=0.29.0
sentence-transformers~=2.7.0
torch~=2.2.0 # Or specify your preferred ML backend (tensorflow, onnxruntime)

Once the container was built and deployed to our VM, it was accessible via an internal network endpoint. This setup provided a robust and isolated environment for our embedding operations.

Integrating with the Application

The next step was to modify our main application's embedding utility to switch from calling the external LLM API to our new local service. I created a simple configuration toggle to allow for easy switching and fallback if needed.

Before (Simplified External API Call):


# utils/embeddings.py (simplified before)
import requests
import os

EXTERNAL_EMBEDDING_API_URL = os.getenv("EXTERNAL_EMBEDDING_API_URL")
EXTERNAL_EMBEDDING_API_KEY = os.getenv("EXTERNAL_EMBEDDING_API_KEY")

def get_external_embeddings(texts: list[str]) -> list[list[float]]:
    headers = {
        "Authorization": f"Bearer {EXTERNAL_EMBEDDING_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "input": texts,
        "model": "text-embedding-ada-002" # Example external model
    }
    try:
        response = requests.post(EXTERNAL_EMBEDDING_API_URL, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
        # Assuming the API returns embeddings in a specific structure
        return [item["embedding"] for item in data["data"]]
    except requests.exceptions.RequestException as e:
        print(f"Error calling external embedding API: {e}")
        raise

After (Switching to Local Service):


# utils/embeddings.py (after refactoring)
import requests
import os
import logging

logger = logging.getLogger(__name__)

# Configuration for local service
LOCAL_EMBEDDING_SERVICE_URL = os.getenv("LOCAL_EMBEDDING_SERVICE_URL", "http://localhost:8001/embed")
USE_LOCAL_EMBEDDINGS = os.getenv("USE_LOCAL_EMBEDDINGS", "true").lower() == "true"

# Fallback to external if local fails or not enabled (optional)
EXTERNAL_EMBEDDING_API_URL = os.getenv("EXTERNAL_EMBEDDING_API_URL")
EXTERNAL_EMBEDDING_API_KEY = os.getenv("EXTERNAL_EMBEDDING_API_KEY")

def get_local_embeddings(texts: list[str]) -> list[list[float]]:
    payload = {"texts": texts}
    try:
        response = requests.post(LOCAL_EMBEDDING_SERVICE_URL, json=payload, timeout=30)
        response.raise_for_status()
        data = response.json()
        return data["embeddings"]
    except requests.exceptions.RequestException as e:
        logger.error(f"Error calling local embedding service: {e}")
        raise

def get_embeddings(texts: list[str]) -> list[list[float]]:
    if USE_LOCAL_EMBEDDINGS:
        try:
            logger.info("Attempting to get embeddings from local service...")
            return get_local_embeddings(texts)
        except Exception as e:
            logger.warning(f"Local embedding service failed: {e}. Falling back to external API.")
            # Fallback logic: if local fails, try external (if configured)
            if EXTERNAL_EMBEDDING_API_URL and EXTERNAL_EMBEDDING_API_KEY:
                return get_external_embeddings(texts)
            else:
                raise # Re-raise if no fallback or fallback also fails
    else:
        logger.info("Using external embedding API as per configuration.")
        return get_external_embeddings(texts)

# The get_external_embeddings function from before remains unchanged
# (or is refactored into a separate external_embeddings.py for cleaner
# separation); its body is elided here.
def get_external_embeddings(texts: list[str]) -> list[list[float]]:
    ...  # same implementation as before

This refactoring allowed us to switch our embedding source with an environment variable, providing flexibility and a safety net during the transition. I deployed these changes cautiously, first to staging, then gradually to production, monitoring performance and error rates closely.

Metrics and the Sweet Taste of Savings

The results were almost immediate and incredibly satisfying. Within the first month of full production rollout, our embedding-related API costs plummeted. The $540 monthly expenditure dropped to virtually zero for API calls.

Of course, "zero" isn't entirely accurate. We now incurred compute costs for the VM running our local embedding service. However, this was a VM already provisioned and underutilized, meaning the incremental cost was negligible – perhaps an extra $10-20/month in CPU/memory usage, which was easily absorbed by our existing infrastructure budget. Even if we had to provision a new, dedicated small instance, the cost would be in the realm of $50-100/month, still a massive saving.

Here’s the updated (and much happier) cost breakdown:


Monthly LLM API Bill Breakdown (After Optimization):

Total Bill: $670 (Previously $1,200)

- LLM Generation (Prompting, Completion): $660 (98.5%)
- LLM Embeddings (External API): $0 (0%)
- Local Embedding Service Compute Cost: ~$10 (1.5%)

Total Savings on Embeddings: $530/month (98% reduction on embedding costs)
Overall LLM-related Cost Reduction: ~$530/month (44% reduction on total bill)

This translated to a whopping 44% reduction in our overall LLM-related expenses! The impact was profound, freeing up budget for further experimentation and scaling our LLM generation tasks.

From a performance standpoint, the local service added a small per-request inference latency compared to the heavily optimized cloud APIs. This was offset by eliminating the round trip to an external service and by batching requests more aggressively on our side. For our workloads, which are dominated by batch document processing, overall throughput remained excellent, and in some scenarios improved because we were no longer throttled by external rate limits.
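The client code shown earlier sends all texts in a single request; in practice, large indexing jobs get chunked into bounded batches first. A minimal sketch of that batching layer (the batch size of 64 is an assumption; tune it to your service's memory and request-size limits):

```python
from typing import Callable, Iterator

def batched(texts: list[str], batch_size: int = 64) -> Iterator[list[str]]:
    """Yield successive fixed-size batches from a list of texts."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed a large corpus by sending bounded batches to `embed_fn`
    (e.g. the get_embeddings() wrapper shown above)."""
    out: list[list[float]] = []
    for batch in batched(texts, batch_size):
        out.extend(embed_fn(batch))
    return out
```

With batches bounded like this, a single large indexing run can't blow past the embedding service's request-size or memory limits, and failed batches can be retried individually.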

Quality-wise, the all-MiniLM-L6-v2 model proved to be perfectly adequate for our semantic search and recommendation tasks. We ran offline evaluations comparing its performance on our specific datasets against the external API's embeddings and found the difference in relevance ranking to be minimal and acceptable for our production needs.
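Our offline evaluation came down to: embed the same queries and corpus with both the external API and the local model, retrieve the top-k results with each, and measure how much the ranked lists agree. A simplified, illustrative sketch of that comparison (overlap@k as the metric; this is not our full evaluation harness, and the rankings below are made up):

```python
def top_k_overlap(ranking_a: list[str], ranking_b: list[str], k: int = 10) -> float:
    """Fraction of documents shared between the top-k of two retrieval rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k

# Hypothetical top-5 results for one query, from each embedding source:
external_api_ranking = ["doc3", "doc7", "doc1", "doc9", "doc4"]
local_model_ranking  = ["doc3", "doc1", "doc7", "doc2", "doc9"]

print(top_k_overlap(external_api_ranking, local_model_ranking, k=5))  # 0.8
```

Averaged over a representative query set, a high overlap@k told us the cheaper local model was retrieving essentially the same documents, which is what actually matters for search and RAG quality.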

What I Learned / The Challenge

The biggest lesson here was to never take any cloud service's pricing model for granted, especially with LLMs. What seems like a small per-token cost can quickly snowball into a significant expenditure as usage scales. Always scrutinize your usage metrics and break down costs by component.

The main challenge wasn't technical implementation, but rather the initial inertia of questioning an established, seemingly "working" system. The convenience of managed services often masks their true long-term cost implications. Once I committed to investigating the cost spike, the path to a local solution became clear.

Another challenge was ensuring the local model's performance and stability. Running a model locally means taking on the operational burden: monitoring resource usage (CPU, RAM), ensuring the service is up, handling updates, and potentially scaling it horizontally if a single instance becomes a bottleneck. For now, our current volume is handled well by a single instance, but future scaling will require more thought into distributed model serving.

This experience reinforced the value of open-source models and the flexibility they offer. Being able to choose a model, deploy it on our own terms, and directly control the compute resources is a powerful lever for cost optimization and architectural freedom.

Looking Ahead

Looking ahead, I'm keen to explore further optimizations for our local embedding service. This includes experimenting with quantization (e.g., 8-bit or 4-bit) to reduce the model's memory footprint and potentially improve inference speed, as well as investigating ONNX Runtime for even faster inference. The goal isn't just to save money, but to build a more resilient, performant, and cost-effective LLM infrastructure that gives us maximum control over our AI stack.
