LLM Token Optimization: Reducing AI Agent Costs by 64%

LLM token optimization is achieved by reducing redundant data through semantic caching, dynamic context pruning, and routing tasks to smaller models. These strategies minimize API costs while maintaining reasoning performance by ensuring only essential information enters the context window. Implementing these techniques can reduce operational expenses by over 60% in production environments.

Three weeks ago, I woke up to a Google Cloud billing alert that made my stomach drop. My production environment had burned through $4,200 in less than 48 hours. For a mid-sized agentic system that was supposed to be in a "controlled" beta, this wasn't just a bug; it was a financial emergency. After digging through the logs, I realized my agentic loop had entered a recursive state where it was re-sending 128k context windows to Gemini 1.5 Pro every 30 seconds to solve a relatively simple reasoning task.

The problem wasn't the model's intelligence—it was my own failure to manage token lifecycle. In the rush to build a "smart" system, I had ignored the fundamental physics of LLM deployments: tokens are the only currency that matters. If you don't control them, they will control your burn rate. I spent the last fortnight refactoring our entire backend to implement aggressive LLM token optimization strategies. My goal was to maintain the agent's reasoning capabilities while slashing the cost per request. I reduced operational costs by 64% using these techniques. Here is exactly how I did it, the code I used, and the failures I encountered along the way.

Why Multi-Stage AI Pipelines Lead to High Context Debt

In my previous post about building a scalable multi-stage Python AI pipeline, I discussed how to chain different LLM calls together to handle complex tasks. What I didn't emphasize enough was the "context debt" that accumulates in these chains. In a multi-stage pipeline, each step often inherits the full history of the previous steps, so every stage pays again for everything that came before it and total input tokens grow much faster than the number of stages. By the time you reach the fifth or sixth stage, your system prompt and conversation history have ballooned into a massive payload.

When I analyzed our traffic, I found that 70% of our token usage was redundant. We were sending the same core instructions and the same static data over and over again. I had built the system inside a Python monorepo architecture, which made it easy to deploy, but the lack of a centralized token management layer meant every microservice was acting like it had an infinite budget.
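
A rough way to see context debt is to add up the input tokens billed across a chain. The sketch below compares a pipeline where every stage inherits all prior output with one where each stage receives only the previous stage's output. The function names and token counts are hypothetical and chosen only for illustration.

```python
def cumulative_input_tokens(system_tokens, step_output_tokens, stages):
    """Total input tokens billed when each stage re-sends all prior history."""
    total = 0
    history = 0  # tokens accumulated from earlier stages
    for _ in range(stages):
        total += system_tokens + history  # this stage's input payload
        history += step_output_tokens     # its output joins the history
    return total


def isolated_input_tokens(system_tokens, step_output_tokens, stages):
    """Total input tokens if each stage only sees the previous stage's output."""
    if stages == 0:
        return 0
    first = system_tokens  # the first stage has no prior output
    rest = (stages - 1) * (system_tokens + step_output_tokens)
    return first + rest


if __name__ == "__main__":
    # Hypothetical numbers: 1,500-token system prompt, 2,000 tokens per stage.
    full = cumulative_input_tokens(1500, 2000, 6)
    lean = isolated_input_tokens(1500, 2000, 6)
    print(f"inherit everything: {full} tokens, pass only last output: {lean} tokens")
```

With these made-up numbers, the six-stage chain that inherits everything is billed for roughly twice the input tokens of the lean version, and the gap widens with each extra stage.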

How to Implement Semantic Caching with Redis for LLM Token Optimization

The most immediate win for LLM token optimization was implementing a semantic cache. Exact-match caching rarely helps with LLM prompts, because even a single character change results in a cache miss. Semantic caching instead uses vector embeddings to decide whether a new prompt is "close enough" to a previously answered one and, if it is, serves the cached response without making an API call.

I used Redis with the RediSearch module to store prompt embeddings. When a request comes in, I generate an embedding of the prompt and perform a vector similarity search. If the cosine similarity is above 0.96, I return the cached response instead of calling the Gemini API.

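import hashlib

import numpy as np
import redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.96):
        self.redis_client = redis.Redis(host='localhost', port=6379, db=0)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = threshold

    def _embed(self, prompt_text):
        # RediSearch expects vectors as raw float32 bytes, not Python lists
        return np.asarray(self.model.encode(prompt_text), dtype=np.float32).tobytes()

    def get_cached_response(self, prompt_text):
        # Assumes an index "idx:cache" over the "cache:" key prefix already exists,
        # with a FLOAT32 vector field named "vector" using the COSINE metric
        query = (
            Query("*=>[KNN 1 @vector $vec AS score]")
            .return_fields("response", "score")
            .dialect(2)  # KNN query syntax requires dialect 2
        )
        results = self.redis_client.ft("idx:cache").search(
            query, query_params={"vec": self._embed(prompt_text)}
        )

        # With COSINE, the score is a distance (1 - similarity), so smaller is closer
        if results.docs and float(results.docs[0].score) < (1 - self.threshold):
            return results.docs[0].response
        return None

    def set_cache(self, prompt_text, response_text):
        # hashlib gives a key that is stable across processes, unlike built-in hash()
        key = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
        self.redis_client.hset(f"cache:{key}", mapping={
            "vector": self._embed(prompt_text),
            "response": response_text
        })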

The failure here was setting the threshold too low. At 0.90, the agent started giving "hallucinated" cached responses—it would answer a question about 'User A' with 'User B's' data because the prompts looked similar to the encoder. I found 0.96 to be the sweet spot for our specific domain. This alone reduced our API calls by 22%.
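
One way to pick a threshold more deliberately is to score a small labelled set of prompt pairs and check how many cache hits would have been wrong at each cutoff. The sketch below is a minimal illustration using only the standard library; the helper names and the sample similarity values are hypothetical, not measurements from this system.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def false_hit_rate(pairs, threshold):
    """Fraction of cache hits that would have served the wrong answer.

    pairs: list of (similarity, same_answer) tuples, where same_answer says
    whether the two prompts genuinely deserve the same response.
    """
    hits = [same for similarity, same in pairs if similarity >= threshold]
    if not hits:
        return 0.0
    return sum(1 for same in hits if not same) / len(hits)


# Hypothetical labelled pairs: (similarity, should the answers match?)
SAMPLE_PAIRS = [
    (0.99, True), (0.97, True), (0.95, True),
    (0.93, False), (0.91, False), (0.88, False),
]

if __name__ == "__main__":
    for threshold in (0.90, 0.96):
        rate = false_hit_rate(SAMPLE_PAIRS, threshold)
        print(f"threshold {threshold}: false-hit rate {rate:.0%}")
```

In practice the labelled pairs would come from real traffic, including near-duplicates that differ only in an entity such as a user name, since those are the cases a loose threshold gets wrong.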

Reducing Costs with Dynamic Context Pruning and Summarization

Dynamic context pruning maintains model performance by replacing long conversation histories with concise summaries. The biggest token sink in agentic workflows is the conversation history. Most developers just append every message to a list and pass it to the model. This is lazy and expensive. I moved to a "sliding window with summary" approach. Instead of sending the last 50 messages, I send the last 5 in full detail and a summarized version of everything that came before.

I implemented a middleware in FastAPI that calculates the token count of the history using the official Gemini Python SDK and triggers a summarization job whenever the history exceeds 4,000 tokens. This keeps the prompt size predictable and reduces the risk of the "lost in the middle" phenomenon, where models ignore information buried in long contexts.

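def prune_context(messages, max_tokens=4000, keep_recent=5):
    # count_tokens and call_summarization_model are helpers defined elsewhere (not shown)
    current_tokens = count_tokens(messages)
    if current_tokens <= max_tokens:
        return messages

    # Keep the system prompt (index 0) and the last `keep_recent` messages
    system_prompt = messages[0]
    recent_messages = messages[-keep_recent:]
    middle_messages = messages[1:-keep_recent]
    if not middle_messages:
        # Nothing between the system prompt and the recent window to summarize
        return messages

    # Summarize the 'middle' messages
    middle_content = " ".join(m['content'] for m in middle_messages)
    summary = call_summarization_model(middle_content)

    return [
        system_prompt,
        {"role": "system", "content": f"Summary of previous interaction: {summary}"},
        *recent_messages
    ]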

This approach is significantly more effective than simple truncation. Truncation often removes the very context the agent needs to understand the current task. Summarization preserves the intent while discarding the fluff.
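
To experiment with the sliding-window-with-summary idea offline, it helps to have a version that runs without any API access. The sketch below is self-contained and makes two loud assumptions: `estimate_tokens` uses a crude four-characters-per-token heuristic rather than a real tokenizer, and `truncate_summarizer` is a placeholder that merely cuts the text short instead of calling a model. All names here are hypothetical.

```python
def estimate_tokens(messages):
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return sum(len(m["content"]) for m in messages) // 4


def truncate_summarizer(text, limit=200):
    """Placeholder summarizer: keeps the first `limit` characters."""
    return text[:limit]


def prune_with_summary(messages, max_tokens=4000, keep_recent=5,
                       count_tokens=estimate_tokens, summarize=truncate_summarizer):
    """Keep the system prompt and recent messages; summarize everything between."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    # Leave short histories untouched, and skip when there is no 'middle' to fold
    if count_tokens(messages) <= max_tokens or len(messages) <= keep_recent + 1:
        return messages
    system_prompt = messages[0]
    middle = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    summary = summarize(" ".join(m["content"] for m in middle))
    return [
        system_prompt,
        {"role": "system", "content": f"Summary of previous interaction: {summary}"},
        *recent,
    ]


if __name__ == "__main__":
    history = [{"role": "system", "content": "Role: Tech Support."}]
    history += [{"role": "user", "content": f"message {i} " + "x" * 2000}
                for i in range(20)]
    pruned = prune_with_summary(history)
    print(len(history), "->", len(pruned), "messages")
```

Passing the token counter and summarizer in as parameters keeps the pruning logic testable; swapping in a real tokenizer and a cheap summarization model should not require changing the function body.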

Implementing Token-Aware Routing Between Gemini Flash and Pro

Not every task requires a high-reasoning model like Gemini 1.5 Pro. I was using Pro for everything, including basic classification and data formatting, which was a massive waste of money. I implemented a router that evaluates the complexity of the task and directs it to either Gemini 1.5 Flash (cheap and fast) or Pro (expensive and smart).

The routing logic is based on two factors: the length of the input and a "complexity score" generated by a small, local classifier (I used a fine-tuned DistilBERT model). If the task is just "format this JSON" or "summarize this paragraph," it goes to Flash. If it involves multi-step reasoning or complex logic, it goes to Pro.
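
The two-factor routing decision can be sketched as a small pure function. This is an illustration under stated assumptions, not the production router: the cutoffs in `RouterConfig` are hypothetical, sending long inputs to Pro is an assumed policy, and `keyword_score` is a toy stand-in for the fine-tuned classifier so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FLASH = "gemini-1.5-flash"
PRO = "gemini-1.5-pro"


@dataclass(frozen=True)
class RouterConfig:
    max_flash_chars: int = 8000        # hypothetical length cutoff
    max_flash_complexity: float = 0.5  # hypothetical score cutoff in [0, 1]


def choose_model(prompt: str,
                 complexity_score: Callable[[str], float],
                 config: Optional[RouterConfig] = None) -> str:
    """Return the model name for a prompt based on length and complexity."""
    config = config or RouterConfig()
    if len(prompt) > config.max_flash_chars:
        return PRO  # assumed policy: long inputs go to the larger model
    if complexity_score(prompt) > config.max_flash_complexity:
        return PRO
    return FLASH


def keyword_score(prompt: str) -> float:
    """Toy complexity score: counts reasoning-style markers in the prompt."""
    markers = ("step by step", "prove", "debug", "plan", "why")
    hits = sum(1 for marker in markers if marker in prompt.lower())
    return min(1.0, hits / 2)


if __name__ == "__main__":
    print(choose_model("Format this JSON: {}", keyword_score))
    print(choose_model("Debug this and explain why it fails step by step",
                       keyword_score))
```

Keeping the scorer as an injected callable means the cheap heuristic can be replaced by a real classifier later without touching the routing rules.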

Cost Comparison Table (Estimated USD per 1M tokens)

Model Tier       | Input Cost | Output Cost | Use Case
-----------------|------------|-------------|--------------------------------------
Gemini 1.5 Flash | $0.075     | $0.30       | Summarization, Extraction, Routing
Gemini 1.5 Pro   | $3.50      | $10.50      | Complex Logic, Code Gen, Long Context

By routing 60% of our traffic to Flash, we saw an immediate 40% reduction in our daily spend without a measurable drop in user satisfaction. The key is to ensure the router itself is lightweight; you don't want to spend more on the routing logic than you save on the LLM call.
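
To see how the table's input prices translate into a blended rate, here is a small sketch. It covers input tokens only and assumes the split is measured in tokens rather than requests, which the figures above do not specify. The function names are illustrative.

```python
# Estimated USD per 1M input tokens, taken from the cost comparison table
PRICE_PER_M_INPUT = {"flash": 0.075, "pro": 3.50}


def blended_input_cost(flash_token_share):
    """USD per 1M input tokens when this share of tokens is sent to Flash."""
    if not 0.0 <= flash_token_share <= 1.0:
        raise ValueError("flash_token_share must be between 0 and 1")
    return (flash_token_share * PRICE_PER_M_INPUT["flash"]
            + (1.0 - flash_token_share) * PRICE_PER_M_INPUT["pro"])


def saving_vs_all_pro(flash_token_share):
    """Fractional saving on input spend compared with sending everything to Pro."""
    return 1.0 - blended_input_cost(flash_token_share) / PRICE_PER_M_INPUT["pro"]


if __name__ == "__main__":
    for share in (0.2, 0.4, 0.6):
        print(f"{share:.0%} of tokens on Flash: "
              f"${blended_input_cost(share):.3f} per 1M input tokens, "
              f"{saving_vs_all_pro(share):.1%} saved")
```

The saving tracks the share of tokens routed, not the share of requests. If the requests sent to Flash are shorter than average, a given request split will move a smaller share of tokens, so the realized saving can be lower than this formula suggests for the same percentage.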

How Compressing System Prompts Lowers Operational Overhead

Optimizing system prompts involves removing conversational filler and moving static data to RAG pipelines to minimize the input payload. I used to write system prompts like I was writing a legal contract—exhaustive, repetitive, and full of edge-case handling. I realized that for an agent that runs 10,000 times a day, a 2,000-token system prompt is a disaster. Every single word in that prompt is charged on every single turn.

I went through a "prompt audit" where I removed all conversational filler. Instead of "You are a helpful assistant that specializes in technical support and you should always be polite," I moved to "Role: Tech Support. Tone: Professional/Concise." I also moved static reference data (like product lists) out of the system prompt and into a RAG (Retrieval-Augmented Generation) pipeline. Now, the agent only pulls the specific product info it needs for the current query, rather than carrying the entire catalog in its context window. This is a critical step in LLM token optimization.
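
The cost of a long system prompt is easy to put a number on. The sketch below multiplies prompt size by call volume and input price; it assumes one billed turn per run and uses the estimated Gemini 1.5 Pro input price quoted in this post. The function name and the 200-token "trimmed" prompt are hypothetical.

```python
def daily_prompt_cost(prompt_tokens, calls_per_day, usd_per_million_input):
    """USD spent per day just on re-sending the system prompt."""
    return prompt_tokens * calls_per_day * usd_per_million_input / 1_000_000


if __name__ == "__main__":
    pro_input_price = 3.50  # estimated USD per 1M input tokens
    verbose = daily_prompt_cost(2000, 10_000, pro_input_price)
    trimmed = daily_prompt_cost(200, 10_000, pro_input_price)  # hypothetical size
    print(f"2,000-token prompt: ${verbose:.2f}/day")
    print(f"200-token prompt:   ${trimmed:.2f}/day")
```

Agents that take several turns per run pay the prompt cost on each turn, so the real figure scales with the average number of turns as well.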

I also started using "Prompt Caching" where available. Some providers allow you to cache a prefix of the prompt. If your system prompt is identical across thousands of requests, you can pay a fraction of the cost for the model to "remember" that prefix. This is a game-changer for agentic workflows where the first 1,000 tokens are always the same instructions.

Key Takeaways for Managing Production LLM Token Usage

  • Observability is the first step: You cannot optimize what you don't measure. I had to build a custom dashboard in Grafana to track token usage per user, per agent, and per stage before I could identify the biggest offenders.
  • The "Infinite Context" Trap: Just because a model can handle 1M tokens doesn't mean you should send them. Performance degrades and costs skyrocket. Always aim for the smallest context possible that still solves the problem.
  • Semantic caching is high-risk, high-reward: It can save a fortune, but if your similarity threshold is too loose, your agent will start hallucinating based on old data. Test your embeddings thoroughly.
  • State management is cost management: In multi-stage pipelines, passing a "state" object that keeps growing is a recipe for a billing disaster. Clean your state after every stage.
  • Flash-first mentality: Start with the cheapest model and only "promote" the task to a larger model if the cheap one fails a validation check.
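
The Flash-first pattern from the last takeaway can be sketched as a small wrapper: try the cheap model, validate the result, and escalate only on failure. The models are passed in as plain callables so the example runs without any API; `flash_first` and `is_json_object` are hypothetical names, and the JSON check is only one possible validation rule.

```python
import json


def is_json_object(text):
    """Validation check: does the text parse as a JSON object?"""
    try:
        return isinstance(json.loads(text), dict)
    except (json.JSONDecodeError, TypeError):
        return False


def flash_first(prompt, cheap_model, strong_model, is_valid):
    """Call the cheap model first; promote to the strong model if validation fails.

    Returns a (answer, tier) tuple so callers can track how often promotion happens.
    """
    answer = cheap_model(prompt)
    if is_valid(answer):
        return answer, "cheap"
    return strong_model(prompt), "strong"


if __name__ == "__main__":
    # Stub models standing in for real API calls
    def flaky_cheap(prompt):
        return "Sure! Here is your JSON..."

    def reliable_strong(prompt):
        return '{"status": "ok"}'

    print(flash_first("Return the status as JSON", flaky_cheap,
                      reliable_strong, is_json_object))
```

Logging the returned tier makes the promotion rate visible, which indicates whether the cheap model is doing enough of the work to justify the extra call on failures.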

Looking ahead, I’m experimenting with local SLMs (Small Language Models) like Phi-3 to handle the routing and summarization tasks entirely on-premise. If I can move the "pre-processing" tokens away from the cloud providers entirely, I can potentially drop our costs by another 20%. The next challenge will be managing the latency trade-offs of running local inference alongside cloud-based agents, but that’s a problem for next month’s sprint.
