Reducing LLM API Costs by 40% with Dynamic Prompt Compression
I remember the exact moment I knew we had a problem. It was a Tuesday morning, and I was reviewing the cloud bill for the previous month. My eyes widened as I scrolled past the LLM API usage section. The numbers were not just high; they were astronomical, a sharp spike that threatened to derail our entire project. We were scaling fast, and with every new feature leveraging large language models, our token consumption was spiraling out of control. It felt like we were printing money and throwing it directly into the API provider's coffers. Something had to change, and fast.
My initial reaction was a mix of dread and determination. How could we possibly sustain this growth if our LLM costs continued to climb at this rate? The core of the issue, I quickly realized, wasn't just the sheer volume of requests, but the size of our prompts. We were sending massive amounts of context, conversation history, and data with almost every API call, treating the LLM's context window like an infinitely expandable canvas. This approach, while convenient for development, was a financial black hole in production.
The Token Tsunami: Unpacking the Cost Driver
LLM providers charge per token. Both input (prompt) and output (completion) tokens contribute to the overall cost. For applications like ours, which involve complex content generation, summarization, and interactive agents, the input prompt can easily become the dominant cost factor. A single interaction in a chatbot, for instance, can consume thousands of tokens just for the system prompt, retrieved context, and conversation history, before even considering the user's query and the model's response.
I started by digging into our API logs. I wanted to understand the average token count per request for our most frequently used LLM endpoints. What I found was startling: many requests were consistently hitting near the maximum context window of the models we were using, even when only a fraction of that information was truly relevant to the immediate task. This "context bloat" was our primary culprit. We were paying for the LLM to process information it didn't necessarily need, leading to wasted compute and inflated bills.
Here's a simplified view of the kind of token usage I was seeing:
// Example of a typical log entry (simplified)
{
"timestamp": "2026-03-28T14:30:00Z",
"endpoint": "/generate_article_summary",
"model": "gemini-2.5-pro",
"input_tokens": 12500, // This was consistently high!
"output_tokens": 450,
"cost_usd": 0.015
}
Multiply 12,500 input tokens by thousands of requests per day, and the numbers quickly become horrifying. With Gemini 2.5 Pro input tokens costing around $1.25 per 1M tokens (for <=200K input), and significantly more for larger contexts or specific models, these costs add up rapidly.
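A quick back-of-envelope script makes the scale concrete. The per-token rates and request volume below are illustrative, not our actual figures:

```python
# Rough monthly cost estimate for a single endpoint.
# Illustrative rates: $1.25 per 1M input tokens, $10.00 per 1M output tokens.
INPUT_RATE_PER_M = 1.25
OUTPUT_RATE_PER_M = 10.00

def monthly_cost(input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """Estimate the monthly spend for one endpoint at a steady request rate."""
    per_request = ((input_tokens / 1e6) * INPUT_RATE_PER_M
                   + (output_tokens / 1e6) * OUTPUT_RATE_PER_M)
    return per_request * requests_per_day * days

# The 12,500-input-token requests from the log, at 5,000 requests/day:
print(f"${monthly_cost(12_500, 450, 5_000):,.2f}/month")  # → $3,018.75/month
```

At these rates, cutting input tokens by 40% on this one endpoint would save about $940 a month, which is why the input side was where we focused.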
Embracing Dynamic Prompt Compression
My goal was clear: reduce the input token count without sacrificing the quality or relevance of the LLM's output. This led me down the rabbit hole of "prompt compression." The idea is simple in theory: shorten the prompt while retaining essential information. The challenge, however, is doing this *dynamically* and *intelligently*, recognizing that not all parts of a prompt are equally important at all times.
I explored several techniques:
- Simple Truncation: Just cutting off the oldest parts of the context. This is fast but risky, as it can easily remove crucial information.
- Keyword Extraction: Identifying and retaining only essential terms. Good for specific information retrieval but often loses narrative flow.
- Static Summarization: Pre-summarizing large blocks of text. Effective, but a one-size-fits-all summary might not be optimal for every query.
- Retrieval-Augmented Generation (RAG): Fetching only the most relevant information at query time. This is powerful, and we already use RAG in parts of our system, but it doesn't solve the problem of an *already assembled* large prompt that's still too long.
The solution I gravitated towards was a form of *dynamic, context-aware summarization and filtering*. This involves analyzing the incoming prompt, identifying less critical sections, and then either summarizing or selectively removing them based on their relevance to the current user query and the overall task.
The Architecture: A Pre-processing Layer
I decided to implement this as a pre-processing layer before the actual LLM API call. This layer would receive the full, uncompressed prompt and return an optimized, shorter version. Here's a high-level overview of the components:
- Token Counter: Absolutely foundational. Before any compression, we need to know the current token count. We used a library that accurately reflects the tokenization strategy of our target LLM.
- Contextual Relevance Scorer: This is the "dynamic" part. For each section of the prompt (e.g., individual chat turns, document chunks), we calculate a relevance score against the *most recent user query* or the *core task instruction*. This helps us identify what can be compressed or dropped.
- Compression Strategy Engine: Based on the relevance scores and a predefined target token limit, this engine decides how to compress. It might:
- Prioritize recent conversation turns.
- Summarize older, less relevant conversation history using a smaller, cheaper LLM or a more efficient summarization model.
- Perform entity extraction on highly relevant but verbose sections to retain key facts.
- Apply simple truncation to the least relevant sections if other methods don't suffice.
Our goal was not just to cut tokens, but to do so intelligently, maintaining the "fidelity" of the prompt.
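The snippets in the next section leave the relevance scorer abstract, so here is a minimal sketch of the idea using plain lexical overlap. A production scorer would more likely use embedding cosine similarity, and the example strings are invented:

```python
import re
from collections import Counter

def _tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance_score(section: str, query: str) -> float:
    """Crude lexical relevance: the fraction of the query's terms that appear
    in the section. A stand-in for the embedding-based similarity a real
    scorer would use."""
    section_terms = Counter(_tokenize(section))
    query_terms = set(_tokenize(query))
    if not query_terms:
        return 0.0
    hits = sum(1 for term in query_terms if section_terms[term] > 0)
    return hits / len(query_terms)

query = "How does serverless handle sudden traffic spikes?"
chunks = [
    "Serverless platforms auto-scale to handle sudden traffic spikes within seconds.",
    "Our office relocated to a new building last quarter.",
]
ranked = sorted(chunks, key=lambda c: relevance_score(c, query), reverse=True)
```

Sections scoring below a threshold become candidates for summarization or dropping; everything above it is kept verbatim.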
Implementation Snippets (Python)
Let's look at some conceptual Python code for how this might work. We're using a count_tokens helper and a simplified summarize_text_llm function, which in a real-world scenario would likely be another LLM call or a fine-tuned smaller model.
import tiktoken  # Or your LLM provider's specific tokenizer
from typing import List, Dict

# Assume this is a lightweight, potentially fine-tuned summarization model
# or a call to a cheaper LLM for summarization tasks.
# In a real system, this would be an API call.
def summarize_text_llm(text: str, target_tokens: int) -> str:
    """
    Simulates calling a smaller LLM to summarize text to a target token count.
    In production, this might be a separate, cost-optimized LLM endpoint.
    """
    # Placeholder for actual LLM summarization logic.
    if len(text.split()) < 50:  # Don't summarize very short texts
        return text
    # A more sophisticated approach would build a summarization prompt and
    # iteratively shorten the text until target_tokens is met, e.g.:
    #   prompt = f"Condense the following into roughly {int(target_tokens * 0.75)} words: {text}"
    #   return llm_client.generate(prompt=prompt, max_tokens=target_tokens).text
    # Simple heuristic for the demo: keep the first N words, assuming
    # roughly 0.75 words per token.
    words = text.split()
    max_words = int(target_tokens * 0.75)
    if len(words) > max_words:
        return " ".join(words[:max_words]) + "..."
    return text

def count_tokens(text: str, model_name: str = "gpt-4") -> int:
    """Counts tokens using a tokenizer specific to the LLM."""
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except KeyError:
        # tiktoken only covers OpenAI models; fall back to a general-purpose
        # encoding as an approximation for other providers (e.g., Gemini).
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))
def dynamic_prompt_compressor(
    system_prompt: str,
    conversation_history: List[Dict[str, str]],
    retrieved_context: List[str],
    user_query: str,
    max_tokens_budget: int,
    model_name: str = "gpt-4",
) -> List[Dict[str, str]]:
    """
    Dynamically compresses a prompt to fit within a token budget.
    Priority order: system prompt and user query first, then recent
    conversation history, then retrieved context.
    """
    # 1. Always reserve budget for the system prompt and user query (highest priority).
    system_part = {"role": "system", "content": system_prompt}
    query_part = {"role": "user", "content": user_query}
    base_tokens = count_tokens(system_prompt, model_name) + count_tokens(user_query, model_name)
    remaining_tokens = max_tokens_budget - base_tokens
    if remaining_tokens <= 0:
        print("Warning: System prompt and user query alone exceed budget!")
        return [system_part, query_part]  # Return minimal prompt to avoid error

    # 2. Add recent conversation history (second highest priority).
    # Iterate history in reverse to prioritize the most recent turns.
    history_to_add: List[Dict[str, str]] = []
    for turn in reversed(conversation_history):
        turn_tokens = count_tokens(turn["content"], model_name) + count_tokens(turn["role"], model_name)
        if remaining_tokens - turn_tokens > 0:
            history_to_add.insert(0, turn)  # Insert at the front to keep chronological order
            remaining_tokens -= turn_tokens
        else:
            # If the whole turn doesn't fit, try to summarize it.
            # A more advanced system would use a relevance score for each turn.
            if turn["role"] == "assistant":  # Summarize the assistant's verbose responses
                summarized_content = summarize_text_llm(turn["content"], remaining_tokens // 2)
                summarized_tokens = count_tokens(summarized_content, model_name)
                if remaining_tokens - summarized_tokens > 0:
                    history_to_add.insert(0, {"role": turn["role"], "content": summarized_content})
                    remaining_tokens -= summarized_tokens
            break  # Stop adding history once the budget is tight

    # 3. Add retrieved context (lowest priority, but often the largest).
    context_to_add: List[Dict[str, str]] = []
    for doc_chunk in retrieved_context:
        chunk_tokens = count_tokens(doc_chunk, model_name)
        if remaining_tokens - chunk_tokens > 0:
            context_to_add.append({"role": "system", "content": f"Relevant context: {doc_chunk}"})
            remaining_tokens -= chunk_tokens
        else:
            # Summarize the chunk if it's still relevant but too long.
            summarized_chunk = summarize_text_llm(doc_chunk, remaining_tokens)
            summarized_tokens = count_tokens(summarized_chunk, model_name)
            if remaining_tokens - summarized_tokens > 0:
                context_to_add.append({"role": "system", "content": f"Relevant context (summarized): {summarized_chunk}"})
                remaining_tokens -= summarized_tokens
            break  # Stop adding context once the budget is exhausted

    # Assemble in a sensible message order: system, context, history, then the query.
    return [system_part] + context_to_add + history_to_add + [query_part]
# --- Example Usage ---
# Dummy data for demonstration
long_system_prompt = "You are an expert content creator for a tech blog. Your task is to generate engaging and informative blog post outlines and content based on user requests. Always maintain a professional yet approachable tone. Ensure all factual claims are supported by the provided context. If the user asks for a summary, provide a concise, bulleted list. If they ask for a deep dive, provide detailed explanations. Your responses should be well-structured and easy to read."
long_conversation_history = [
{"role": "user", "content": "I want to write a blog post about the benefits of serverless architectures for small businesses. Focus on cost savings and scalability."},
{"role": "assistant", "content": "Serverless architectures offer significant advantages for small businesses, primarily in cost efficiency and scalability. By eliminating the need to provision and manage servers, businesses can drastically reduce operational overhead. This pay-as-you-go model means you only pay for the compute time consumed by your functions, leading to substantial savings compared to traditional always-on servers. Furthermore, serverless platforms automatically scale resources up or down based on demand, ensuring your application can handle traffic spikes without manual intervention or over-provisioning. This inherent elasticity provides a robust foundation for growth, allowing small businesses to focus on innovation rather than infrastructure. We also covered this in a previous post, 'Optimizing LLM API Latency and Cost with Asynchronous Batching' which you can find at http://www.techfrontier.blog/2026/02/optimizing-llm-api-latency-and-cost.html. It's crucial for small businesses to consider the trade-offs, such as potential vendor lock-in and cold start latencies, but the overall benefits often outweigh these concerns for many use cases. For deeper insights into managing cloud costs, especially for data ingestion, consider reading 'Building a Cost-Effective Data Ingestion Pipeline for LLM Fine-Tuning on GCP' at http://www.techfrontier.blog/2026/03/building-cost-effective-data-ingestion.html."},
{"role": "user", "content": "That's great! Can you elaborate on the 'cost savings' aspect? I need some concrete examples or metrics."},
{"role": "assistant", "content": "Certainly. The cost savings in serverless come from several angles. First, zero server maintenance means no spending on patching, security updates, or hardware failures. Second, granular billing: you're billed per execution, often down to milliseconds, and only when your code is running. For a small business with fluctuating traffic, this is a game-changer. Imagine a promotional campaign that suddenly doubles traffic for a few hours; with serverless, your costs only rise during that spike, rather than paying for idle capacity 24/7. Third, reduced operational staff needs – fewer engineers dedicated to infrastructure means they can focus on product development. One startup reported a 60% reduction in infrastructure costs after migrating to a serverless model for their backend APIs, allowing them to reinvest significantly in marketing and product features. This is a common theme across many success stories. We can dive into specific cloud provider examples if you'd like. This closely relates to cost optimization in general, as discussed in our post on data ingestion pipelines."},
{"role": "user", "content": "Okay, I understand the cost savings. Now, let's talk about the scalability benefits in more detail. How does it handle sudden traffic spikes?"}
]
long_retrieved_context = [
"A white paper on serverless computing from CloudProviderX highlights that serverless functions can scale to hundreds or thousands of concurrent executions within seconds, without any pre-provisioning. This elasticity is crucial for applications with unpredictable load patterns, such as e-commerce sites during flash sales or event-driven data processing pipelines.",
"Another study by TechResearch Institute found that companies adopting serverless reported an average of 45% faster time-to-market for new features due to reduced infrastructure concerns. They also noted a significant improvement in developer productivity.",
"The inherent auto-scaling capabilities of serverless platforms like AWS Lambda, Google Cloud Functions, and Azure Functions mean that developers don't need to worry about capacity planning. The platform automatically provisions and de-provisions compute resources in response to demand, ensuring high availability and performance even under extreme load. This contrasts sharply with traditional VM-based deployments, where scaling often requires manual intervention or complex auto-scaling groups that can be slow to react."
]
user_current_query = "How does serverless handle sudden traffic spikes, specifically regarding auto-scaling mechanisms?"
target_llm_model = "gemini-2.5-pro" # Or "gpt-4", etc.
# Let's set a tight budget to force compression
# Gemini 2.5 Pro accepts up to ~1M context tokens, but we'll simulate a much
# smaller budget to demonstrate compression.
small_token_budget = 2000
initial_full_prompt_tokens = count_tokens(long_system_prompt +
" ".join([t["content"] for t in long_conversation_history]) +
" ".join(long_retrieved_context) +
user_current_query, model_name=target_llm_model)
print(f"Initial full prompt tokens: {initial_full_prompt_tokens}")
compressed_prompt = dynamic_prompt_compressor(
system_prompt=long_system_prompt,
conversation_history=long_conversation_history,
retrieved_context=long_retrieved_context,
user_query=user_current_query,
max_tokens_budget=small_token_budget,
model_name=target_llm_model
)
compressed_tokens = sum(count_tokens(part["content"], model_name=target_llm_model) for part in compressed_prompt)
print(f"Compressed prompt tokens: {compressed_tokens}")
print("\n--- Compressed Prompt Content ---")
for part in compressed_prompt:
    print(f"[{part['role']}]: {part['content'][:200]}...")  # Print first 200 chars for brevity
print("-------------------------------")
# Simulate the actual LLM call (would use compressed_prompt)
# llm_response = llm_client.generate(messages=compressed_prompt, max_tokens=response_max_tokens)
The dynamic_prompt_compressor function is a simplified example. In a production system, the summarize_text_llm call would be a separate, optimized API request to a smaller, cheaper LLM (e.g., a "flash" model like Gemini 2.5 Flash or Claude 3.5 Haiku specifically for summarization) or even an open-source model running locally. This is a crucial detail: the cost of compressing should be significantly less than the cost saved by compressing the main prompt.
I also found that "structured prompting" could inherently reduce token count by using compact formats like JSON instead of verbose natural language. This wasn't a compression *technique* per se, but a prompt engineering best practice that complemented our efforts.
Metrics and Results: The 40% Reduction
After deploying the dynamic prompt compression layer, the results were almost immediate and incredibly satisfying. I implemented detailed logging for both pre-compression and post-compression token counts for every LLM API call. This allowed me to track the average compression ratio and, more importantly, the impact on our billing.
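The logging itself was simple. Here is a sketch of the kind of structured record we emitted per call; the field names are illustrative:

```python
import json
import time

def log_compression(endpoint: str, raw_tokens: int, compressed_tokens: int) -> dict:
    """Emit one structured log record per LLM call so compression ratios
    can be aggregated later in a monitoring stack."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "endpoint": endpoint,
        "raw_input_tokens": raw_tokens,
        "compressed_input_tokens": compressed_tokens,
        "compression_ratio": round(compressed_tokens / raw_tokens, 3) if raw_tokens else None,
    }
    print(json.dumps(record))
    return record

rec = log_compression("/generate_article_summary", 12_500, 7_400)
# rec["compression_ratio"] → 0.592
```

Averaging `compression_ratio` per endpoint per day is what let us correlate the compression layer's behavior directly with the billing numbers below.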
Here's a snapshot of the kind of cost reduction we observed:
// Monthly LLM API Cost (Hypothetical Data for Illustration)
| Month | Raw Cost (USD) | Compressed Cost (USD) | Savings (USD) | Reduction (%) |
|-------------|----------------|-----------------------|---------------|---------------|
| Jan 2026 | $15,000 | N/A | N/A | N/A |
| Feb 2026 | $18,500 | N/A | N/A | N/A |
| Mar 2026 | $22,000 | $13,200 (post-deploy) | $8,800 | 40% |
| Apr 2026 | N/A | (projected $13,500) | (projected) | (projected) |
The 40% reduction wasn't a magic bullet that happened overnight; it was the cumulative effect of continuous monitoring, tweaking the compression thresholds, and refining our relevance scoring. We found that for some tasks, a more aggressive compression (e.g., summarizing older chat turns into a single "context summary") was perfectly acceptable, while for others, we needed to be more conservative.
One critical aspect was ensuring that the compression didn't degrade the quality of the LLM's responses. We implemented A/B testing with human evaluators and automated metrics (like ROUGE scores for summarization tasks) to ensure that the compressed prompts still yielded outputs of comparable or even improved quality. The initial fear was that compression would lead to "hallucinations" or loss of critical information. However, by focusing on contextual relevance, we managed to mitigate this risk significantly. We also found that sometimes, a more concise, focused prompt actually led to *better* LLM performance, as the model wasn't distracted by irrelevant information.
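For reference, ROUGE-1 recall, the simplest of the automated metrics we tracked, can be computed in a few lines. This toy version uses set membership rather than the multiplicity-aware counting a real ROUGE library performs:

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of the reference's unigrams that also appear
    in the candidate. Simplified: ignores repeated words and stemming."""
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in cand) / len(ref)

score = rouge1_recall(
    "serverless scales automatically with demand",
    "serverless platforms scale automatically as demand grows",
)  # → 0.6
```

We treated scores like this as a regression alarm, not a quality guarantee; human review caught the nuances the metric misses.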
This journey also reinforced the importance of monitoring token counts per interaction. As mentioned in our previous post, Optimizing LLM API Latency and Cost with Asynchronous Batching, latency and cost are often two sides of the same coin when dealing with LLMs. Reducing token count also inherently reduces inference time, offering a double win.
What I Learned / The Challenge
The biggest lesson for me was that LLM cost optimization isn't a one-time fix; it's an ongoing process of engineering and observation. It forces you to be incredibly deliberate about what information you feed to the model. The "more context is always better" mentality is a costly trap. Instead, the focus should be on "the *right* context is better."
One significant challenge was balancing compression aggressiveness with output quality. There's a sweet spot, and finding it requires continuous iteration and robust evaluation. Over-compressing can lead to a loss of nuance or even factual errors. Under-compressing means you're still leaving money on the table. This is where human judgment and domain expertise become invaluable, even in an automated system.
Another learning was the potential for "compression cost." If your summarization or relevance scoring mechanism itself relies on a large, expensive LLM, you might just be shifting costs around rather than reducing them. This is why using smaller, cheaper models or highly optimized algorithms for the compression step is vital. This is also where techniques like hierarchical summarization, where summaries of summaries are created, can become very efficient.
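A minimal sketch of hierarchical summarization, with a toy summarizer standing in for the cheap-model call (`hierarchical_summarize` is an invented helper, not a library function):

```python
from typing import Callable, List

def hierarchical_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Repeatedly merge adjacent chunks and summarize each pair until a single
    summary remains. Each level roughly halves the number of pieces, so N
    chunks cost on the order of N cheap summarization calls in total."""
    while len(chunks) > 1:
        chunks = [
            summarize(" ".join(chunks[i:i + 2]))
            for i in range(0, len(chunks), 2)
        ]
    return chunks[0]

# Toy "summarizer" that just keeps the first five words of its input.
toy = lambda text: " ".join(text.split()[:5])
result = hierarchical_summarize(["a b c d e f", "g h", "i j k"], toy)
```

Because every call operates on already-shortened text, the upper levels of the hierarchy are very cheap, which is what keeps the compression cost well below the savings.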
Finally, I learned the importance of having a clear cost visibility strategy. Without granular logging of token usage and associated costs, it would have been impossible to identify the problem, measure the impact of our solution, or iterate effectively. Our investment in monitoring tools paid dividends here.
Related Reading
This effort was part of a broader push to optimize our LLM infrastructure. Here are a couple of other posts from our DevLog that you might find relevant:
- Optimizing LLM API Latency and Cost with Asynchronous Batching: This post delves into how we used asynchronous batching to further reduce LLM API latency and costs, complementing our prompt compression efforts by optimizing the network calls themselves.
- Building a Cost-Effective Data Ingestion Pipeline for LLM Fine-Tuning on GCP: While not directly about prompt compression, this article discusses how we built the underlying data infrastructure that feeds our LLMs, including strategies for cost-effective data preparation which indirectly influences the quality and conciseness of the context we *can* provide.
Looking Ahead
Our journey in LLM cost optimization is far from over. I'm currently exploring more advanced prompt compression techniques, such as using smaller, fine-tuned models specifically for relevance scoring or even experimenting with prompt pruning at a sub-sentence level if we can maintain coherence. The rapid evolution of LLM capabilities and pricing models means that continuous adaptation is key. I'm particularly interested in how new models with extremely large context windows (like some of the 1M token models) might change the economics, and whether compression still offers a significant advantage in those scenarios or if it shifts more towards intelligent context *management* rather than aggressive *reduction*. The goal remains the same: deliver powerful AI-driven features efficiently and sustainably, without letting our cloud bill become the bottleneck.