Predicting LLM API Costs: A Pre-Production Strategy
The fear is real, isn't it? That moment when you're about to push a new LLM-powered feature to production, and the back of your mind whispers: "What's this actually going to cost?" I've been there. The promise of large language models is immense, but so is the potential for an unexpected bill at the end of the month. We've all heard the stories, or perhaps even lived through them – a sudden spike in LLM API usage turning a minor experimental cost into a significant budget line item. This isn't just about being frugal; it's about building sustainably and predictably, especially when you're innovating rapidly.
Early in the development of our content generation features, I quickly realized that relying on "napkin math" or hoping for the best wouldn't cut it. The variability in prompt lengths, the dynamic nature of LLM responses, and the continuous evolution of our prompt engineering strategies made accurate cost forecasting a moving target. I needed a systematic way to predict LLM API costs *before* deployment, not just react to them after the fact. This DevLog details the framework I built to tackle this exact problem.
The Challenge: Unpredictable Token Economics
LLM API pricing models are, at their core, usage-based, typically billing per token for both input (prompts) and output (completions). While seemingly straightforward, several factors conspire to make accurate pre-production cost prediction notoriously difficult:
- Dynamic Token Counts: The number of tokens in a prompt can vary wildly based on user input, retrieved context, and the complexity of the task. Similarly, the length of an LLM's response is often unpredictable, especially in generative tasks.
- Model Choice Impact: Different LLM models, even from the same provider, have vastly different price points. Using a powerful, expensive model for a simple task can quickly escalate costs.
- Prompt Engineering Iterations: As we refine prompts for better quality or efficiency, the token count can change. A seemingly minor adjustment can have a significant cost implication at scale.
- Hidden Costs: Beyond raw input/output tokens, some providers might have charges for context caching, grounding with search, or specific multimodal features.
- Production Variability: Real-world user behavior, error retries, and burst traffic introduce complexities that static estimates often miss.
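To make the model-choice point concrete, here is a quick back-of-the-envelope comparison. The prices are hypothetical per-million-token rates (matching the illustrative pricing used throughout this post), and the 800-input/300-output token split is an arbitrary example request:

```python
# Back-of-the-envelope cost comparison for a single request of 800 prompt
# tokens and 300 completion tokens. Prices are hypothetical (per 1M tokens).
PRICES = {
    "gpt-5.4":      {"input": 2.50, "output": 15.00},
    "gpt-5.4-nano": {"input": 0.20, "output": 1.25},
}

def cost_per_call(model, prompt_tokens, completion_tokens):
    p = PRICES[model]
    return (prompt_tokens / 1_000_000) * p["input"] + \
           (completion_tokens / 1_000_000) * p["output"]

flagship = cost_per_call("gpt-5.4", 800, 300)       # $0.0065 per call
nano = cost_per_call("gpt-5.4-nano", 800, 300)      # $0.000535 per call
print(f"flagship: ${flagship:.6f}, nano: ${nano:.6f}, ratio: {flagship / nano:.1f}x")
```

At a million calls a month, that ratio is the difference between roughly $6,500 and $535 for the exact same workload.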
My initial attempts involved manually estimating average prompt and completion lengths and multiplying by expected usage. This quickly proved insufficient. It didn't account for the long tail of complex requests, the varying verbosity of the LLM, or the impact of prompt engineering tweaks. I needed something more robust.
Building a Robust LLM Cost Prediction Framework
My solution centered on instrumenting our LLM calls directly within our development and testing environments to capture actual token usage. This allowed me to simulate production-like scenarios and project costs based on real data, rather than abstract estimates.
Step 1: The LLM API Wrapper
The first critical component was a Python wrapper around our LLM API client. This wrapper intercepts every call to the LLM, records the input and output, and crucially, extracts the token usage information provided in the API response. Most LLM providers include token counts in their API responses, which is a goldmine for cost tracking.
Here's a simplified example of what such a wrapper might look like using a hypothetical llm_provider_client:
```python
import json
import time
from datetime import datetime


class LLMCostTracker:
    def __init__(self, log_file="llm_usage_log.jsonl"):
        self.log_file = log_file

    def _log_usage(self, model_name, prompt_tokens, completion_tokens, cost, metadata=None):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "model": model_name,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "estimated_cost_usd": cost,
            "metadata": metadata if metadata is not None else {},
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry) + "\n")

    def _get_model_pricing(self, model_name):
        # This would be dynamic, fetched from a config or external source.
        # For demonstration, hardcoding some hypothetical 2026 prices per 1M
        # tokens (refer to actual provider documentation for current pricing).
        pricing_map = {
            "gpt-5.4": {"input_cost_per_million": 2.50, "output_cost_per_million": 15.00},
            "gemini-3.1-pro": {"input_cost_per_million": 2.00, "output_cost_per_million": 12.00},
            "claude-opus-4-6": {"input_cost_per_million": 5.00, "output_cost_per_million": 25.00},
            "gpt-5.4-nano": {"input_cost_per_million": 0.20, "output_cost_per_million": 1.25},
        }
        return pricing_map.get(model_name.lower(),
                               {"input_cost_per_million": 0.0, "output_cost_per_million": 0.0})

    def wrapped_llm_call(self, llm_function, model_name, prompt, *args, **kwargs):
        start_time = time.time()

        # In a real scenario, llm_function would perform the API call and the
        # response would carry token usage, e.g. with an OpenAI-style client:
        #   response = llm_function(model=model_name,
        #                           messages=[{"role": "user", "content": prompt}], **kwargs)
        #   prompt_tokens = response.usage.prompt_tokens
        #   completion_tokens = response.usage.completion_tokens
        #   generated_text = response.choices[0].message.content

        # For this example, mock the LLM response and estimate token counts.
        generated_text = f"This is a simulated response for: {prompt[:50]}..."
        prompt_tokens_estimate = len(prompt) // 4 + 10          # Rough estimate: 1 token ~ 4 chars
        completion_tokens_estimate = len(generated_text) // 4 + 5  # Rough estimate

        # If your LLM provider publishes a tokenizer, use it for accuracy, e.g.:
        #   from tiktoken import encoding_for_model
        #   enc = encoding_for_model(model_name)
        #   prompt_tokens_estimate = len(enc.encode(prompt))
        #   completion_tokens_estimate = len(enc.encode(generated_text))
        prompt_tokens = kwargs.pop("mock_prompt_tokens", prompt_tokens_estimate)
        completion_tokens = kwargs.pop("mock_completion_tokens", completion_tokens_estimate)

        pricing = self._get_model_pricing(model_name)
        input_cost = (prompt_tokens / 1_000_000) * pricing["input_cost_per_million"]
        output_cost = (completion_tokens / 1_000_000) * pricing["output_cost_per_million"]
        total_cost = input_cost + output_cost
        duration = time.time() - start_time

        self._log_usage(
            model_name=model_name,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            cost=total_cost,
            metadata={
                "prompt_length": len(prompt),
                "response_length": len(generated_text),
                "duration_seconds": duration,
                **kwargs.get("metadata", {}),  # Allow passing extra metadata
            },
        )
        return generated_text  # Or the full LLM response object


# Example usage:
# tracker = LLMCostTracker()
#
# # Simulate calling an LLM. In a real scenario, this would be your actual
# # LLM client call returning the provider's response structure.
# def mock_llm_api_call(model, messages, temperature=0.7):
#     print(f"Calling LLM model: {model} with messages: {messages[:1]}...")
#     return {
#         "choices": [{"message": {"content": "This is a very insightful mock response."}}],
#         "usage": {"prompt_tokens": 50, "completion_tokens": 100},
#     }
#
# response = tracker.wrapped_llm_call(
#     llm_function=mock_llm_api_call,
#     model_name="gpt-5.4",
#     prompt="Generate a compelling blog post title about LLM cost optimization.",
#     temperature=0.7,
#     metadata={"feature": "title_generation", "user_id": "test_user_123"},
# )
# print(f"LLM Response: {response[:100]}...")
```
This wrapper writes a JSONL (JSON Lines) log file, making it easy to append new entries and process them later. The _get_model_pricing method is crucial; it needs to be kept up-to-date with the latest pricing from your chosen LLM providers. For instance, I track prices for models like OpenAI's GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.6.
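For reference, each line in the resulting log is a self-contained JSON object along these lines (all values illustrative):

```json
{"timestamp": "2026-01-15T09:30:00.123456", "model": "gpt-5.4", "prompt_tokens": 55, "completion_tokens": 140, "total_tokens": 195, "estimated_cost_usd": 0.0022375, "metadata": {"prompt_length": 230, "response_length": 540, "duration_seconds": 0.84, "feature": "title_generation"}}
```

Because each entry is one line, the file can be tailed, grepped, or streamed into an analysis job without parsing the whole file at once.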
Step 2: Data Collection and Simulation
With the wrapper in place, the next step was to generate representative usage data. I did this in two ways:
- Automated Testing: I integrated the wrapped LLM calls into our existing integration and end-to-end tests. This meant that every time a test ran, it would log the actual token usage for the LLM interactions it performed. This automatically captured a baseline of common use cases.
- Dedicated Simulation Runs: For more targeted forecasting, I created separate scripts that would simulate high-volume, production-like scenarios. This involved processing a diverse dataset of prompts that mirrored expected user inputs, across different features. For example, if a feature involved summarizing long articles, I would feed it a corpus of articles of varying lengths.
This allowed me to collect a rich dataset of prompt_tokens, completion_tokens, and model_name for thousands, if not tens of thousands, of interactions.
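The dedicated simulation runs boil down to replaying a prompt corpus through the wrapper. Below is a minimal sketch of such a driver, assuming a JSONL corpus file and the wrapped_llm_call signature from Step 1 (illustrative wiring, not our exact script):

```python
# Minimal simulation driver: replay a corpus of representative prompts through
# the cost-tracking wrapper so realistic token usage lands in the log file.
# tracker_call is expected to have the signature of
# LLMCostTracker.wrapped_llm_call from Step 1; the corpus format is assumed
# to be JSONL with one {"prompt": "..."} object per line.
import json

def run_simulation(tracker_call, corpus_path, model_name, feature):
    calls = 0
    with open(corpus_path) as f:
        for line in f:
            record = json.loads(line)  # e.g. {"prompt": "Summarize this article..."}
            tracker_call(None, model_name, record["prompt"],
                         metadata={"feature": feature})
            calls += 1
    return calls

# Example: replay a summarization corpus against the mid-tier model
# n = run_simulation(LLMCostTracker().wrapped_llm_call,
#                    "summarization_corpus.jsonl", "gemini-3.1-pro", "summarization")
```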
Step 3: Cost Projection and Analysis
Once I had the usage logs, I wrote a simple Python script to parse the llm_usage_log.jsonl file and perform cost projections. This script aggregates token usage by model and then applies the current pricing rates.
```python
import json
from collections import defaultdict


class LLMCostAnalyzer:
    def __init__(self, log_file="llm_usage_log.jsonl"):
        self.log_file = log_file
        self.pricing_map = {
            "gpt-5.4": {"input_cost_per_million": 2.50, "output_cost_per_million": 15.00},
            "gemini-3.1-pro": {"input_cost_per_million": 2.00, "output_cost_per_million": 12.00},
            "claude-opus-4-6": {"input_cost_per_million": 5.00, "output_cost_per_million": 25.00},
            "gpt-5.4-nano": {"input_cost_per_million": 0.20, "output_cost_per_million": 1.25},
        }

    def analyze_logs(self):
        total_usage = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0,
                                           "calls": 0, "total_cost": 0.0})
        try:
            with open(self.log_file, "r") as f:
                for line in f:
                    entry = json.loads(line)
                    model = entry["model"].lower()
                    total_usage[model]["prompt_tokens"] += entry["prompt_tokens"]
                    total_usage[model]["completion_tokens"] += entry["completion_tokens"]
                    total_usage[model]["calls"] += 1
                    total_usage[model]["total_cost"] += entry["estimated_cost_usd"]
        except FileNotFoundError:
            print(f"Log file {self.log_file} not found. No data to analyze.")
            return {}
        return total_usage

    def project_monthly_cost(self, daily_calls_projection):
        analysis_results = self.analyze_logs()
        projected_monthly_costs = {}
        for model, data in analysis_results.items():
            if data["calls"] > 0:
                avg_prompt_tokens_per_call = data["prompt_tokens"] / data["calls"]
                avg_completion_tokens_per_call = data["completion_tokens"] / data["calls"]
            else:
                avg_prompt_tokens_per_call = 0
                avg_completion_tokens_per_call = 0

            # Project based on an assumed number of daily calls for this model.
            # 'daily_calls_projection' comes from product/traffic estimates.
            projected_daily_total_calls = daily_calls_projection.get(model, 0)

            # Recalculate cost from average tokens rather than average logged cost,
            # to be more robust against small sample sizes in the log file and to
            # reflect the current pricing map accurately.
            pricing = self.pricing_map.get(model, {"input_cost_per_million": 0.0,
                                                   "output_cost_per_million": 0.0})
            projected_daily_input_cost = (avg_prompt_tokens_per_call * projected_daily_total_calls
                                          / 1_000_000) * pricing["input_cost_per_million"]
            projected_daily_output_cost = (avg_completion_tokens_per_call * projected_daily_total_calls
                                           / 1_000_000) * pricing["output_cost_per_million"]
            projected_daily_cost = projected_daily_input_cost + projected_daily_output_cost
            projected_monthly_cost = projected_daily_cost * 30  # Rough monthly projection

            projected_monthly_costs[model] = {
                "projected_daily_calls": projected_daily_total_calls,
                "avg_prompt_tokens_per_call": round(avg_prompt_tokens_per_call),
                "avg_completion_tokens_per_call": round(avg_completion_tokens_per_call),
                "projected_monthly_cost_usd": round(projected_monthly_cost, 2),
            }
        return projected_monthly_costs


# --- Example Usage ---
# tracker = LLMCostTracker()
#
# # Simulate some calls for different models. These would typically be part of
# # your test suite or a dedicated simulation script.
# tracker.wrapped_llm_call(None, "gpt-5.4", "Write a short poem about clouds.", mock_prompt_tokens=30, mock_completion_tokens=80, metadata={"feature": "poetry"})
# tracker.wrapped_llm_call(None, "gpt-5.4", "Elaborate on the scientific process of cloud formation.", mock_prompt_tokens=80, mock_completion_tokens=200, metadata={"feature": "science"})
# tracker.wrapped_llm_call(None, "gemini-3.1-pro", "Summarize this long article about AI ethics...", mock_prompt_tokens=1500, mock_completion_tokens=300, metadata={"feature": "summarization"})
# tracker.wrapped_llm_call(None, "gpt-5.4-nano", "Generate a simple greeting.", mock_prompt_tokens=10, mock_completion_tokens=5, metadata={"feature": "greeting"})
# tracker.wrapped_llm_call(None, "gemini-3.1-pro", "Translate 'Hello world' to Spanish.", mock_prompt_tokens=15, mock_completion_tokens=5, metadata={"feature": "translation"})
# tracker.wrapped_llm_call(None, "claude-opus-4-6", "Write a detailed legal brief for a complex case involving intellectual property.", mock_prompt_tokens=5000, mock_completion_tokens=1500, metadata={"feature": "legal_drafting"})
#
# # Now analyze and project. Assume these daily call volumes per model in production:
# analyzer = LLMCostAnalyzer()
# daily_calls_in_production = {
#     "gpt-5.4": 1000,
#     "gemini-3.1-pro": 500,
#     "claude-opus-4-6": 50,
#     "gpt-5.4-nano": 5000,
# }
# projected_costs = analyzer.project_monthly_cost(daily_calls_in_production)
#
# print("\n--- LLM Monthly Cost Projections (USD) ---")
# for model, data in projected_costs.items():
#     print(f"Model: {model.upper()}")
#     print(f"  Projected Daily Calls: {data['projected_daily_calls']}")
#     print(f"  Avg Prompt Tokens/Call: {data['avg_prompt_tokens_per_call']}")
#     print(f"  Avg Completion Tokens/Call: {data['avg_completion_tokens_per_call']}")
#     print(f"  Projected Monthly Cost: ${data['projected_monthly_cost_usd']:.2f}")
#     print("-" * 30)
#
# # You can also analyze raw usage from the logs:
# # raw_usage = analyzer.analyze_logs()
# # for model, data in raw_usage.items():
# #     print(f"Model: {model.upper()}")
# #     print(f"  Total Calls: {data['calls']}")
# #     print(f"  Total Prompt Tokens: {data['prompt_tokens']}")
# #     print(f"  Total Completion Tokens: {data['completion_tokens']}")
# #     print(f"  Total Estimated Cost (Logged): ${data['total_cost']:.2f}")
# #     print("-" * 30)
```
This script allowed me to get a breakdown like this:
```
--- LLM Monthly Cost Projections (USD) ---
Model: GPT-5.4
  Projected Daily Calls: 1000
  Avg Prompt Tokens/Call: 55
  Avg Completion Tokens/Call: 140
  Projected Monthly Cost: $67.12
------------------------------
Model: GEMINI-3.1-PRO
  Projected Daily Calls: 500
  Avg Prompt Tokens/Call: 758
  Avg Completion Tokens/Call: 152
  Projected Monthly Cost: $50.17
------------------------------
Model: GPT-5.4-NANO
  Projected Daily Calls: 5000
  Avg Prompt Tokens/Call: 10
  Avg Completion Tokens/Call: 5
  Projected Monthly Cost: $1.24
------------------------------
Model: CLAUDE-OPUS-4-6
  Projected Daily Calls: 50
  Avg Prompt Tokens/Call: 5000
  Avg Completion Tokens/Call: 1500
  Projected Monthly Cost: $93.75
------------------------------
```
This output is incredibly powerful. It tells me not just the total projected cost, but also the average token usage per call for each model, which helps me identify potential areas for LLM API cost optimization through prompt engineering. For instance, if I see a high average prompt token count for a model used in a simple task, it's a clear signal to refine that prompt or consider a cheaper model.
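Since every log entry carries a feature tag in its metadata, the same log also supports a per-feature breakdown, which is where prompt-level optimization targets usually show up. A rough sketch (the aggregation is a plain group-by over the log fields defined earlier; cost_by_feature is an illustrative helper name):

```python
import json
from collections import defaultdict

def cost_by_feature(log_path="llm_usage_log.jsonl"):
    """Aggregate logged usage by the 'feature' metadata tag so the most
    expensive features surface first."""
    totals = defaultdict(lambda: {"calls": 0, "prompt_tokens": 0,
                                  "completion_tokens": 0, "cost": 0.0})
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            feature = entry.get("metadata", {}).get("feature", "unknown")
            t = totals[feature]
            t["calls"] += 1
            t["prompt_tokens"] += entry["prompt_tokens"]
            t["completion_tokens"] += entry["completion_tokens"]
            t["cost"] += entry["estimated_cost_usd"]
    # Sort descending by cost: the first entries are the optimization targets.
    return sorted(totals.items(), key=lambda kv: kv[1]["cost"], reverse=True)
```

A feature that sits at the top of this list with a high prompt-token average is exactly the kind of candidate for prompt trimming or a cheaper model.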
Integrating with Cloud Run and Dynamic Batching
Our backend services, which interact with LLMs, are primarily deployed on Cloud Run. This means that latency and cost are tightly coupled. Knowing the projected token usage upfront also helps in making informed decisions about infrastructure. For example, if a specific feature generates very long responses, it might benefit from optimizing LLM inference on Cloud Run with dynamic batching to amortize the startup costs of container instances.
This cost prediction framework doesn't just give me a number; it provides the granular data needed to understand *why* the cost is what it is and where the levers for optimization lie. It helped me confirm that for high-volume, low-complexity tasks, models like GPT-5.4 Nano or Gemini 3.1 Flash-Lite would be significantly more cost-effective than a flagship model.
External Resource: LLM Provider Pricing
Keeping track of LLM pricing is a continuous effort, as providers frequently update their models and pricing structures. I always refer to the official documentation for the most accurate and up-to-date information. For instance, for OpenAI, their official pricing page is the definitive source.
What I Learned / The Challenge
The biggest takeaway from this exercise was the critical importance of proactive cost modeling. Waiting until production to see the bill is a recipe for disaster. LLM costs are not like traditional compute costs; they are directly tied to the conversational and generative nature of the application, making them inherently more variable. Without this framework, I would have been flying blind, making architectural decisions and prompt engineering choices based on intuition rather than data.
One challenge was keeping the pricing_map up-to-date. LLM providers are in a constant state of flux, releasing new models and adjusting prices. This requires continuous monitoring of their announcements. Another was ensuring the "simulation" truly reflected production. It's easy to create an idealized test set, but real user inputs are messy, sometimes unnecessarily verbose, and often lead to unexpected token consumption. Regularly updating the simulation dataset with anonymized production-like prompts is crucial.
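One mitigation I found helpful is moving the pricing map out of code into a small versioned config with a staleness check. A sketch, assuming a hypothetical JSON layout with an "updated" date and a "models" table:

```python
import json
from datetime import datetime, timedelta

def load_pricing(config_path, max_age_days=30):
    """Load the pricing map from a versioned config file and warn when it is
    stale. The config layout is an assumption, e.g.:
    {"updated": "2026-01-01", "models": {"gpt-5.4": {"input_cost_per_million": 2.50, ...}}}"""
    with open(config_path) as f:
        config = json.load(f)
    updated = datetime.fromisoformat(config["updated"])
    if datetime.utcnow() - updated > timedelta(days=max_age_days):
        print(f"WARNING: pricing config last updated {config['updated']}; "
              "re-check provider pricing pages.")
    return config["models"]
```

The warning is deliberately loud rather than fatal: stale prices still produce a useful relative comparison, just not an accurate absolute one.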
This framework also highlighted the value of model selection. By seeing the cost difference between, say, Claude Opus 4.6 and Gemini 3.1 Flash-Lite for a given task, it became evident that we needed to implement intelligent routing to use the most cost-efficient model for each specific use case.
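Such routing can start very simply. A minimal sketch, assuming tasks are tagged with a complexity tier upstream (the tier names and model table here are illustrative, not our production policy):

```python
# Minimal model-routing sketch: pick a cost-appropriate model per task tier.
# Tier names and the routing table are assumptions for illustration.
ROUTES = {
    "simple": "gpt-5.4-nano",        # high-volume, low-complexity tasks
    "standard": "gemini-3.1-pro",    # general-purpose tasks
    "complex": "claude-opus-4-6",    # long-form, high-stakes generation
}

def pick_model(task_tier: str) -> str:
    # Fall back to the mid-tier model for unknown tiers rather than
    # defaulting to the most expensive one.
    return ROUTES.get(task_tier, "gemini-3.1-pro")
```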
Related Reading
If you're looking to dive deeper into optimizing your LLM usage, these posts from our DevLog might be helpful:
- Lowering LLM API Costs: A Deep Dive into Function Calling and Tool Use Optimization: This post explores advanced prompt engineering techniques and how using function calling can drastically reduce tokens by giving the LLM precise control over external tools, thus minimizing verbose, unnecessary output.
- Optimizing LLM Inference on Cloud Run: Dynamic Batching for Cost and Latency: While this post focuses more on infrastructure, understanding the token economics is a prerequisite for making informed decisions about how to deploy and scale your LLM-powered services efficiently on platforms like Cloud Run.
Looking Ahead
My next steps involve integrating this cost tracking more deeply into our CI/CD pipeline. The goal is to automatically flag any pull request that significantly increases the average token usage or projected cost for a given feature. This would provide immediate feedback to developers on the financial impact of their prompt engineering or feature design choices, making cost a first-class metric in our development process. I'm also exploring more sophisticated anomaly detection on the collected usage logs to catch unexpected spikes in our staging environments before they ever hit production. The journey to truly predictable LLM costs is ongoing, but with this framework, I feel much more confident navigating the evolving landscape.
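The CI gate can be sketched as a comparison of per-feature average token counts between a stored baseline and the current branch. A minimal version, assuming those averages are precomputed from the usage logs (the dict-of-floats input format is an assumption):

```python
def check_cost_regression(baseline_avg_tokens, current_avg_tokens, threshold=0.10):
    """Return the features whose average tokens per call grew by more than
    `threshold` (10% by default) relative to a stored baseline. Inputs are
    assumed to be {feature: avg_total_tokens_per_call} dicts built from the
    usage logs of the baseline branch and the PR branch."""
    regressions = {}
    for feature, current in current_avg_tokens.items():
        baseline = baseline_avg_tokens.get(feature)
        if baseline and (current - baseline) / baseline > threshold:
            regressions[feature] = round((current - baseline) / baseline, 3)
    return regressions

# Example: summarization grew from 1800 to 2100 avg tokens/call (+16.7%),
# so the PR would be flagged for review.
# check_cost_regression({"summarization": 1800.0}, {"summarization": 2100.0})
```

A non-empty result would fail (or at least annotate) the pull request, turning projected cost into a reviewable metric alongside tests and lint.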