Optimizing LLM API Costs for Multi-Turn Conversations
Date: 2026-04-02
As the Lead Developer, I've seen my share of unexpected production issues, but few hit the budget quite as directly as a runaway LLM API bill. A couple of months ago, I noticed an alarming trend in our cloud spending report: a significant spike, almost doubling our daily operational costs. After some digging, the culprit became painfully clear: our shiny new AI-powered chat feature, designed to help users refine their content, was bleeding us dry.
The problem wasn't a bug in the traditional sense, but a fundamental misunderstanding of the LLM cost model in the context of multi-turn conversations. We had implemented the most straightforward approach: for every user message, we sent the entire conversation history back to the LLM to maintain context. It worked perfectly, from a functional standpoint. From a financial one? It was a disaster waiting to happen.
The Problem: Context Window Bloat and Exploding Bills
Let me paint a clearer picture of the issue. Our chat feature allowed users to interact with an AI assistant to brainstorm blog post ideas, generate outlines, and refine content snippets. The initial implementation, for simplicity and quick iteration, involved sending the complete conversation history with each new turn. This meant that if a conversation had N turns, the N-th API call would include N-1 previous user and assistant messages, plus the current user message, all within the prompt.
Consider a typical interaction using a model like gpt-4-turbo, with its higher per-token costs (e.g., $0.01 per 1K input tokens, $0.03 per 1K output tokens). Let's assume an average user message is 100 tokens and an assistant response is 200 tokens, plus a 200-token system prompt at the start of each API call.
For a 10-turn conversation (10 user messages, 10 assistant responses):
- Turn 1: 200 (system) + 100 (user) = 300 input tokens. 200 output tokens.
- Turn 2: 200 (system) + 100 (user1) + 200 (assist1) + 100 (user2) = 600 input tokens. 200 output tokens.
- Turn 3: 200 (system) + 100 (u1) + 200 (a1) + 100 (u2) + 200 (a2) + 100 (u3) = 900 input tokens. 200 output tokens.
- ...
- Turn 10: 200 (system) + 9 * (100 user + 200 assist) + 100 (user10) = 3000 input tokens. 200 output tokens.
Summing the input tokens across all 10 turns: 300 + 600 + 900 + ... + 3000 = 16,500 input tokens. Total output tokens: 10 * 200 = 2,000 output tokens.
Total tokens per 10-turn conversation: 16,500 + 2,000 = 18,500 tokens. Cost per conversation: (16,500 / 1000 * $0.01) + (2,000 / 1000 * $0.03) = $0.165 + $0.06 = $0.225.
This might not seem like much, but when you scale this to thousands of users engaging in multiple conversations daily, the numbers quickly become astronomical. With just 10,000 such conversations a day, we were looking at $2,250 daily, or nearly $70,000 a month, just for this one feature! This was simply unsustainable.
My first step was to instrument our API calls to precisely track token usage. You can't optimize what you don't measure. While LLM providers usually return token counts in their responses, it's crucial to have a client-side estimation to prevent exceeding context windows and to proactively manage costs. For this, I leaned on a tokenizer library.
package main
import (
"fmt"
"strings"

// tiktoken-go implements the same BPE encodings the OpenAI API uses
// (e.g., cl100k_base for gpt-4 and gpt-3.5-turbo), so client-side counts
// closely match what the provider bills for.
"github.com/pkoukk/tiktoken-go"
)
// Message represents a single turn in the conversation
type Message struct {
Role string `json:"role"`
Content string `json:"content"`
}
// TokenCounter provides a way to count tokens for a given model.
type TokenCounter struct {
encoder *tiktoken.Tiktoken
}
// NewTokenCounter initializes a new TokenCounter.
// For GPT-3.5/4 models, the "cl100k_base" encoding is the one to use.
func NewTokenCounter() (*TokenCounter, error) {
encoder, err := tiktoken.GetEncoding("cl100k_base")
if err != nil {
return nil, fmt.Errorf("failed to load cl100k_base encoding: %w", err)
}
return &TokenCounter{encoder: encoder}, nil
}
// CountTokens estimates the number of tokens in a given text.
// With a model-matched encoding this closely tracks the LLM API's internal
// count, which is crucial for managing the context window and estimating costs.
func (tc *TokenCounter) CountTokens(text string) int {
if tc.encoder == nil {
// Fallback if the encoder isn't initialized: a simple word count
// is a very rough proxy, but better than nothing.
return len(strings.Fields(text))
}
// tiktoken-go's Encode never returns an error; special-token handling is
// controlled via the two slice arguments (nil disables both).
return len(tc.encoder.Encode(text, nil, nil))
}
// CalculateConversationTokens calculates the total tokens for a slice of messages,
// including boilerplate for a typical API call structure.
// NOTE: This is a simplified calculation. Real API calls have specific token counts
// for roles, separators, and other overhead. Always consult your LLM provider's
// documentation for exact calculation rules. For OpenAI models, it's typically
// 3-4 tokens per message for metadata.
func (tc *TokenCounter) CalculateConversationTokens(messages []Message, systemPrompt string) int {
totalTokens := 0
// System prompt tokens
if systemPrompt != "" {
totalTokens += tc.CountTokens(systemPrompt)
totalTokens += 3 // For system role overhead
}
// Message tokens
for _, msg := range messages {
totalTokens += tc.CountTokens(msg.Role) // Role tokens (e.g., "user", "assistant")
totalTokens += tc.CountTokens(msg.Content) // Content tokens
totalTokens += 4 // Per-message overhead (e.g., <|im_start|>user\ncontent<|im_end|>\n)
}
// Add a few tokens for the final assistant response start (e.g., <|im_start|>assistant)
totalTokens += 3
return totalTokens
}
With token counting in place, I could clearly see the token usage for each turn escalating rapidly. The next step was to implement strategies to rein in that context window. This wasn't just about saving money; it was also about staying within the LLM's maximum context length, which, while large, isn't infinite, especially for more complex use cases.
Strategy 1: The Sliding Window – Simple but Effective
The simplest and often most effective initial approach to managing conversation context is the "sliding window." The idea is straightforward: only send the most recent N messages, or messages that fit within a predefined token limit, to the LLM. This ensures that the model always has the most immediate context, while discarding older, potentially less relevant information.
The primary benefit is its ease of implementation and predictable token usage. The downside, of course, is that older context is lost. For short, focused conversations, this is often perfectly acceptable. For longer, more intricate discussions where earlier points might become relevant again, it can lead to a "forgetful" AI.
My implementation in Go involved a function that would prune the message history slice. I opted for a token-based window rather than a fixed number of messages, as message length can vary wildly. This provides more consistent control over the actual token count.
// manageSlidingWindow prunes the conversation history to fit within a token limit.
// It prioritizes recent messages, removing older ones until the total conversation
// tokens (including the system prompt) are below maxConversationTokens.
func (tc *TokenCounter) manageSlidingWindow(history []Message, systemPrompt string, maxConversationTokens int) []Message {
if len(history) == 0 {
return []Message{}
}
// Start with the full history and iteratively remove the oldest messages
// until the token count is within the limit.
// We always want to keep at least the most recent user message if possible.
startIndex := 0
for startIndex < len(history) {
currentHistory := history[startIndex:]
currentTokens := tc.CalculateConversationTokens(currentHistory, systemPrompt)
if currentTokens <= maxConversationTokens {
if startIndex > 0 {
fmt.Printf("Pruned history: Reduced from %d to %d messages. Tokens from %d to %d.\n",
len(history), len(currentHistory), tc.CalculateConversationTokens(history, systemPrompt), currentTokens)
}
return currentHistory
}
startIndex++
}
// If the loop completes without returning, even the single most recent message
// (plus the system prompt) exceeds the limit -- unlikely with reasonable limits,
// but possible. Return an empty slice and let the caller decide how to handle it
// (e.g., truncate the message or surface an error).
return []Message{}
}
This simple change immediately brought the per-conversation token usage down significantly. Instead of a linearly increasing cost, it plateaued once the conversation length exceeded the window size. This alone cut our costs by roughly 60% for longer conversations.
Strategy 2: Summarization – Compressing Context
While the sliding window was effective, the "forgetful AI" problem persisted for users who engaged in deep, multi-threaded discussions. This is where summarization came into play. The idea is to periodically take a chunk of the older conversation history, send it to the LLM with a specific instruction to summarize it, and then replace that chunk with its concise summary. This "lossy compression" allows us to retain the essence of older context without carrying the full token load.
The benefits are clear: a more coherent long-term memory for the AI, while still managing token counts. The challenges, however, are significant. Summarization quality depends heavily on prompt engineering, and there's always a risk of the LLM hallucinating or omitting critical details during the summarization process. This requires careful testing and iteration on the summarization prompt. I found that a clear, directive prompt instructing the LLM to retain key facts, decisions, and discussion points worked best.
I typically trigger summarization when the conversation history reaches a certain token threshold, but before it hits the hard limit of the sliding window. The summary itself then becomes part of the "system" or "context" messages for subsequent turns.
// LLMClient is a mock struct for demonstration purposes,
// representing your actual LLM API client (e.g., OpenAI, Google Gemini).
type LLMClient struct{}
// Chat simulates an LLM API call, returning a mock response.
// In a real scenario, this would involve HTTP requests to the LLM provider.
func (c *LLMClient) Chat(messages []Message) (Message, error) {
// Simulate API latency and response
fmt.Println("Calling LLM API...")
// In a real application, you'd parse the actual LLM response.
// For this example, we'll return a simple mock message.
for _, msg := range messages {
if strings.Contains(strings.ToLower(msg.Content), "summarize") {
return Message{Role: "assistant", Content: "The previous discussion revolved around optimizing LLM API costs for multi-turn conversations by implementing context management strategies like sliding windows and summarization. Key points included token counting and the trade-offs between context fidelity and cost."}, nil
}
}
return Message{Role: "assistant", Content: "Understood. How can I help further?"}, nil
}
// summarizeConversation takes a chunk of conversation history and asks the LLM to summarize it.
// This is a key part of managing long-running contexts without losing all past information.
func summarizeConversation(tc *TokenCounter, llmClient *LLMClient, history []Message, summarizationSystemPrompt string) (string, error) {
// Construct a prompt to ask the LLM to summarize the provided history.
// The quality of this prompt is critical for effective summarization.
summarizationUserPrompt := "The following is a segment of a conversation. Please summarize it concisely, retaining all key facts, decisions, and important discussion points. Focus on the core topic and outcomes, and avoid conversational filler. Your summary should be usable as context for future turns.\n\n"
var conversationTextBuilder strings.Builder
for _, msg := range history {
conversationTextBuilder.WriteString(fmt.Sprintf("%s: %s\n", msg.Role, msg.Content))
}
summarizationUserPrompt += conversationTextBuilder.String()
// Prepare messages for the LLM call.
messagesForLLM := []Message{
{Role: "system", Content: summarizationSystemPrompt},
{Role: "user", Content: summarizationUserPrompt},
}
response, err := llmClient.Chat(messagesForLLM)
if err != nil {
return "", fmt.Errorf("failed to get summarization from LLM: %w", err)
}
return response.Content, nil
}
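What the snippet above doesn't show is the glue that decides *when* to summarize. Below is a sketch of that trigger logic. To keep it self-contained, the token counter and the summarization call are injected as plain functions (in our real code these are the `TokenCounter` and the `summarizeConversation` helper), and the threshold and keep-recent values are placeholders, not tuned numbers:

```go
package main

import "fmt"

type Msg struct{ Role, Content string } // mirrors the Message type above

// compactHistory triggers summarization once the history's token estimate
// crosses threshold: everything but the most recent keepRecent messages is
// summarized and replaced with a single system-style context note.
func compactHistory(history []Msg, countTokens func([]Msg) int,
	summarize func([]Msg) (string, error), threshold, keepRecent int) ([]Msg, error) {
	if countTokens(history) <= threshold || len(history) <= keepRecent {
		return history, nil // below the trigger point; nothing to do yet
	}
	older := history[:len(history)-keepRecent]
	recent := history[len(history)-keepRecent:]

	summary, err := summarize(older)
	if err != nil {
		// On failure, keep the raw history rather than silently dropping context.
		return history, fmt.Errorf("summarization failed: %w", err)
	}
	compacted := append(
		[]Msg{{Role: "system", Content: "Summary of earlier conversation: " + summary}},
		recent...)
	return compacted, nil
}

func main() {
	history := []Msg{{"user", "q1"}, {"assistant", "a1"}, {"user", "q2"}, {"assistant", "a2"}, {"user", "q3"}}
	countTokens := func(ms []Msg) int { return 100 * len(ms) } // fake: 100 tokens/message
	summarize := func(ms []Msg) (string, error) { return "user explored q1 and q2", nil }
	compacted, _ := compactHistory(history, countTokens, summarize, 300, 2)
	fmt.Println(len(compacted)) // 3: one summary note plus the two most recent messages
}
```

Failing open (returning the raw history on summarization errors) is a deliberate choice: a momentarily expensive turn beats a conversation that suddenly loses its memory.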
This strategy significantly improved the AI's ability to recall older context without inflating token counts. However, it introduced a new cost: the summarization API calls themselves. It's a trade-off, but for valuable, long-running conversations, the improved user experience and retained context often justify the additional summarization cost, especially when compared to sending the entire uncompressed history. For more on the nuances of prompt engineering for cost, I've previously written about Lowering LLM API Costs: A Deep Dive into Function Calling and Tool Use Optimization, which is highly relevant here.
Strategy 3: Retrieval-Augmented Generation (RAG) – Beyond Simple Context
For scenarios demanding truly long-term memory or access to external knowledge beyond the immediate conversation, even summarization has its limits. This is where Retrieval-Augmented Generation (RAG) shines. RAG allows us to store conversation turns, key facts extracted from them, or even entire knowledge base documents in a vector database. When a new user query comes in, we retrieve the most semantically relevant pieces of information from this database and inject them into the LLM's prompt.
This approach complements sliding windows and summarization beautifully. Instead of trying to cram *all* context into the prompt, RAG enables us to bring in *only the most relevant* context. This is particularly powerful for maintaining state across sessions, referencing detailed user profiles, or connecting conversations to a broader knowledge base.
The full implementation of a robust RAG system is a substantial undertaking, involving embedding models, vector databases (like Pinecone, Weaviate, or Qdrant), and robust retrieval logic. However, the conceptual flow is critical to understand for advanced context management:
- Embed and Store: Each user message, assistant response, or even a summary of a conversation segment, is converted into a numerical vector (an embedding) and stored in a vector database, along with its original text.
- Retrieve: When a new user message arrives, it is also embedded. This query embedding is then used to search the vector database for the most similar (semantically relevant) past messages or facts.
- Augment: The retrieved text chunks are then added to the LLM's prompt, along with the current conversation window, providing highly targeted context.
// Mock structs for demonstration purposes.
// In a real application, these would be concrete clients for your embedding model
// (e.g., OpenAI Embeddings, Google's text-embedding-004) and your vector database.
type VectorDBClient struct{}
type EmbeddingClient struct{}
type Embedding struct {
Vector []float32
}
type RetrievedChunk struct {
Text string
// Other metadata like timestamp, source, etc., could be stored here.
}
// Embed simulates calling an embedding model to convert text into a vector.
func (c *EmbeddingClient) Embed(text string) (Embedding, error) {
fmt.Printf("Generating embedding for: '%s' (truncated)\n", text[:min(len(text), 50)])
// In a real scenario, this would be an API call.
// For now, return a dummy embedding.
return Embedding{Vector: []float32{0.1, 0.2, 0.3, 0.4, 0.5}}, nil
}
// Search simulates querying a vector database for relevant chunks.
func (c *VectorDBClient) Search(query Embedding, userID string, limit int) ([]RetrievedChunk, error) {
fmt.Printf("Searching vector DB for user %s with query embedding...\n", userID)
// In a real scenario, this would query your vector database.
// For now, return some mock relevant chunks.
return []RetrievedChunk{
{Text: "User previously mentioned wanting to write a blog post about LLM cost optimization."},
{Text: "Key discussion point: the trade-offs between different context management strategies."},
{Text: "A previous draft outline included sections on token counting and sliding windows."},
}, nil
}
// retrieveRelevantContext orchestrates the RAG process.
func retrieveRelevantContext(query string, userID string, vectorDBClient *VectorDBClient, embeddingModel *EmbeddingClient) ([]string, error) {
// 1. Embed the user's current query.
queryEmbedding, err := embeddingModel.Embed(query)
if err != nil {
return nil, fmt.Errorf("failed to embed query: %w", err)
}
// 2. Query the vector database for relevant chunks of past conversation or knowledge base.
relevantChunks, err := vectorDBClient.Search(queryEmbedding, userID, 3) // Get top 3 relevant chunks
if err != nil {
return nil, fmt.Errorf("failed to search vector database: %w", err)
}
// 3. Extract the text content from the retrieved chunks.
var contextTexts []string
for _, chunk := range relevantChunks {
contextTexts = append(contextTexts, chunk.Text)
}
return contextTexts, nil
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
While RAG adds architectural complexity, its power to provide highly specific and relevant context for very long or knowledge-intensive conversations is unmatched. It effectively decouples the "memory" from the immediate context window, allowing for much richer and more accurate interactions without the prohibitive cost of sending massive prompts. For deeper dives into RAG, I highly recommend exploring the official documentation of leading vector database providers or resources like Pinecone's comprehensive guides on RAG.
Putting It All Together: A Hybrid Approach
In the end, I found that no single strategy was a silver bullet. The most effective solution was a hybrid approach, combining these techniques based on the specific needs of the conversation and the stage it was in:
- Initial Turns (Short-term memory): For the first few turns of a conversation, a simple sliding window strategy was sufficient. The context is fresh, and the token count is low.
- Mid-Conversation (Context Compression): Once the token count approached a predefined threshold (e.g., 50% of the maximum context window), I would trigger a summarization step. The older, summarized history would then replace the raw messages in the sliding window, effectively compressing the context.
- Long-term/Knowledge-intensive (Augmentation): For features requiring recall of past user preferences, previous conversation threads from days ago, or specific knowledge base articles, I integrated a RAG component. Key facts or summaries would be extracted, embedded, and stored in a vector database. When a new query came in, relevant information would be retrieved and prepended to the LLM prompt.
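Condensed into code, that layered policy is roughly this dispatch function. The 50%-of-window threshold is the illustrative value mentioned above, not a tuned one, and the real system combines strategies rather than picking exactly one:

```go
package main

import "fmt"

// pickStrategy encodes the hybrid policy: RAG when long-term recall is needed,
// summarization once the history crosses half the context window, and a plain
// sliding window otherwise.
func pickStrategy(historyTokens, maxContextTokens int, needsLongTermRecall bool) string {
	switch {
	case needsLongTermRecall:
		return "rag"
	case historyTokens > maxContextTokens/2:
		return "summarize"
	default:
		return "sliding-window"
	}
}

func main() {
	fmt.Println(pickStrategy(1200, 8000, false)) // sliding-window
	fmt.Println(pickStrategy(4500, 8000, false)) // summarize
	fmt.Println(pickStrategy(300, 8000, true))   // rag
}
```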
This layered approach allowed me to strike a balance between context fidelity, cost efficiency, and implementation complexity. The impact on our metrics was dramatic. My initial estimate of $2,250/day for 10,000 conversations dropped to roughly $350/day – an 84% reduction! This was achieved by:
- Reducing average input tokens per turn by 70-80% for longer conversations.
- Minimizing the number of expensive `gpt-4-turbo` calls by summarizing more efficiently.
- Ensuring we only paid for the *relevant* context, not the entire historical transcript.
The monitoring dashboards, which previously showed a scary upward trend in token usage, now displayed a much more stable and predictable pattern. This also had a positive side effect on the overall system performance, reducing the load on our Cloud Run instances by making API calls more efficient, though that was a separate battle I detailed in Go Cloud Run: Debugging and Fixing Persistent Connection Leaks.
What I Learned / The Challenge
This journey taught me several invaluable lessons about building AI-powered applications:
- Cost is a First-Class Concern: For LLM-powered applications, cost is not an afterthought; it's a core architectural constraint. Naive implementations can quickly spiral out of control.
- Token Counting is Non-Negotiable: You absolutely must have accurate token counting in your client-side logic to predict costs, manage context windows, and prevent errors.
- Trade-offs are Inherent: Every optimization strategy involves trade-offs between context fidelity, cost, and implementation complexity. There’s no one-size-fits-all solution; the best approach is context-dependent.
- Iterate and Monitor: LLM optimization is an iterative process. Continuously monitor token usage, API costs, and user feedback to refine your context management strategies.
- Prompt Engineering for Summarization is Key: The quality of your summarization directly impacts the AI's long-term memory. Invest time in crafting effective summarization prompts.
The biggest challenge wasn't just implementing these techniques, but understanding *when* and *how* to apply them in combination to get the best results without compromising the user experience too much. Balancing the desire for a perfectly "remembering" AI with the realities of API costs is a constant tension.
Related Reading
- Lowering LLM API Costs: A Deep Dive into Function Calling and Tool Use Optimization: This post delves into broader LLM cost reduction strategies, including efficient prompt engineering and leveraging function calling. It provides excellent foundational knowledge for understanding the cost implications discussed here.
- Go Cloud Run: Debugging and Fixing Persistent Connection Leaks: While not directly about LLM costs, this post covers the infrastructure challenges we faced on Cloud Run. The overall performance and stability of our services are intertwined, and efficient LLM usage contributes to a healthier, more cost-effective backend.
Looking ahead, I'm eager to explore more advanced RAG techniques, such as multi-hop retrieval and fine-tuning embedding models for our specific domain. I also want to experiment with dynamic context window resizing based on conversation complexity or user segment. The world of LLM optimization is constantly evolving, and staying on top of these techniques is crucial for building scalable, cost-effective AI applications. This journey has been a stark reminder that even with cutting-edge AI, the fundamentals of software engineering – measurement, iteration, and thoughtful architecture – remain paramount.