LLM Cost Optimization: Context Window Management in Go

Effective LLM cost optimization is achieved by implementing token-aware pruning, summarization buffers, and context caching to reduce input token volume. These strategies prevent quadratic cost growth in Go applications by ensuring only relevant data remains in the active context window.

I woke up on a Monday morning last month to a Slack alert from our GCP billing bot that I originally thought was a bug. Our Vertex AI spend for a single project had spiked by $4,200 over the weekend. For a mid-sized startup, that is not just a "rounding error"—it is a "please explain this to the CTO" moment. When I dug into the logs, the culprit was obvious and, in hindsight, entirely my fault. We had rolled out a new multi-turn conversation feature for our AI assistant, and I had been lazy with how I handled the conversation history. I was simply appending every new message to the context window and sending the entire blob back to the model for every single turn.

As the conversations grew longer, the input token count exploded. Because LLM pricing is linear in input tokens while conversation history grows cumulatively, we were paying for the same messages over and over, with the total session cost increasing quadratically as the conversation progressed. This incident forced me to rethink our entire approach to context management. I spent the next two weeks building a more intelligent, token-aware pruning system in Go. This post is a breakdown of the strategies I implemented to bring our costs back under control without sacrificing the model's "memory."

Why Growing Context Windows Create a Quadratic Cost Problem

Unmanaged conversation history leads to quadratic cost increases because every new message requires resending the entire previous context to the LLM. Most developers treat the context window like an infinite bucket. With models supporting 128k, 200k, or even 1 million tokens, it is tempting to just throw everything in there. However, the math of LLM APIs is brutal. If a user has a 20-turn conversation and each turn is 500 tokens, by the 20th turn, you are sending 10,000 tokens of history just to get a 200-token response. You are billed for those 10,000 tokens every time the user hits "send."
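
To make that arithmetic concrete, here is a minimal simulation of the billing pattern described above (the numbers match the 20-turn, 500-token example; everything else is a sketch, not production code):

```go
package main

import "fmt"

// simulateSession returns the final history size and the cumulative
// input tokens billed when the full history is resent on every turn.
func simulateSession(turns, tokensPerTurn int) (history, totalBilled int) {
	for turn := 0; turn < turns; turn++ {
		history += tokensPerTurn // the new message joins the context
		totalBilled += history   // the entire history is billed again
	}
	return history, totalBilled
}

func main() {
	history, total := simulateSession(20, 500)
	fmt.Printf("final history: %d tokens, total billed: %d tokens\n", history, total)
}
```

Because each turn re-bills every prior turn, the total is tokensPerTurn × n(n+1)/2: carrying a 10,000-token history across 20 turns means paying for 105,000 input tokens in total.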

I realized that 80% of that history was noise. The model didn't need to know the exact phrasing of a greeting from ten minutes ago to answer a technical question now. My first step was to quantify the waste. I used the techniques I wrote about in my previous post on building a real-time LLM API cost dashboard with OpenTelemetry to visualize the ratio of input tokens to output tokens. The data was damning: our input-to-output ratio was nearly 50:1 in long sessions. We were effectively burning money to remind the model of things it had already processed.

How to Implement a Token-Aware Sliding Window in Go

Implementing a token-aware sliding window ensures that context limits are respected based on actual token counts rather than simple message volume. The simplest approach is a sliding window, but the "naive" version—just keeping the last N messages—is dangerous. One very long message (like a pasted log file) can push out all the relevant context. I needed a window based on actual token counts, not message counts. In Go, this meant I couldn't just use len(messages). I had to integrate a tokenizer into the backend to calculate the weight of each message before deciding what to prune.

I used a Go implementation of the Tiktoken library to handle this. Here is the core logic I wrote to ensure our context stays under a specific "hard limit" while prioritizing the most recent interactions:

type Message struct {
    Role    string
    Content string
    Tokens  int // pre-computed with the tokenizer when the message is stored
}

func (s *ConversationService) PruneContext(messages []Message, maxTokens int) []Message {
    if len(messages) == 0 {
        return messages
    }

    // We always want to keep the System Prompt at the start
    systemPrompt := messages[0]
    currentTokens := systemPrompt.Tokens

    // Walk backwards from the most recent message to find the oldest
    // message we can still afford under the token budget
    start := len(messages)
    for i := len(messages) - 1; i > 0; i-- {
        if currentTokens+messages[i].Tokens > maxTokens {
            break
        }
        currentTokens += messages[i].Tokens
        start = i
    }

    // Slicing preserves chronological order without rebuilding the
    // slice one prepend at a time
    return append([]Message{systemPrompt}, messages[start:]...)
}

By switching to this token-aware pruning, I immediately saw a 25% reduction in average input tokens. However, this introduced a new problem: the "Goldfish Effect." The model would suddenly "forget" the beginning of the conversation. If a user asked a question in turn 2 and referred back to it in turn 25, the model was lost. This led me to my next optimization.

LLM Cost Optimization via Summarization Buffers

The summarization buffer pattern achieves LLM cost optimization by replacing long conversation histories with a single, concise summary block. To solve the memory loss issue, I implemented a summarization strategy. Instead of just deleting old messages, I would take the "overflow" messages and ask a cheaper model (like Gemini Flash or GPT-4o-mini) to summarize them into a concise paragraph. This summary is then injected into the context as a "Memory" block.

This is significantly more cost-effective. You might spend 300 tokens once to generate a summary that replaces 5,000 tokens of raw chat history. That summary then stays in the context for the rest of the session. I found that this works best when you trigger summarization in "chunks" rather than for every message. For example, once the history hits 4,000 tokens, summarize the oldest 2,000 tokens and replace them with the summary.
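
Here is a sketch of that chunked trigger logic. The Summarizer interface and the stub implementation are my own placeholders for the cheap-model call, not SDK types; in production the stub would be a Gemini Flash or GPT-4o-mini client:

```go
package main

import "fmt"

type Message struct {
	Role    string
	Content string
	Tokens  int
}

// Summarizer abstracts the "cheap model" call. This interface is a
// sketch of the pattern, not a real SDK type.
type Summarizer interface {
	Summarize(msgs []Message) (Message, error)
}

// CompactHistory leaves the history untouched while it fits under
// triggerTokens; once it overflows, it condenses roughly chunkTokens
// of the oldest messages into a single "Memory" block.
func CompactHistory(msgs []Message, triggerTokens, chunkTokens int, s Summarizer) ([]Message, error) {
	total := 0
	for _, m := range msgs {
		total += m.Tokens
	}
	if total < triggerTokens {
		return msgs, nil // still under budget, nothing to do
	}

	// Collect the oldest messages up to chunkTokens.
	cut, used := 0, 0
	for cut < len(msgs) && used+msgs[cut].Tokens <= chunkTokens {
		used += msgs[cut].Tokens
		cut++
	}
	if cut == 0 {
		return msgs, nil
	}

	summary, err := s.Summarize(msgs[:cut])
	if err != nil {
		return nil, err
	}
	// The summary replaces the raw chunk at the front of the history.
	return append([]Message{summary}, msgs[cut:]...), nil
}

// stubSummarizer stands in for the cheap-model call in this sketch.
type stubSummarizer struct{}

func (stubSummarizer) Summarize(msgs []Message) (Message, error) {
	return Message{Role: "system", Content: "Memory: (condensed history)", Tokens: 300}, nil
}

func main() {
	history := make([]Message, 5)
	for i := range history {
		history[i] = Message{Role: "user", Content: fmt.Sprintf("msg %d", i), Tokens: 1000}
	}
	// 5,000 tokens of history crosses the 4,000-token trigger, so the
	// oldest ~2,000 tokens collapse into one 300-token summary.
	compacted, _ := CompactHistory(history, 4000, 2000, stubSummarizer{})
	fmt.Println(len(compacted))
}
```

Keeping the trigger and chunk sizes separate is deliberate: summarizing in large chunks amortizes the summarization call, while the trigger controls how big the raw tail is allowed to get.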

I actually discussed the logic of using cheaper models for intermediate steps in my post on reducing LLM API costs with strategic prompt chaining. The same principle applies here: use the "expensive" model only for the final reasoning, and use the "cheap" model for the administrative task of managing history.

LLM Cost Optimization with Vertex AI Context Caching

Vertex AI Context Caching allows developers to store large context blocks on the server to achieve up to a 90% discount on input tokens, significantly aiding LLM cost optimization. Since we are heavily invested in GCP, I started looking into Vertex AI Context Caching. This is a game-changer for cost optimization, but it is often misunderstood. Context caching allows you to "freeze" a large block of tokens (like a massive documentation set or a long conversation history) on the server side. You pay a small storage fee, but the input tokens for that cached block are significantly discounted—sometimes up to 90% cheaper on subsequent calls.

The catch is the TTL (Time To Live). If your cache expires too quickly, you pay the full price to "warm" it again. I had to write a wrapper in Go to manage these cache IDs and associate them with our session IDs. Here is the pattern I used for managing the cache lifecycle:

func (s *GCPManager) GetOrUpdateCache(ctx context.Context, sessionID string, content []Part) (string, error) {
    cacheID := s.db.GetCacheID(sessionID)

    if cacheID != "" {
        // Check if the cache is still valid and extend its TTL
        err := s.vertexClient.UpdateCacheTTL(ctx, cacheID, time.Now().Add(1*time.Hour))
        if err == nil {
            return cacheID, nil
        }
    }

    // Create a new cache if none exists or the update failed
    newCache, err := s.vertexClient.CreateContextCache(ctx, content)
    if err != nil {
        return "", err
    }

    s.db.SaveCacheID(sessionID, newCache.ID)
    return newCache.ID, nil
}

Using context caching reduced our per-request cost for long-running sessions from $0.12 to about $0.015. However, I learned the hard way that you shouldn't cache everything. Caching small fragments is actually more expensive due to the overhead. I set a threshold: only cache if the context exceeds 32,000 tokens.
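
The threshold decision comes down to a break-even calculation. Here is a rough sketch of that math; all prices are illustrative placeholders, not actual Vertex AI rates, so plug in the current pricing before relying on it:

```go
package main

import "fmt"

// cachingSaves estimates whether server-side context caching is cheaper
// than resending the context inline. Prices are per 1K tokens and are
// placeholder values -- substitute the real rates for your model.
func cachingSaves(contextTokens, calls int, inputPer1K, cachedPer1K, storageFlat float64) bool {
	kTokens := float64(contextTokens) / 1000

	// Without caching: the full context is billed at input price on every call.
	withoutCache := kTokens * inputPer1K * float64(calls)

	// With caching: the first call warms the cache at full price, the
	// rest pay the discounted cached rate, plus a storage fee for the
	// cache's lifetime.
	withCache := kTokens*inputPer1K + kTokens*cachedPer1K*float64(calls-1) + storageFlat

	return withCache < withoutCache
}

func main() {
	// A 40k-token context reused across 10 calls, with a ~90% discount
	// on cached input tokens (illustrative numbers only).
	fmt.Println(cachingSaves(40000, 10, 0.10, 0.01, 2.00))
	// A tiny 2k-token context: the storage overhead eats the discount.
	fmt.Println(cachingSaves(2000, 10, 0.10, 0.01, 2.00))
}
```

Running the numbers like this is what convinced me that small fragments are not worth caching: the flat storage overhead dominates until the context is large and reused often.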

Solving the 'Lost in the Middle' Performance Regression

Combining a sliding window with semantic retrieval prevents the 'Lost in the Middle' problem while keeping token counts low and maintaining model accuracy. While optimizing for cost, I ran into a performance regression. The model's accuracy dropped when I used very large context windows (even when cached). This is the "Lost in the Middle" phenomenon, where LLMs struggle to retrieve information located in the center of a long prompt. They are much better at recalling information at the very beginning or the very end.

To combat this, I modified my Go pruning service to prioritize "Recency" and "Relevance" rather than just a linear window. I implemented a simple RAG (Retrieval-Augmented Generation) layer for the conversation history itself. Instead of sending the last 50 messages, I would send the last 5 messages, plus the 5 most semantically relevant messages from the history based on vector similarity. This kept the context window small (saving money) while ensuring the most important information was present (maintaining quality).
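
Here is a minimal sketch of that selection step. It assumes each history message already has a precomputed embedding (how the vectors are produced, e.g. with an embedding model, is outside this snippet), and ranks older messages by cosine similarity to the current query:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// EmbeddedMessage pairs a history message with a precomputed embedding
// vector. The type is a sketch for this post, not a library type.
type EmbeddedMessage struct {
	Content   string
	Embedding []float64
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// SelectContext keeps the last `recent` messages for recency, plus the
// `relevant` older messages most similar to the query embedding.
func SelectContext(history []EmbeddedMessage, query []float64, recent, relevant int) []EmbeddedMessage {
	if len(history) <= recent {
		return history
	}
	older, latest := history[:len(history)-recent], history[len(history)-recent:]

	// Rank the older messages by similarity to the current query.
	ranked := make([]EmbeddedMessage, len(older))
	copy(ranked, older)
	sort.Slice(ranked, func(i, j int) bool {
		return cosine(ranked[i].Embedding, query) > cosine(ranked[j].Embedding, query)
	})
	if relevant > len(ranked) {
		relevant = len(ranked)
	}

	// Relevant history first, then the recent tail in original order.
	return append(ranked[:relevant], latest...)
}

func main() {
	history := []EmbeddedMessage{
		{Content: "TLS question", Embedding: []float64{1, 0}},
		{Content: "small talk", Embedding: []float64{0, 1}},
		{Content: "recent A", Embedding: []float64{0.5, 0.5}},
		{Content: "recent B", Embedding: []float64{0.2, 0.8}},
	}
	query := []float64{1, 0} // the user is asking about TLS again
	for _, m := range SelectContext(history, query, 2, 1) {
		fmt.Println(m.Content)
	}
}
```

Placing the retrieved messages before the recent tail also works with the "Lost in the Middle" finding: the most important items sit at the edges of the prompt rather than buried in its center.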

Key Takeaways for Production LLM Cost Optimization

Successful LLM cost optimization requires moving state management out of the prompt and into application logic while maintaining strict visibility over token usage. After two weeks of refactoring and monitoring, here is what I now consider "best practice" for context management:

  • Never trust the model to manage its own memory. If you ask the model to "remember this," you are paying for every word of that memory in every subsequent turn. Manage state in your application code, not in the prompt.
  • Token counting is non-negotiable. You cannot build a production LLM app without a local tokenizer. If you are guessing token counts based on character length (e.g., characters/4), you are going to have edge cases that blow up your budget.
  • Visibility is the first step to optimization. I wouldn't have caught the $4,200 spike until the end of the month if I hadn't set up granular alerting. If you haven't already, read my guide on building cost dashboards to get ahead of this.
  • Summarization is a lossy compression. It saves money, but you lose nuance. I found that saving the original messages in a database (Bigtable or Firestore) and only using summaries for the LLM context is the best of both worlds. If a user asks for a specific quote, you can fetch the raw data via a tool call or RAG.

Final Thoughts on LLM Cost Optimization

The landscape of LLM pricing is changing fast—with many providers now offering "batch" pricing and "cached" pricing—but the fundamental problem of context management remains. My next goal for LLM cost optimization is to automate the switching between different pruning strategies based on the predicted "value" of a conversation. If a user is just chatting, we prune aggressively. If they are in a high-value troubleshooting session, we expand the window and use more expensive caching. Building these kinds of "intelligent backends" is where I think the real engineering challenge lies for the next year.
