How I Reduced LLM API Costs with a Custom Tokenization Strategy

You can reduce LLM API costs by replacing naive string truncation with a token-aware sliding window strategy that prioritizes system prompts and recent messages. By calculating token counts locally in Go before sending requests, developers can prune redundant context and lower billing by over 40% without sacrificing model accuracy.

Last month, I woke up to a GCP billing alert that made my stomach drop. My Vertex AI spend had jumped from a manageable $45 a day to nearly $140. For a mid-sized RAG (Retrieval-Augmented Generation) application, that kind of volatility isn't just a rounding error; it’s a signal that the architecture is fundamentally inefficient. When I dug into the Cloud Console, the culprit was obvious: input token counts were exploding. We were blindly feeding massive amounts of context into our prompts, paying for every redundant word, and hitting context window limits that triggered expensive retries. Managing LLM API costs became our top priority.

The problem wasn't the model's performance. It was my own lazy approach to context management. I was using simple string slicing to "trim" chat history before sending it to the LLM. I assumed that if I kept the last 10,000 characters, I’d be safe. I was wrong. Characters do not equal tokens, and in the world of LLM billing, tokens are the only currency that matters. To fix this, I had to build a custom, token-aware buffering strategy in Go that could precisely manage what we sent to the API, ensuring we maximized the utility of every cent spent.

In this post, I’ll walk through the technical failures of naive truncation, the Go implementation of a token-limited sliding window, and how this strategy saved our production budget without degrading the quality of the AI's responses.

Why Naive String Truncation Increases LLM API Costs

Naive string truncation leads to invalid UTF-8 characters and unpredictable token counts that inflate LLM API costs. When I first built the backend for our AI-driven support tool, I took a shortcut. I knew that models like Gemini and GPT-4 have context limits, so I implemented a basic utility function to truncate the chat history. It looked something like this:

// The naive approach: slice the raw bytes and hope for the best.
func truncateHistory(history string, maxChars int) string {
    if len(history) > maxChars {
        // len() counts bytes, not runes or tokens, so this slice
        // can land in the middle of a multi-byte UTF-8 character.
        return history[len(history)-maxChars:]
    }
    return history
}

This is a disaster for three reasons. First, because of UTF-8 encoding, slicing a string at an arbitrary byte index can cut a multi-byte character in half, producing an invalid string and, in the worst case, API errors. Second, different models use different tokenizers: a 1,000-character string might be 200 tokens or 500 tokens depending on the vocabulary and the prevalence of special characters or code snippets. Third, and most importantly, this "dumb" truncation often cuts off the most important part of the prompt: the system instructions or the initial user intent.

By the time I noticed the cost spike, we were frequently sending 30,000+ tokens per request, where 20,000 of those tokens were just "noise" from previous turns in the conversation that the model no longer needed. We were essentially paying Google to ignore data. I realized that to optimize costs, I needed to treat the context window as a prioritized buffer, not a raw string.

How to Count Tokens in Go for Accurate Cost Estimation

Local tokenization in Go requires specialized libraries to match the provider's encoding and prevent expensive API retries. To fix the spend, I needed to count tokens locally before the request ever left my VPC. If I could accurately predict the token count, I could make intelligent decisions about what to prune. However, Go doesn't have a native "countTokens" function in the standard library. Most LLM providers use Byte Pair Encoding (BPE), and while libraries like tiktoken exist for Python, the Go ecosystem is a bit more fragmented.

I initially tried calling an external microservice just for tokenization, but the latency hit was unacceptable. Every millisecond counts in a real-time chat interface. I eventually settled on using a Go port of the OpenAI tiktoken library, which allowed me to perform BPE encoding in-process. But even then, I ran into performance issues. Encoding a 50,000-word history on every request started putting pressure on the CPU, which led to a different kind of cost: horizontal scaling of our Cloud Run instances.
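The fix for the CPU pressure was to tokenize each message once and cache the count, rather than re-encoding the full history on every request. Below is a sketch of that idea with an injectable counting function. In production this was backed by the BPE encoder from the tiktoken port; the crude characters-divided-by-four estimator here is a stand-in so the example stays self-contained:

```go
package main

import (
    "fmt"
    "sync"
    "unicode/utf8"
)

// CountFunc abstracts the actual tokenizer (e.g. a BPE encoder).
type CountFunc func(text string) int

// estimateTokens is a rough stand-in: English text averages
// roughly 4 characters per token. Good enough for budgeting,
// not for exact billing.
func estimateTokens(text string) int {
    n := utf8.RuneCountInString(text)
    return (n + 3) / 4
}

// CachedCounter memoizes token counts so each message is
// tokenized at most once, even under concurrent access.
type CachedCounter struct {
    count CountFunc
    cache sync.Map // message text -> token count
}

func NewCachedCounter(fn CountFunc) *CachedCounter {
    return &CachedCounter{count: fn}
}

func (c *CachedCounter) Tokens(text string) int {
    if v, ok := c.cache.Load(text); ok {
        return v.(int)
    }
    n := c.count(text)
    c.cache.Store(text, n)
    return n
}

func main() {
    counter := NewCachedCounter(estimateTokens)
    fmt.Println(counter.Tokens("How do I reset my password?")) // first call: computed
    fmt.Println(counter.Tokens("How do I reset my password?")) // second call: cached
}
```

Swapping `estimateTokens` for a real BPE counter changes nothing about the caller; the cache is what keeps repeated requests cheap.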

I had to find a balance between precision and performance. This is where I started applying lessons from my previous work on debugging Go concurrency. I needed a way to handle these token calculations concurrently without blocking the main request flow, while ensuring that my token buffer was thread-safe.

How to Implement a Token-Aware Sliding Window

A prioritized token buffer ensures that critical system instructions are always preserved while sliding the conversation window to stay within budget. The core of my solution was a TokenBuffer struct. Instead of treating the conversation as one big string, I broke it down into discrete messages. Each message was tokenized once and cached. When preparing a request, the buffer would start from the most recent message and work backward until it hit a hard token limit. Crucially, it would always preserve the "System Prompt" at the very beginning.

Here is a simplified version of the logic I implemented:

import "sync"

type Message struct {
    Role    string
    Content string
    Tokens  int
}

type TokenBuffer struct {
    mu           sync.RWMutex
    Messages     []Message
    MaxTokens    int
    SystemPrompt Message
}

func (b *TokenBuffer) GetPayload() []Message {
    b.mu.RLock()
    defer b.mu.RUnlock()

    var payload []Message
    currentCount := b.SystemPrompt.Tokens
    
    // Always include the system prompt
    payload = append(payload, b.SystemPrompt)

    // Work backwards from the most recent messages
    var recentMessages []Message
    for i := len(b.Messages) - 1; i >= 0; i-- {
        if currentCount+b.Messages[i].Tokens > b.MaxTokens {
            break
        }
        currentCount += b.Messages[i].Tokens
        recentMessages = append([]Message{b.Messages[i]}, recentMessages...)
    }

    return append(payload, recentMessages...)
}

This approach ensured that we never exceeded the MaxTokens limit I set (which I tuned to be about 70% of the model's actual limit to leave room for the response). By keeping the system prompt static and only sliding the "recent" window, the model always knew its identity and instructions, but it didn't get bogged down by chat history from twenty minutes ago.

One challenge I faced here was managing the memory footprint of these buffers. When you have thousands of concurrent users, keeping a list of tokenized messages in memory can lead to the same kind of resource exhaustion I wrote about in my post on fixing PostgreSQL connection spikes. I had to implement a TTL (Time-To-Live) for these buffers and offload older conversations to a Redis cache to keep the Go heap manageable.

How Semantic Pruning Further Reduces Token Usage

Removing low-value filler words through semantic pruning allows for more relevant context within the same token limit. Even with the sliding window, I felt I could do better. Some messages in a conversation are high-value (like a user's specific question), while others are low-value (like "Thanks!" or "Okay"). I experimented with a "semantic pruning" layer. Before adding a message to the TokenBuffer, I used a lightweight regex and keyword check to identify "filler" messages. If a message was determined to be low-value and the buffer was reaching 80% capacity, I would drop the filler message entirely rather than dropping an older, potentially more relevant message.
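The filler check itself was nothing fancy. A sketch of the idea follows; the word list in the pattern is illustrative and should be tuned to your own traffic:

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// fillerPattern matches short acknowledgements that carry no context.
// The word list here is illustrative, not a production list.
var fillerPattern = regexp.MustCompile(`^(thanks|thank you|ok|okay|cool|got it|great)[.!]*$`)

// isFiller flags messages that are safe to drop when the buffer
// is approaching its token budget.
func isFiller(msg string) bool {
    normalized := strings.ToLower(strings.TrimSpace(msg))
    return fillerPattern.MatchString(normalized)
}

func main() {
    fmt.Println(isFiller("Thanks!"))                     // true
    fmt.Println(isFiller("Okay"))                        // true
    fmt.Println(isFiller("How do I rotate my API key?")) // false
}
```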

I also started monitoring a "token efficiency ratio": tokens sent per request versus the relevance of the answer. We found that shrinking our context window from a loosely managed 128k tokens to a tightly managed 8k produced no measurable drop in accuracy for 95% of our use cases, while slashing our Vertex AI bill nearly in half.

The Impact of Model-Specific Tokenization

It's vital to remember that not all tokenizers are created equal. If you are using Google’s Gemini models, you should ideally use their Vertex-specific tokenization logic. You can find the official guidance on this in the Vertex AI Model Documentation. I found that using the OpenAI cl100k_base encoding (used for GPT-4) as a proxy for Gemini was "close enough" for estimation, usually within a 5-10% margin of error, but for production cost-capping, I eventually moved to a model-agnostic abstraction layer.

Benchmarking the Financial Impact of Token Optimization

Optimizing token management reduced average tokens per request by 75% and lowered monthly spend by 42%. After deploying the custom tokenization strategy, I tracked the metrics over a 14-day period. The results were immediate and significant. Here is the breakdown of what changed:

  • Average Tokens Per Request: Dropped from 24,500 to 6,200.
  • Request Latency: Decreased by 180ms (since the LLM had less text to process before starting generation).
  • Monthly Spend: Projected to drop from $4,200 to roughly $2,430.
  • Context Overflow Errors: Zero.

The most surprising outcome wasn't the money saved, but the latency improvement. The model has to ingest the entire prompt before it can emit its first token, so reducing the input size directly improves how fast the user sees a response. By being greedy with our token management, we actually made the product feel faster.

Key Takeaways for Managing LLM API Costs

Building this system taught me that AI Engineering is often just Resource Management with a different name. Here are the main lessons I took away from this optimization sprint to control LLM API costs:

  • Never trust string length: If you are building on top of LLMs, your internal units of measurement must be tokens. Period. Anything else is a guess that will cost you money.
  • The "Middle" is often useless: In long conversations, the beginning (instructions) and the end (immediate context) are critical. The middle is often where you can prune most aggressively.
  • Pre-calculation is cheaper than API calls: Spending a few CPU cycles in your Go backend to tokenize and trim a prompt is orders of magnitude cheaper than sending those extra tokens to a provider like GCP or OpenAI.
  • State management is the bottleneck: As soon as you start managing token buffers, you have to worry about concurrency and memory. Use sync.RWMutex for your buffers and be careful with long-lived objects in the heap.
  • Monitor your billing as a unit test: I now treat our GCP billing export as a performance metric. If the cost per 1,000 requests spikes, it’s treated as a regression, just like a latency spike or a bug.

What's Next for Our Token Strategy

Moving forward, I’m looking into implementing dynamic context windows that adjust based on the current cost of the spot instances we use for our inference workers. There is also a lot of potential in using "small" models to summarize the chat history before passing it to the "large" model, effectively compressing the tokens even further. For now, simply being aware of our token usage and managing it proactively in the Go layer has turned our AI features from a financial liability into a sustainable part of our infrastructure. If you're still using len(string) to manage your prompts, it's time to refactor and optimize your LLM API costs.
