Building a Real-time LLM API Cost Dashboard with OpenTelemetry and Grafana

I still remember the pit in my stomach. It was early March 2026, and I was reviewing our cloud billing dashboard. Our LLM API costs, which I had optimistically projected to be a manageable $500 for the month, were already pushing past $3,000 – and we were only halfway through. My heart sank. What was going on? Had a new feature gone rogue? Was there an unexpected traffic surge? The standard billing dashboards, updated daily at best, offered no immediate answers. I needed real-time data, and I needed it yesterday. This wasn't just about saving money; it was about understanding our system's behavior and preventing future financial shocks. This incident kicked off an urgent project: building a real-time LLM API cost dashboard for AutoBlogger.

The Problem: Lack of Granular, Real-time Cost Visibility

Our initial LLM integration was straightforward. We used OpenAI's API for various tasks – content generation, summarization, keyword extraction. We had implemented basic logging for debugging, but no dedicated metrics for token usage or cost. We relied on the provider's billing portal, which, while accurate, lacked the immediacy and granularity required for proactive cost management. When a cost spike occurred, pinpointing the exact cause – which model, which feature, which part of the day – was like searching for a needle in a haystack.

The core issue was simple: LLM costs are directly tied to token usage. Without real-time token tracking, we were flying blind. I needed to answer questions like:

  • What is our current spend rate per hour/minute?
  • Which LLM model is consuming the most tokens?
  • Are input or output tokens driving the cost?
  • How do specific features impact our LLM expenditure?

This wasn't just a "nice-to-have" for me; it was a critical operational requirement. The existing billing portal was like looking at a bank statement days after the transactions occurred. I needed a live ledger.

Initial Brainstorming & Why It Wasn't Enough

My first thought was to just dump all LLM request and response data into our existing log aggregator (Elasticsearch) and then build dashboards on top of that. While technically feasible, it felt heavy. Parsing structured logs for metrics is inefficient, and the sheer volume of data would lead to increased logging costs and slower dashboard queries. Plus, it wouldn't give me proper metric types (counters, gauges) for easy aggregation and rate calculation.

Next, I considered a dedicated microservice that would intercept all LLM calls, calculate costs, and push them to a simple Redis counter. This was better for real-time aggregation but lacked historical context and the rich querying capabilities of a dedicated time-series database. I also didn't want to introduce another critical service just for this. I needed something robust, scalable, and ideally, leveraging existing observability infrastructure.

The Solution: OpenTelemetry, Prometheus, and Grafana

After evaluating several options, I landed on a stack that felt right: OpenTelemetry for instrumentation, Prometheus for metric collection and storage, and Grafana for visualization. This combination offered the real-time capabilities, scalability, and flexibility I needed, integrating well with our existing monitoring ecosystem.

Step 1: Instrumenting LLM API Calls with OpenTelemetry

The first critical step was to capture token usage at the source. We primarily use Python for our LLM interactions, so I decided to create a simple decorator that would wrap our LLM API calls. This decorator would:

  1. Execute the original LLM API call.
  2. Extract the input and output token counts from the API response.
  3. Determine the model used.
  4. Calculate the approximate cost based on our current pricing strategy (more on this later).
  5. Record these as OpenTelemetry metrics.

Here's a simplified Python example of how I instrumented our LLM calls:


import functools
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# --- OpenTelemetry setup (simplified for brevity) ---
# In a real app, this would be configured globally and more robustly.
# You could also swap in a PeriodicExportingMetricReader with an OTLP exporter.
reader = PrometheusMetricReader()  # Bridges OTel metrics into the prometheus_client registry
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
start_http_server(8000)  # Exposes /metrics on :8000 for Prometheus to scrape
meter = metrics.get_meter("llm.cost.monitor")

# Define counters for tokens and cost
llm_input_tokens_counter = meter.create_counter(
    name="llm_api_input_tokens_total",
    description="Total input tokens processed by LLM API calls",
    unit="tokens"
)
llm_output_tokens_counter = meter.create_counter(
    name="llm_api_output_tokens_total",
    description="Total output tokens generated by LLM API calls",
    unit="tokens"
)
llm_api_cost_usd_counter = meter.create_counter(
    name="llm_api_cost_usd_total",
    description="Total estimated cost of LLM API calls in USD",
    unit="USD"
)

# --- LLM Pricing Map (example values - fetch from config/DB in prod) ---
# This is a simplified example; real pricing is more complex (e.g., per 1K tokens)
# and varies by model and often by input vs. output.
LLM_PRICING = {
    "gpt-4-turbo": {
        "input_cost_per_token": 0.00001,  # Example: $10 / 1M tokens
        "output_cost_per_token": 0.00003, # Example: $30 / 1M tokens
    },
    "gpt-3.5-turbo": {
        "input_cost_per_token": 0.0000005, # Example: $0.50 / 1M tokens
        "output_cost_per_token": 0.0000015, # Example: $1.50 / 1M tokens
    },
    # Add other models as needed
}

def monitor_llm_cost(model_name: str):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Execute the original LLM API call; its errors propagate unchanged.
            response = await func(*args, **kwargs)
            try:
                # Extract token usage (this depends heavily on the LLM client library)
                # Example for the OpenAI API response structure
                prompt_tokens = response.usage.prompt_tokens
                completion_tokens = response.usage.completion_tokens

                # Get pricing for the specific model
                pricing = LLM_PRICING.get(model_name)
                if not pricing:
                    print(f"Warning: pricing not found for model: {model_name}")
                    return response

                input_cost = prompt_tokens * pricing["input_cost_per_token"]
                output_cost = completion_tokens * pricing["output_cost_per_token"]
                total_cost = input_cost + output_cost

                # Record metrics with relevant attributes (labels)
                attributes = {
                    "llm.model": model_name,
                    "llm.api.provider": "openai",  # Or dynamic if using multiple providers
                    "llm.api.operation": func.__name__,  # e.g. "generate_content", "summarize_text"
                }

                llm_input_tokens_counter.add(prompt_tokens, attributes)
                llm_output_tokens_counter.add(completion_tokens, attributes)
                llm_api_cost_usd_counter.add(total_cost, attributes)
            except Exception as e:
                # A monitoring failure should never break the original call,
                # so log it and return the response anyway.
                print(f"Error monitoring LLM cost: {e}")
            return response
        return wrapper
    return decorator

# --- Example usage ---
# Imagine this is your LLM client wrapper function
class MyLLMClient:
    @monitor_llm_cost(model_name="gpt-4-turbo")
    async def generate_content(self, prompt: str):
        # In a real scenario, this would call the actual OpenAI API
        print(f"Generating content with gpt-4-turbo for prompt: {prompt[:50]}...")
        # Simulate an OpenAI API response object
        class Usage:
            prompt_tokens = int(len(prompt.split()) * 1.3)  # Simulated token count
            completion_tokens = 50  # Simulated
        class Response:
            usage = Usage()
            content = "Generated text for: " + prompt
        await asyncio.sleep(0.1) # Simulate network latency
        return Response()

    @monitor_llm_cost(model_name="gpt-3.5-turbo")
    async def summarize_text(self, text: str):
        print(f"Summarizing text with gpt-3.5-turbo for text: {text[:50]}...")
        class Usage:
            prompt_tokens = int(len(text.split()) * 1.2)  # Simulated token count
            completion_tokens = 20  # Simulated
        class Response:
            usage = Usage()
            content = "Summarized text for: " + text
        await asyncio.sleep(0.05)
        return Response()

import asyncio

async def main():
    client = MyLLMClient()
    await client.generate_content("This is a very long prompt that will generate a lot of tokens and therefore cost us money. We need to be careful about how we use these models.")
    await client.summarize_text("Short text to summarize.")
    await client.generate_content("Another prompt for content generation.")
    await asyncio.sleep(5) # Give Prometheus time to scrape

if __name__ == "__main__":
    asyncio.run(main())

A crucial detail here is the LLM_PRICING map. In a production system, this should ideally be loaded from a configuration service or a database, allowing for dynamic updates as pricing changes. I even considered pulling it directly from the provider's pricing API, but decided against it for stability – I prefer to control when pricing updates affect my cost calculations, especially given the nuances of input vs. output token costs and different models. Speaking of optimizing costs, I’ve found that strategic model selection for different tasks can make a huge difference, which is why having this granular visibility is so important.
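As a sketch of that "load pricing from config" idea, here's a small store that re-reads a JSON file on a fixed interval, so a price change doesn't require a redeploy. The file path, schema, and refresh interval are all illustrative assumptions, not our production setup:

```python
import json
import threading
import time


class PricingStore:
    """Serves per-token LLM pricing from a JSON file, re-reading it on a
    fixed interval. Path and schema here are illustrative assumptions."""

    def __init__(self, path: str = "llm_pricing.json", refresh_seconds: int = 300):
        self._path = path
        self._refresh = refresh_seconds
        self._lock = threading.Lock()
        self._loaded_at = float("-inf")  # Forces a load on first access
        self._pricing: dict = {}

    def get(self, model_name: str):
        with self._lock:
            now = time.monotonic()
            if now - self._loaded_at > self._refresh:
                try:
                    with open(self._path) as f:
                        self._pricing = json.load(f)
                except OSError as e:
                    # Keep serving the last known prices if the file is unreadable
                    print(f"Warning: could not reload pricing: {e}")
                self._loaded_at = now
            return self._pricing.get(model_name)
```

The decorator would then call `store.get(model_name)` instead of reading a module-level `LLM_PRICING` dict.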

Step 2: Metric Collection with Prometheus

With OpenTelemetry metrics being emitted, the next step was to collect them. Since I opted for the PrometheusMetricReader in my OpenTelemetry setup, my application now exposes a Prometheus-compatible metrics endpoint (typically /metrics). Our existing Prometheus server was configured to scrape this endpoint at regular intervals (e.g., every 15 seconds).

The Prometheus configuration was straightforward:


# prometheus.yml
scrape_configs:
  - job_name: 'autoblogger-llm-service'
    # metrics_path defaults to /metrics
    # scheme defaults to http
    static_configs:
      - targets: ['localhost:8000'] # Replace with your service's actual host:port

Prometheus then stores these time-series metrics efficiently, ready for querying.
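With the counters in Prometheus, it's also natural to alert on spend rate before it becomes a billing surprise. Here's an illustrative alerting rule; the threshold and labels are assumptions, not our production values:

```yaml
# llm_cost_alerts.yml (referenced via rule_files in prometheus.yml)
groups:
  - name: llm-cost
    rules:
      - alert: LLMSpendRateHigh
        # Fires if projected daily spend (per-second rate * 86400) exceeds $200
        expr: sum(rate(llm_api_cost_usd_total[15m])) * 86400 > 200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM spend rate implies more than $200/day"
```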

Step 3: Visualizing with Grafana

Once Prometheus was collecting the data, building the Grafana dashboard was relatively simple. I created several panels to visualize different aspects of our LLM costs:

Total Daily Cost

This panel shows the aggregated daily cost, giving me a high-level overview of our expenditure trends.


sum by (llm_api_provider) (increase(llm_api_cost_usd_total[24h]))

This PromQL query calculates the sum of the increase in llm_api_cost_usd_total over the last 24 hours, grouped by the API provider. This gives me a clear daily spend.

Cost by Model and Direction (Input vs. Output)

This was crucial for identifying which models and token types (input/output) were driving costs. I created separate graphs for input and output tokens, broken down by model.


# Input Tokens by Model
sum by (llm_model) (rate(llm_api_input_tokens_total[5m]))

# Output Tokens by Model
sum by (llm_model) (rate(llm_api_output_tokens_total[5m]))

# Cost by Model (hourly rate)
sum by (llm_model) (rate(llm_api_cost_usd_total[1h]))

The rate() function in PromQL is invaluable for showing the average per-second rate of increase of a counter over a specified time range, which I then sum up to get the total rate for each model. This immediately highlighted that our gpt-4-turbo output tokens were the primary culprit during the initial cost spike, pointing to an inefficient prompt engineering strategy for a specific feature.
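Building on the same rate() primitive, I also keep a "projected daily spend" stat panel: multiplying the per-second cost rate by 86,400 seconds turns it into an at-this-pace daily figure.

```
# Projected daily spend (USD) if the current hourly rate continues
sum(rate(llm_api_cost_usd_total[1h])) * 86400
```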

Cost Per Feature/Operation

By adding the llm.api.operation attribute in my OpenTelemetry metrics, I could also visualize costs per specific feature or function within our application. This was a game-changer for understanding the financial impact of individual product capabilities.


sum by (llm_api_operation) (rate(llm_api_cost_usd_total[1h]))

Challenges and Iterations

Building this wasn't without its hurdles. Here are a few I encountered:

  1. Asynchronous Contexts: Integrating the decorator with our asynchronous Python application required careful handling of async/await. My example above reflects this.
  2. Token Counting Accuracy: Different LLM providers (and even different models from the same provider) have slightly different ways of counting tokens. My initial implementation assumed a universal method, but I quickly learned the importance of using the token counts directly from the API response to ensure accuracy. If the API doesn't return token counts, using a tokenizer library (like tiktoken for OpenAI models) is the next best thing, but it's an approximation.
  3. Pricing Model Complexity: LLM pricing is rarely a simple "X dollars per token." It's often "X dollars per 1,000 input tokens" and "Y dollars per 1,000 output tokens," with different tiers and regional variations. My LLM_PRICING map evolved to reflect this complexity. It's a constant battle to keep these values up-to-date, and any discrepancy means the dashboard shows "estimated" costs, not exact ones.
  4. High Cardinality: Initially, I considered adding attributes like user_id or session_id to the metrics. While tempting for deep debugging, this would lead to very high cardinality in Prometheus, potentially impacting its performance and storage. I decided to stick to higher-level attributes like llm.model, llm.api.provider, and llm.api.operation, which provided enough granularity for cost analysis without overwhelming the monitoring system. For specific user-level debugging, I still rely on our detailed application logs.
  5. Performance Impact of Instrumentation: While OpenTelemetry is designed to be lightweight, adding instrumentation to every LLM call could theoretically introduce overhead. I conducted thorough load testing to ensure the decorator didn't significantly impact latency or throughput. Fortunately, the impact was negligible for our use case.

This dashboard also highlighted the importance of robust retry mechanisms and circuit breakers for LLM workflows. If our LLM provider goes down or rate limits us, those costs aren't incurred, but our application's performance suffers. My colleagues recently shared insights on building resilient LLM workflows, which became even more relevant once we had this real-time cost visibility.

What I Learned / The Challenge

The most profound lesson from this experience was the absolute necessity of real-time, granular observability for any critical, cost-sensitive third-party API integration, especially with LLMs. Relying solely on delayed billing dashboards is a recipe for unpleasant surprises.

The challenge wasn't just about implementing a technical solution; it was about shifting our mindset. We moved from a reactive "check the bill at the end of the month" approach to a proactive "monitor spend continuously" strategy. This dashboard became an invaluable tool for:

  • Early Anomaly Detection: I can now spot unusual spikes within minutes, not days.
  • Optimizing Resource Usage: It directly informed our decisions on which models to use for which tasks and where to focus our prompt engineering efforts.
  • Capacity Planning: Understanding our token consumption trends helps us project future costs and plan for budget allocations.
  • Feature Cost Analysis: We can now attribute LLM costs to specific product features, providing critical data for product managers and business stakeholders.

It transformed LLM API costs from a nebulous, backend expense into a transparent, actionable metric.

Looking Forward

While the current dashboard provides excellent real-time visibility, my next steps involve enhancing its capabilities further. I'm looking into integrating actual billing data from our cloud provider to reconcile our estimated costs with the true invoice, identifying any discrepancies. I also want to explore more sophisticated alerting mechanisms – not just on total spend, but on rates of increase, cost per user, or even deviations from historical patterns. Ultimately, the goal is to fully automate cost governance, perhaps even implementing automated rate limiting or model downgrades if certain budget thresholds are unexpectedly crossed. The journey to fully optimized LLM operations is continuous, but with this dashboard, I finally feel like I have the steering wheel in my hands.
