GCP Custom Metrics for AI Agent Anomaly Detection

GCP custom metrics allow developers to monitor application-specific logic beyond standard infrastructure health by using OpenTelemetry to export business-level data to Cloud Monitoring. This approach enables proactive anomaly detection for AI agents and web scrapers by tracking "silent failures," such as empty LLM responses or CAPTCHA blocks, that still return HTTP 200 status codes.

It was 3:14 AM on a Tuesday when my phone started screaming. I was expecting a catastrophic infrastructure failure—maybe a Cloud Run service crashing or a database hitting 100% CPU. Instead, I looked at the dashboard and saw a flatline. Not a flatline of activity, but a flatline of 200 OK responses. On the surface, my AI-powered automation system was "healthy." In reality, my Gemini-integrated agents were returning empty strings because of an unhandled rate-limit edge case, and my web scrapers were getting blocked by a new Cloudflare challenge that standard status codes weren't catching.

I spent four hours that morning digging through logs. My standard GCP metrics—CPU utilization, memory usage, and request count—were useless. They told me the service was running, but they couldn't tell me the quality of the work being done. I realized then that relying on generic infrastructure metrics is a recipe for silent failure, especially when building complex AI pipelines. I needed custom metrics that understood my application's business logic. Since then, I’ve moved entirely to a proactive anomaly detection setup using OpenTelemetry and GCP Cloud Monitoring. Here is how I built it and why I’ll never go back to basic monitoring.

Why Standard Monitoring Fails to Detect AI Agent Silent Failures

Standard infrastructure metrics often fail to capture the quality of AI outputs, leading to "silent failures" where services appear healthy but return useless data. When I wrote about Building AI Agents with Gemini API FastAPI Webhooks, I focused heavily on the logic of the agent itself. What I didn't realize at the time was how fragile these agents are in a production environment. An LLM might start hallucinating, or the response latency might creep up from 2 seconds to 15 seconds. Standard GCP metrics won't trigger an alert for a 15-second response if your timeout is set to 30 seconds, but your users will certainly feel it.

Similarly, when managing a fleet of scrapers, as I detailed in my post on Building a Scalable Web Scraper with Python Playwright and Cloud Run, a "Success" (HTTP 200) doesn't always mean success. If the target site returns a "Please verify you are human" page with a 200 status code, your scraper thinks it succeeded. Your Cloud Run instance shows healthy CPU and low memory. But your data pipeline is effectively dead.

This is the "Visibility Gap." To bridge it, I started instrumenting my Python code to export custom metrics that track what actually matters:

  • Gemini API Token Usage (to prevent budget overruns).
  • LLM Response Content Length (to detect empty or truncated responses).
  • Scraper "Real" Success Rate (detecting CAPTCHAs vs. actual data).
  • Processing Latency per Step (not just the total request time).
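To make the "real" success rate concrete, here is a minimal sketch of how a scrape result could be classified before it is counted. The function name classify_scrape_result, the marker phrases, and the 500-character minimum are hypothetical choices for illustration, not something taken from my production scrapers. It is a heuristic, so expect to tune it per target site.

```python
# Hypothetical marker phrases; tune these for the sites you actually scrape.
CHALLENGE_MARKERS = (
    "verify you are human",
    "captcha",
    "checking your browser",
)


def classify_scrape_result(status_code: int, body: str, min_length: int = 500) -> str:
    """Return a low-cardinality outcome label for one scrape attempt."""
    if status_code != 200:
        return "http_error"
    lowered = body.lower()
    # A 200 response that contains a challenge phrase is a block, not a success.
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "blocked"
    # Suspiciously small pages usually mean the real content never rendered.
    if len(body) < min_length:
        return "empty_page"
    return "success"
```

The returned label is a natural fit for a metric attribute, because it only ever takes one of four values.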

How to Configure OpenTelemetry for GCP Custom Metrics in FastAPI

I chose OpenTelemetry (OTel) because it’s vendor-neutral: the instrumentation stays portable while the exporter integrates with Google Cloud Monitoring. If I ever decide to migrate from GCP to another provider, I don't want to rewrite my instrumentation logic. In my FastAPI backend, I use the opentelemetry-sdk and opentelemetry-exporter-gcp-monitoring packages to push data directly to Google Cloud.

Here is the core setup I use for my AI agents. I initialize a meter and create specific instruments for tracking agent performance.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Initialize the GCP Exporter
exporter = CloudMonitoringMetricsExporter(project_id="my-techfrontier-project")
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

# Create a meter for the AI Agent
meter = metrics.get_meter("ai_agent_metrics")

# Define custom instruments
token_counter = meter.create_counter(
    name="gemini_token_usage",
    description="Total tokens consumed by Gemini API",
    unit="1"
)

latency_histogram = meter.create_histogram(
    name="agent_inference_latency",
    description="Time taken for LLM to generate a response",
    unit="ms"
)

failure_counter = meter.create_counter(
    name="agent_logic_failures",
    description="Count of non-HTTP failures (e.g., empty responses, hallucinations)",
    unit="1"
)

The export_interval_millis=60000 is crucial. Exporting every second is overkill and will spike your GCP costs. Exporting every minute provides a good balance between "real-time" visibility and cost efficiency. I learned this the hard way when my first bill for custom metrics was higher than my actual compute bill.
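The difference in write volume is easy to estimate with some back-of-the-envelope arithmetic. The sketch below only counts data points; it makes no assumption about current pricing, and the figure of 50 time series is a made-up example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(export_interval_seconds: int, time_series: int = 1) -> int:
    """Data points written per day at a fixed export interval."""
    return (SECONDS_PER_DAY // export_interval_seconds) * time_series


# Hypothetical service exposing 50 distinct time series.
every_second = points_per_day(1, time_series=50)
every_minute = points_per_day(60, time_series=50)
ratio = every_second // every_minute
```

With these numbers, a one-second interval writes 4,320,000 points per day against 72,000 for a one-minute interval, a 60x difference before any dashboard gets more useful.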

Instrumenting Python Logic to Capture AI Performance Data

Real-time instrumentation allows you to correlate LLM token usage with response quality to identify prompt engineering issues or safety filter triggers. Now, instead of just logging errors, I record these metrics within my FastAPI routes. This allows me to see patterns that logs often obscure. For example, if token usage spikes but response length drops, I know my agent is stuck in a loop or the prompt is being rejected by the safety filters.

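import time

from fastapi import FastAPI, HTTPException

app = FastAPI()

# AgentRequest (assumed here to be a Pydantic model with a `prompt` field) and
# call_gemini_api are defined elsewhere in the service.


@app.post("/v1/agent/process")
async def process_agent_request(request: AgentRequest):
    # perf_counter is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()
    try:
        response = await call_gemini_api(request.prompt)

        # Record token usage
        token_counter.add(response.usage_metadata.total_token_count, {"model": "gemini-1.5-pro"})

        # Record latency in milliseconds
        duration = (time.perf_counter() - start_time) * 1000
        latency_histogram.record(duration, {"status": "success"})

        text = response.text or ""
        if len(text) < 10:
            # This is a silent failure - the API worked but the output is useless
            failure_counter.add(1, {"reason": "short_response"})

        return {"output": text}

    except Exception as e:
        # Count the failure and keep its latency, so slow errors stay visible
        failure_counter.add(1, {"reason": "exception"})
        duration = (time.perf_counter() - start_time) * 1000
        latency_histogram.record(duration, {"status": "error"})
        raise HTTPException(status_code=500, detail=str(e))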

By adding attributes like {"reason": "short_response"}, I can create breakdowns in the GCP Monitoring dashboard. I can see exactly why things are failing without grepping through thousands of lines of Cloud Logging output.
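To keep those breakdowns clean, it helps to derive the reason from a small fixed vocabulary instead of free-form strings. The helper below is a sketch under assumptions: failure_reason is a hypothetical name, and it assumes the caller has already pulled the response text and the finish reason (as a plain string such as "STOP", "SAFETY", or "MAX_TOKENS") out of the Gemini response.

```python
from typing import Optional


def failure_reason(
    text: Optional[str],
    finish_reason: str = "STOP",
    min_chars: int = 10,
) -> Optional[str]:
    """Map one LLM response to a fixed failure label, or None if it looks usable."""
    if finish_reason == "SAFETY":
        return "safety_filter_trigger"
    if finish_reason == "MAX_TOKENS":
        return "truncated"
    stripped = (text or "").strip()
    if not stripped:
        return "empty_response"
    if len(stripped) < min_chars:
        return "short_response"
    return None
```

In the route, the result would feed the counter only when it is not None, for example failure_counter.add(1, {"reason": reason}). Because the function can return only four labels, the reason attribute stays low-cardinality.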

How to Create Proactive Anomaly Alerts Using GCP Monitoring Query Language

Monitoring Query Language (MQL) allows you to move beyond simple thresholds and alert on failure rates or statistical anomalies in agent performance. Having the data is only half the battle. The real magic happens when you set up alerts on top of it. I don't want an alert when a single request fails; I want an alert when the rate of failure deviates from the norm.

In the Google Cloud Console, I navigate to Monitoring > Alerting and create a new policy using MQL. What I ultimately want is to compare the last 5 minutes of data to the previous hour and get a notification on Slack if the P99 latency jumps by more than 50%. The query I started with is simpler: it fires when the mean agent latency per instance exceeds 5 seconds. Adjust the monitored resource type and metric prefix to match what your exporter actually writes. A Cloud Run service does not report as gce_instance, and the OpenTelemetry exporter writes under workload.googleapis.com by default rather than custom.googleapis.com.

fetch gce_instance
| metric 'custom.googleapis.com/agent_inference_latency'
| align delta(1m)
| every 1m
| group_by [resource.instance_id], [value_latency_mean: mean(value.agent_inference_latency)]
| condition value_latency_mean > 5000 'ms'

That query is only a basic threshold alert. For true anomaly detection, I alert on percentage deviations from a baseline window instead. For my scrapers, I use a "Success Rate" ratio. If the ratio of scraper_success to scraper_total_attempts drops below 80% over a rolling 10-minute window, I know the target site has updated its bot detection.
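The two checks described in this section, a P99 jump relative to a baseline and a success-rate floor, are easier to reason about in plain Python before translating them into an alert policy. This is a sketch of the logic only, not a Cloud Monitoring query; the function names are hypothetical, and the 50% and 80% defaults simply mirror the numbers above.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_anomaly(
    recent_ms: Sequence[float],
    baseline_ms: Sequence[float],
    max_increase: float = 0.5,
) -> bool:
    """True when recent P99 exceeds baseline P99 by more than max_increase."""
    if not recent_ms or not baseline_ms:
        return False  # not enough data to judge either way
    baseline_p99 = percentile(baseline_ms, 99)
    return percentile(recent_ms, 99) > baseline_p99 * (1 + max_increase)


def success_rate_low(successes: int, attempts: int, floor: float = 0.8) -> bool:
    """True when the success ratio over the window falls below the floor."""
    if attempts == 0:
        return False  # no traffic is a separate alert, not a low success rate
    return successes / attempts < floor
```

Here recent_ms would hold the last 5 minutes of latencies and baseline_ms the previous hour. Comparing against a moving baseline is what separates this from a fixed 5-second threshold: a service that normally answers in 800ms gets flagged well before it reaches 5 seconds.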

For more details on the nuances of MQL, I highly recommend checking the official Google Cloud MQL Reference. It’s dense, but it’s the most powerful way to handle time-series data in GCP.

Managing the Costs and Performance of Custom Metrics in Google Cloud

Optimizing metric cardinality is essential to prevent unexpected billing spikes when scaling custom observability solutions in GCP. Custom metrics in GCP are not free. Google bills them by the volume of data ingested, and every additional time series adds to that volume. When I first started, I was creating a new time series for every single user_id. This was a massive mistake. If you have 10,000 users, that’s 10,000 unique time series per metric, and your bill will explode.

Rule of thumb: Keep your labels (attributes) low-cardinality. Use model_type, region, or error_code. Never use user_id, request_id, or timestamp as a metric label. If you need that level of granularity, that’s what Cloud Trace and Cloud Logging are for.
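One way to enforce that rule in code is an allowlist applied to attributes before they reach any instrument, plus a quick worst-case estimate of how many time series a label set can produce. Both helpers below are hypothetical sketches; the allowlist just reuses the label names suggested above.

```python
from math import prod
from typing import Dict, Mapping

# Labels considered safe to attach to metrics (low, bounded cardinality).
ALLOWED_LABELS = {"model_type", "region", "error_code"}


def safe_attributes(attributes: Mapping[str, str]) -> Dict[str, str]:
    """Drop any attribute that is not on the allowlist."""
    return {key: value for key, value in attributes.items() if key in ALLOWED_LABELS}


def worst_case_series(distinct_values: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(distinct_values.values())
```

Three models, four regions, and ten error codes give at most 120 series per metric. Swap in a user_id label with 10,000 values next to those three models and the bound jumps to 30,000.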

Performance Impact

I was worried that adding instrumentation would slow down my FastAPI service. However, recording a measurement only updates an in-memory aggregation, and the PeriodicExportingMetricReader batches the data and exports it from a background thread. In my benchmarks, the overhead was less than 2ms per request, which is negligible compared to the 1000ms+ latency of an LLM call.

The "Aha!" Moment

Three weeks after implementing this, I saw a spike in agent_logic_failures with the reason safety_filter_trigger. The standard HTTP metrics showed everything was fine (200 OK), but my custom metrics revealed that a specific prompt template I was using for the agent was consistently triggering Gemini's safety filters, leading to empty responses. Because I had an alert on this specific custom metric, I fixed the prompt in 10 minutes. Without it, I might have gone weeks without realizing that 15% of my users were getting broken results.

Key Takeaways for Implementing GCP Custom Metrics Successfully

  • Standard metrics are for infrastructure; custom metrics are for business logic. CPU and RAM tell you if the server is breathing; GCP custom metrics tell you if it's actually working.
  • OpenTelemetry is the gold standard. Don't lock yourself into proprietary SDKs. Use OTel and export to GCP Monitoring for the best of both worlds.
  • Beware of Cardinality. Never use unique IDs as labels in your metrics. High cardinality will break your dashboard and your budget.
  • Alert on Ratios, not Thresholds. A flat failure count is less useful than a failure rate. Anomaly detection requires looking at the relationship between success and failure.
  • Silent failures are the most dangerous. Use custom metrics to catch the "200 OK" responses that are actually failures in disguise.

Further Resources on AI Agents and Scalable Scraping

Moving forward, I’m looking into integrating these metrics with Google Cloud's Managed Service for Prometheus. As my fleet of agents grows, having a more robust query language and the ability to use Grafana for visualization might outweigh the simplicity of Cloud Monitoring dashboards. But for now, these custom OTel metrics are the only thing keeping me from waking up to another 3 AM surprise. If you aren't tracking your business-level failures yet, start today—before your users do the tracking for you.
