GCP Monitoring and Alerting: Building a Production-Grade Pipeline

A production-grade GCP Monitoring and Alerting pipeline requires integrating OpenTelemetry for custom metrics, using Monitoring Query Language (MQL) for trend-based alerts, and organizing dashboards around Service Level Objectives (SLOs). This approach ensures early detection of issues like memory leaks and CPU spikes before they impact end-users.

At 3:14 AM last Tuesday, my production Go service running on Google Cloud Run died. It didn't just crash; it entered a death spiral where the container would start, process exactly three requests, and then trigger an Out-Of-Memory (OOM) kill. Because I hadn't properly configured my alerting, the only reason I caught it was a random manual check of the billing console where I noticed a spike in "vCPU Seconds." By the time I logged in, the service had restarted 1,400 times in four hours, and my error logs were a useless soup of "context canceled" messages.

The problem wasn't the code—well, it was, I had a slice growing indefinitely—but the real failure was my monitoring. I was relying on the default "Health Check" and the basic "Container CPU Utilization" dashboard. These are vanity metrics. They tell you the heart is beating, but they don't tell you the patient is bleeding out. I spent the next 72 hours rebuilding my entire observability stack from the ground up using Google Cloud Monitoring (formerly Stackdriver) and OpenTelemetry. This is the setup I now use for every production app to ensure I never get caught off-guard again.

Why Default GCP Dashboards Fail in Production Environments

Default GCP dashboards often fail because they lack the granularity and context needed to detect rapid container crashes or specific memory growth patterns. The first problem is granularity and lag: default metrics are typically sampled at 1-minute intervals, so if your service spikes and crashes in 15 seconds, the dashboard might show a slight "bump" in memory usage rather than the vertical line to 100% that actually occurred.

Furthermore, standard metrics lack context. Knowing that memory is at 80% is useless if you don't know if that's 80% of a 512MB limit or a 32GB limit, or if it's "Cached" vs. "Anonymous" memory. In my case, the OOM killer was hitting because of "resident set size" (RSS) growth, but the default Cloud Run dashboard was averaging the memory usage across all instances, masking the failure of the specific crashing container.
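To make the averaging problem concrete, here is a minimal sketch in plain Go. The readings are made up for illustration (four healthy instances and one at its limit), and meanAndPeak is a hypothetical helper, not anything Cloud Monitoring provides.

```go
package main

import "fmt"

// meanAndPeak returns the fleet-wide average and the single worst value
// from a set of per-instance memory utilization readings (0.0 to 1.0).
func meanAndPeak(utilizations []float64) (mean, peak float64) {
	if len(utilizations) == 0 {
		return 0, 0
	}
	sum := 0.0
	for _, u := range utilizations {
		sum += u
		if u > peak {
			peak = u
		}
	}
	return sum / float64(len(utilizations)), peak
}

func main() {
	// Hypothetical readings: four healthy instances, one about to be OOM-killed.
	readings := []float64{0.25, 0.25, 0.25, 0.25, 1.0}
	mean, peak := meanAndPeak(readings)
	fmt.Printf("mean=%.2f peak=%.2f\n", mean, peak)
}
```

The mean lands around 40%, which looks perfectly healthy, while the peak shows one container already at its limit. That is why I chart the max (or a high percentile) per instance rather than the fleet average.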

I realized I needed three things: custom metrics exported directly from my Go runtime, Monitoring Query Language (MQL) alerts that look at rates of change rather than static thresholds, and a notification system that distinguishes between "The system is busy" and "The system is dying."

How to Export Custom Metrics Using OpenTelemetry in Go

Integrating the OpenTelemetry Go SDK allows you to export high-fidelity application metrics directly to Google Cloud, providing visibility into internal states like heap size and goroutine counts. To get real visibility, I stopped relying on the infrastructure to guess what my app was doing and had the app report on itself. That lets me track active goroutines, heap size, and custom business metrics like "LLM tokens processed per second."
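As a rough sketch of where those numbers come from, the Go standard library already exposes goroutine counts and heap statistics. The runtimeSnapshot type and takeSnapshot function below are hypothetical names for illustration; in a real service, values like these would be read inside a metric callback and handed to the exporter.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds the values a gauge callback could report.
type runtimeSnapshot struct {
	Goroutines int
	HeapAlloc  uint64 // bytes of allocated heap objects
	HeapSys    uint64 // bytes of heap memory obtained from the OS
}

// takeSnapshot reads the current goroutine count and heap statistics.
func takeSnapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world, so call it sparingly
	return runtimeSnapshot{
		Goroutines: runtime.NumGoroutine(),
		HeapAlloc:  m.HeapAlloc,
		HeapSys:    m.HeapSys,
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("goroutines=%d heap_alloc=%d heap_sys=%d\n", s.Goroutines, s.HeapAlloc, s.HeapSys)
}
```

Sampling these once per export interval is plenty. A steadily rising HeapAlloc or goroutine count between samples is exactly the sort of signal the infrastructure-level charts hide.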

If you've read my previous post on LLM cost optimization and context window management, you know how critical it is to track token usage. Without custom metrics, you're flying blind on costs until the bill arrives.

Here is the boilerplate setup I use to initialize the OTel exporter in a Go backend. I prefer using the Google Cloud monitoring exporter directly over a collector for smaller Cloud Run deployments to keep the footprint low.


import (
    "context"
    "log"
    "time"

    mexporter "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initMetrics wires up the Cloud Monitoring exporter and returns a function
// that flushes and shuts down the meter provider; defer it in main.
func initMetrics(projectID string) func() {
    ctx := context.Background()

    // Create the Cloud Monitoring exporter
    exporter, err := mexporter.New(mexporter.WithProjectID(projectID))
    if err != nil {
        log.Fatalf("failed to create Cloud Monitoring exporter: %v", err)
    }

    // Set up a reader that pushes metrics every 60 seconds
    // Note: every export counts toward billable metric ingestion, so don't make this too frequent!
    reader := metric.NewPeriodicReader(exporter, metric.WithInterval(60*time.Second))

    // Merge can fail when the two resources' schema URLs conflict, so check the error
    res, err := resource.Merge(
        resource.Default(),
        resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-backend-service"),
            semconv.ServiceVersionKey.String("1.2.0"),
        ),
    )
    if err != nil {
        log.Fatalf("failed to build resource: %v", err)
    }

    meterProvider := metric.NewMeterProvider(
        metric.WithReader(reader),
        metric.WithResource(res),
    )

    // Set global meter provider so otel.Meter(...) calls elsewhere use it
    otel.SetMeterProvider(meterProvider)

    return func() {
        if err := meterProvider.Shutdown(ctx); err != nil {
            log.Printf("Error shutting down meter provider: %v", err)
        }
    }
}

The key here is the WithInterval(60*time.Second). When I first set this up, I had it at 5 seconds because I wanted "real-time" data. That mistake cost me $140 in a single week. Google Cloud Monitoring bills custom metrics by ingested volume (roughly $0.258 per MiB at the first paid tier when I checked), and every data point counts toward it. If you have 50 instances each exporting 20 metrics every 5 seconds, the math gets ugly fast. Stick to 60 seconds for general health and use 10 seconds only for critical performance debugging.
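The volume difference is easy to underestimate, so here is a back-of-the-envelope sketch using the 50-instance, 20-metric example. samplesPerWeek is a hypothetical helper, and it only counts data points. It deliberately does not convert to dollars, since that depends on the value type and the current price sheet.

```go
package main

import "fmt"

// samplesPerWeek counts the data points written in seven days when every
// instance exports every metric once per interval.
func samplesPerWeek(instances, metricsPerInstance, intervalSeconds int) int64 {
	if intervalSeconds <= 0 {
		return 0
	}
	const secondsPerWeek = 7 * 24 * 60 * 60
	writesPerSeries := int64(secondsPerWeek / intervalSeconds)
	return int64(instances) * int64(metricsPerInstance) * writesPerSeries
}

func main() {
	for _, interval := range []int{5, 60} {
		fmt.Printf("%ds interval: %d samples/week\n", interval, samplesPerWeek(50, 20, interval))
	}
}
```

At a 5-second interval that works out to 120,960,000 data points a week, versus 10,080,000 at 60 seconds. That is a 12x difference in ingestion for data I rarely looked at.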

How to Use MQL for Proactive Memory Burn Rate Alerting

Monitoring Query Language (MQL) enables proactive alerting by analyzing the rate of change in metrics rather than relying on static, noisy thresholds. Once the metrics are flowing into GCP, you need to write Alerting Policies. Most people use the "Menu-driven" UI to create alerts like "If CPU > 80% for 5 minutes, alert me." This is a recipe for alert fatigue. If a service is doing a heavy batch job, it might sit at 90% for an hour quite safely.

I switched to using MQL. MQL is far more powerful because it allows for ratio-based alerting and trend analysis. For example, I don't care if my memory is high; I care if my memory is growing faster than a specific rate over a 10-minute window. This is a core component of a robust GCP Monitoring and Alerting strategy.

Here is an MQL query I wrote to detect "Memory Burn Rate." It calculates the rate of change of memory usage and alerts if the slope suggests we will OOM within the next 30 minutes.


fetch cloud_run_revision
| metric 'run.googleapis.com/container/memory/utilization'
| filter (resource.service_name == 'my-backend-service')
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| every 5m
| derivative
| condition val() > 0.05 '1/min'

This query doesn't look at the absolute value. It looks at the derivative (the rate of change). If utilization keeps climbing by more than 5 percentage points per minute, I get a Slack notification. This caught the memory leak in my staging environment two days after I implemented it, long before the container actually crashed. If you are struggling with CPU spikes instead of memory, I highly recommend checking out my guide on debugging Go CPU spikes with pprof to understand what's actually happening inside those cycles.
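To see why a slope threshold maps to "time until OOM," here is the arithmetic as a small Go sketch. minutesUntilOOM is a hypothetical helper that assumes memory keeps growing linearly, which is a simplification. Real leaks are often burstier.

```go
package main

import "fmt"

// minutesUntilOOM estimates how long until utilization reaches 1.0 if it
// keeps rising at slopePerMin (fraction of the memory limit per minute).
// It returns ok=false when memory is flat or shrinking.
func minutesUntilOOM(current, slopePerMin float64) (minutes float64, ok bool) {
	if slopePerMin <= 0 {
		return 0, false
	}
	if current >= 1.0 {
		return 0, true
	}
	return (1.0 - current) / slopePerMin, true
}

func main() {
	// Hypothetical reading: 50% utilization, rising at the 0.05/min alert threshold.
	if mins, ok := minutesUntilOOM(0.5, 0.05); ok {
		fmt.Printf("estimated minutes until OOM: %.1f\n", mins)
	}
}
```

At the 0.05 per minute threshold, even a container starting from an empty heap hits its limit in about 20 minutes, so any alert that fires at that slope leaves very little time to react. A lower threshold buys more warning at the cost of more noise.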

How to Reduce Alert Noise Using Incident Alignment and Logs

Setting longer alignment periods and using log-based metrics helps filter out transient network blips and focuses alerts on sustained service degradation. One of the biggest mistakes I made early on was setting the "Alignment Period" too short. In GCP Monitoring, the alignment period is the window of time the system looks at to "smooth" the data. If you set it to 1 minute, a momentary spike will trigger an alert. I now use a 5-minute or 10-minute window for almost everything.
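Here is a small sketch of what that smoothing does, in plain Go. alignMean and anyAbove are hypothetical helpers, the samples are invented one-per-minute readings, and the windows here are fixed and non-overlapping, which is a simplification of how the real aligners behave.

```go
package main

import "fmt"

// alignMean collapses raw samples into one mean value per window of
// `window` consecutive samples (the last window may be shorter).
func alignMean(samples []float64, window int) []float64 {
	if window <= 0 {
		return nil
	}
	var out []float64
	for i := 0; i < len(samples); i += window {
		end := i + window
		if end > len(samples) {
			end = len(samples) // final partial window
		}
		sum := 0.0
		for _, s := range samples[i:end] {
			sum += s
		}
		out = append(out, sum/float64(end-i))
	}
	return out
}

// anyAbove reports whether any aligned value breaches the threshold.
func anyAbove(values []float64, threshold float64) bool {
	for _, v := range values {
		if v > threshold {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical one-per-minute CPU readings with a single momentary spike.
	samples := []float64{0.4, 0.4, 0.95, 0.4, 0.4}
	fmt.Println("1-minute alignment fires:", anyAbove(alignMean(samples, 1), 0.8))
	fmt.Println("5-minute alignment fires:", anyAbove(alignMean(samples, 5), 0.8))
}
```

With 1-minute alignment the single spike crosses the 0.8 threshold on its own. Averaged over 5 minutes it comes out around 0.51 and stays quiet, which is the behavior I want for anything that isn't a sustained problem.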

I also learned to use Log-based Metrics for things that aren't easily captured by counters. For example, if my Go service logs "database connection refused," I want an alert. But I don't want an alert for one failure (network blips happen). I want an alert if the rate of those logs exceeds 5 per minute.

To do this, I go to Logging > Log-based Metrics and create a counter with a filter like:

textPayload:"database connection refused" OR textPayload:"EOF"

Then, I create an alerting policy on that metric using a "Rate" transformation. This separates "background noise" from "service outage."
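The logic behind "more than 5 per minute" can be sketched as a sliding-window count. exceedsRate is a hypothetical helper that works on timestamps (in seconds) of matching log entries. The alerting policy does the equivalent for you, so this is only meant to show why a single blip never fires.

```go
package main

import "fmt"

// exceedsRate reports whether any window of windowSecs seconds contains
// more than limit events. eventSecs must be sorted in ascending order.
func exceedsRate(eventSecs []int64, windowSecs int64, limit int) bool {
	if windowSecs <= 0 {
		return false
	}
	start := 0
	for end := range eventSecs {
		// Drop events that fell out of the window ending at eventSecs[end].
		for eventSecs[end]-eventSecs[start] >= windowSecs {
			start++
		}
		if end-start+1 > limit {
			return true
		}
	}
	return false
}

func main() {
	blip := []int64{10}                      // one refused connection
	outage := []int64{0, 5, 10, 15, 20, 25}  // six failures in 25 seconds
	fmt.Println("blip fires:", exceedsRate(blip, 60, 5))
	fmt.Println("outage fires:", exceedsRate(outage, 60, 5))
}
```

One failure, or even one failure a minute, stays below the limit. Six failures inside the same minute cross it, and that is the point at which I want to hear about it.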

How to Organize GCP Dashboards Using SLIs and SLOs

Effective GCP dashboards should be structured hierarchically, prioritizing Service Level Indicators (SLIs) and the Four Golden Signals to minimize time-to-detection. I don't look at my dashboards every day. If you have to look at a dashboard to know if your system is healthy, your monitoring has failed. Dashboards are for investigation, not notification.

I organize my GCP Dashboards into three tiers:

The Executive View (SLOs): This shows four charts: Request Latency (p99), Error Rate, Availability, and Saturation (my take on the "Four Golden Signals," with Availability standing in for Traffic). If these are green, I close the tab.
The Infrastructure View: This is where I track Cloud Run instance counts, cold start latency, and CPU/Memory throttles. This is where I go when the "Saturation" SLO turns red.
The Application View: This shows my custom OTel metrics—goroutine counts, DB connection pool stats, and cache hit ratios.

For a detailed look at how to build these, the official Google Cloud SLO documentation is actually surprisingly good, though it's a bit dense. The key takeaway I used was defining a "Burn Rate" for my error budget. If I'm allowed 0.1% errors per month, and I've used up 10% of that budget in the last hour, I need to be paged.
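The burn-rate arithmetic is worth spelling out. In this sketch, burnRate and hoursToExhaustion are hypothetical helpers, and I'm assuming a 30-day (720-hour) SLO period for "per month."

```go
package main

import "fmt"

// burnRate compares how fast the error budget is being spent against the
// pace that would use exactly 100% of it over the whole SLO period.
// A value of 1.0 means "on pace"; higher means the budget runs out early.
func burnRate(budgetFractionUsed, windowHours, periodHours float64) float64 {
	if windowHours <= 0 || periodHours <= 0 {
		return 0
	}
	return budgetFractionUsed / (windowHours / periodHours)
}

// hoursToExhaustion estimates how long the full budget lasts at the
// spending pace observed in the window.
func hoursToExhaustion(budgetFractionUsed, windowHours float64) float64 {
	if budgetFractionUsed <= 0 {
		return 0 // nothing spent in the window, so no estimate
	}
	return windowHours / budgetFractionUsed
}

func main() {
	// 10% of the monthly budget spent in the last hour, 30-day period assumed.
	fmt.Printf("burn rate: %.0fx\n", burnRate(0.10, 1, 720))
	fmt.Printf("budget gone in about %.0f hours\n", hoursToExhaustion(0.10, 1))
}
```

Spending 10% of a monthly budget in one hour is a burn rate of roughly 72x, and at that pace the whole budget is gone in about 10 hours. That is clearly page-worthy, whereas a burn rate hovering near 1x is just normal operation.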
