Debugging Go CPU Spikes with pprof in Production
Go CPU spikes are typically triggered by inefficient JSON serialization, catastrophic regex backtracking, or excessive memory allocations. Using the pprof tool allows developers to visualize these bottlenecks and implement fixes like sync.Pool or code-generated marshaling to restore system stability and reduce infrastructure costs.
It was 3:14 AM on a Tuesday when my phone vibrated off the nightstand. PagerDuty wasn't just notifying me of a minor hiccup; it was screaming that my primary Go-based API service on Google Cloud Run was hitting 95% CPU utilization across all instances. By 3:20 AM, the service had auto-scaled from its usual 5 instances to 40, and latency had ballooned from a crisp 180ms to a sluggish 2.5 seconds. My first thought was a DDoS attack, but the request volume was normal. Something was eating my cycles from the inside out.
In the world of Go, when your CPU decides to go into orbit without a corresponding increase in traffic, you usually have two culprits: a tight loop or an explosion of allocations causing the Garbage Collector (GC) to thrash. Looking at my GCP dashboard, I saw the "Billable Instance Time" metric trending toward a very expensive month. I needed to see exactly what the Go runtime was doing at that very second. This is where pprof moved from being a "nice-to-know" tool to my absolute lifeline.
I’ve spent years building systems, and I’ve learned that guessing where a bottleneck is ranks among the biggest wastes of engineering time. I could have spent hours auditing my recent commits, but instead I reached for the profiler. In this post, I’ll walk you through how I used net/http/pprof to catch a performance regression buried deep in our JSON serialization logic and a poorly optimized regex pattern.
How to Securely Expose pprof for Production Debugging
Including the net/http/pprof package in production binaries is the most effective way to diagnose real-time performance issues in a Go environment. The first mistake I see developers make is forgetting to include pprof in their production binaries, or worse, exposing it to the public internet. I always include the net/http/pprof package in my projects, but I keep it guarded. In this specific service, I had it configured to run on a separate internal port that isn't exposed via the Cloud Run ingress.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // Side-effect import registers handlers on DefaultServeMux
)

func main() {
	// Internal pprof server, kept off the public ingress
	go func() {
		log.Println("Starting pprof server on :6060")
		if err := http.ListenAndServe(":6060", nil); err != nil {
			log.Fatalf("pprof server failed: %v", err)
		}
	}()
	// ... main API server logic ...
}
Because Cloud Run instances are ephemeral and sit behind a load balancer, I couldn't just curl the instance directly. I had to use a bastion host within my VPC and the gcloud compute ssh tunnel to reach the internal port of one of the spiking instances. If you're running on GKE, a simple kubectl port-forward does the trick. Once I had a tunnel to port 6060, I was ready to pull the data that would tell me why my cloud bill was doubling every hour.
How to Capture a CPU Profile During a Live Performance Spike
A 30-second CPU profile provides the necessary data to identify which functions are consuming the most execution time during a service disruption. When a service is struggling, you want a CPU profile. A CPU profile works by sampling the call stack 100 times per second. It doesn't capture every single function call (that would be too expensive), but it gives you a statistically significant view of where the CPU is spending its time. I ran the following command to capture 30 seconds of data:
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
While the profiler was running, I felt that familiar anxiety. If the profile didn't show a clear winner, I was looking at a long night of rollbacks and binary searches through the git history. But after 30 seconds, the pprof interactive shell opened. I typed top10, and the culprit was staring me in the face.
(pprof) top10
Showing nodes accounting for 18.4s, 72.16% of 25.5s total
flat flat% sum% cum cum%
8.2s 32.1% 32.1% 8.2s 32.1% runtime.cgocall
4.1s 16.1% 48.2% 9.4s 36.8% runtime.mallocgc
2.2s 8.6% 56.8% 2.2s 8.6% syscall.syscall
1.5s 5.9% 62.7% 5.8s 22.7% encoding/json.(*encodeState).marshal
1.2s 4.7% 67.4% 1.2s 4.7% runtime.memmove
0.8s 3.1% 70.5% 4.4s 17.2% regexp.(*bitState).backtrack
...
The numbers didn't lie. runtime.mallocgc at 16% meant the Garbage Collector was working overtime because of excessive allocations. But more interestingly, encoding/json and regexp were taking up nearly 30% of the total CPU time. This was a massive red flag. Our API is essentially a middleman for LLM responses, and while we do some parsing, it shouldn't be this expensive.
Visualizing Performance Bottlenecks with pprof Flame Graphs
Visualizing profile data through flame graphs reveals hidden costs in the Go runtime, such as reflection overhead during JSON marshaling. The text output is great, but the flame graph is where the story truly unfolds. I ran web (which opens an SVG in the browser) and saw a massive, wide tower belonging to a function I had written just two days prior: processLLMResponse. This function was responsible for cleaning up the raw text coming back from our model providers before sending it to the frontend.
I noticed that the encoding/json.Marshal call was spending a huge amount of time in reflect.Value.Interface. In Go, the standard library JSON encoder uses reflection to inspect structs at runtime. If you're marshaling large, nested objects—which we were doing as part of our LLM tool use optimization analysis—this reflection overhead adds up quickly.
But the real "Aha!" moment came when I looked at the regexp calls. I had implemented a "simple" regex to strip out some unwanted metadata from the LLM strings. It looked something like this:
var metadataRegex = regexp.MustCompile(`(?s)\[metadata\].*?\[/metadata\]`)

func cleanResponse(input string) string {
	return metadataRegex.ReplaceAllString(input, "")
}
In the profile, regexp.(*bitState).backtrack was consuming a significant chunk of CPU. Go's regexp package is RE2-based, so it can't exhibit the exponential "catastrophic backtracking" of PCRE-style engines, but its bounded backtracker still has to walk the entire input when a match can't complete. When the LLM returned a particularly long response without the closing [/metadata] tag, the engine scanned the full string on every request before giving up, burning CPU cycles the whole time.
How to Replace JSON Reflection with Generated Marshaling Code
Replacing reflection-heavy standard libraries with generated code is a proven strategy for reducing CPU utilization in high-throughput APIs. I had two problems to solve: the JSON reflection overhead and the inefficient regex. For the JSON issue, I decided to move away from the standard library's encoding/json for this specific high-traffic endpoint. I've had great success with segmentio/encoding/json or sonic in the past, but for this project, I wanted to try easyjson, which generates marshaling code at compile time, bypassing reflection entirely.
You can find the official documentation for the Go profiler and these performance nuances in the official Go pprof documentation. It's a resource I keep bookmarked because the details on memory profiling are just as critical as CPU profiling.
Here is how I refactored the JSON logic. Instead of:
// Slow reflection-based marshaling
data, err := json.Marshal(largeResponseStruct)
I used the generated methods:
// Fast generated marshaling
data, err := largeResponseStruct.MarshalJSON()
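Conceptually, easyjson's generated MarshalJSON looks like the hand-written sketch below — it satisfies json.Marshaler with direct buffer writes instead of reflection. The Response struct and its fields are hypothetical stand-ins, not the service's real payload, and strconv.Quote is only adequate for plain ASCII here (generated code does full JSON escaping):

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// Response is a hypothetical stand-in for the service's payload struct.
type Response struct {
	Model  string
	Text   string
	Tokens int
}

// MarshalJSON hand-writes the encoding — the same trick easyjson's
// generated code uses: no reflection, just direct buffer writes.
func (r *Response) MarshalJSON() ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteString(`{"model":`)
	buf.WriteString(strconv.Quote(r.Model)) // fine for ASCII; not full JSON escaping
	buf.WriteString(`,"text":`)
	buf.WriteString(strconv.Quote(r.Text))
	buf.WriteString(`,"tokens":`)
	buf.WriteString(strconv.Itoa(r.Tokens))
	buf.WriteByte('}')
	return buf.Bytes(), nil
}

func main() {
	r := &Response{Model: "demo", Text: "hi", Tokens: 2}
	data, _ := r.MarshalJSON()
	fmt.Println(string(data))
}
```

Because the method satisfies json.Marshaler, json.Marshal on the same value also takes the fast path, so callers don't need to change.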
For the regex, I realized I didn't even need a regular expression. A simple strings.Index and some slice manipulation would be orders of magnitude faster and wouldn't suffer from backtracking issues. I’ve written before about building resilient LLM workflows, and part of that resilience is ensuring that your parsing logic doesn't fall over when the LLM gives you garbage or unexpected formats.
func cleanResponseManual(input string) string {
	start := strings.Index(input, "[metadata]")
	if start == -1 {
		return input
	}
	end := strings.Index(input[start:], "[/metadata]")
	if end == -1 {
		// If the tag isn't closed, just strip from start to end of string
		return input[:start]
	}
	return input[:start] + input[start+end+len("[/metadata]"):]
}
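One semantic difference worth flagging: ReplaceAllString removes every metadata block, while the single-pass version only removes the first one. If multiple blocks can appear in one response, a loop-based sketch (cleanResponseAll is a hypothetical name) matches the regex behavior for well-formed blocks while keeping the strip-to-end choice for an unclosed tag:

```go
package main

import (
	"fmt"
	"strings"
)

const openTag, closeTag = "[metadata]", "[/metadata]"

// cleanResponseAll strips every [metadata]...[/metadata] block, matching
// ReplaceAllString's behavior when all blocks are properly closed.
func cleanResponseAll(input string) string {
	var b strings.Builder
	for {
		start := strings.Index(input, openTag)
		if start == -1 {
			b.WriteString(input)
			return b.String()
		}
		b.WriteString(input[:start])
		rest := input[start+len(openTag):]
		end := strings.Index(rest, closeTag)
		if end == -1 {
			// Unclosed tag: drop the remainder, as the single-pass version does.
			return b.String()
		}
		input = rest[end+len(closeTag):]
	}
}

func main() {
	fmt.Println(cleanResponseAll("a[metadata]x[/metadata]b[metadata]y[/metadata]c"))
}
```

Still no backtracking, still one allocation for the builder, and it handles however many blocks the model decides to emit.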
Using Go Benchmarks to Validate Performance Improvements
Benchmarking performance fixes ensures that optimizations provide measurable improvements rather than just shifting the bottleneck elsewhere. I never push a "performance fix" without a benchmark. I wrote a small test to compare the old regex approach with the manual string manipulation using a sample 10KB LLM response.
func BenchmarkCleanResponse(b *testing.B) {
	input := "Some prefix " + strings.Repeat("a", 10000) + "[metadata]secret[/metadata] suffix"
	b.Run("Regex", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			cleanResponse(input)
		}
	})
	b.Run("Manual", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			cleanResponseManual(input)
		}
	})
}
The results were staggering:
- Regex: 142,500 ns/op | 12 allocations/op
- Manual: 1,200 ns/op | 1 allocation/op
The manual string manipulation was over 100 times faster. With the regex removed and easyjson in place, I deployed the changes and watched the GCP console. Within minutes of the new revision rolling out, the CPU usage across the fleet plummeted from 90% to 15%. The auto-scaler immediately began terminating instances, bringing our count back down to 5. Latency dropped back to its sub-200ms baseline.
How to Reduce Garbage Collection Overhead Using sync.Pool
Managing memory allocations through object pooling is critical for preventing the Go garbage collector from consuming excessive CPU resources and causing Go CPU spikes. One thing I haven't mentioned yet is the runtime.mallocgc spike I saw earlier. While the JSON and Regex fixes helped, I noticed that we were still allocating a lot of temporary buffers. This is a common issue when dealing with batch processing workloads where you're handling thousands of strings per second.
I used pprof again, but this time I looked at the heap: go tool pprof http://localhost:6060/debug/pprof/heap. I discovered that we were creating new bytes.Buffer objects for every request. By implementing a sync.Pool to reuse these buffers, I managed to reduce our mallocgc overhead by another 40%.
var bufferPool = sync.Pool{
	New: func() interface{} {
		return new(bytes.Buffer)
	},
}

func processRequest(data []byte) {
	buf := bufferPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufferPool.Put(buf)
	// Use buf for processing...
}
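One pitfall worth calling out when pooling buffers: if a function returns buf.Bytes() directly and then puts the buffer back, the caller is holding memory that the next Get may overwrite. A sketch of the safe pattern (render is an illustrative helper, not from the service):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufferPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds output in a pooled buffer and copies the bytes out before
// the deferred Put runs; returning buf.Bytes() directly would hand the
// caller memory that the next Get may scribble over.
func render(name string) []byte {
	buf := bufferPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufferPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	fmt.Println(string(render("world")))
}
```

The copy costs one allocation, but it's a right-sized one, which is still far cheaper than growing a fresh bytes.Buffer from scratch on every request.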
This change was the "cherry on top." Not only did the CPU usage stabilize, but the "sawtooth" pattern in our memory usage graph—caused by the GC constantly cleaning up short-lived objects—smoothed out into a much more predictable line. This is crucial for Cloud Run, where you pay for the memory you provision; lower memory pressure means you can safely run on smaller (and cheaper) instance sizes.
Key Takeaways for Resolving Go CPU Spikes
- Always profile before you optimize. I was convinced the issue was our database driver. If I hadn't used pprof, I would have spent a day optimizing SQL queries that weren't the problem.
- Standard library JSON has limits. encoding/json is fantastic for its compatibility, but its reliance on reflection makes it a bottleneck for high-throughput APIs processing large structs.
- Regex is a power tool, not a hammer. For simple string stripping or prefix/suffix checks, the strings package is almost always faster and safer.
- GC pressure is CPU pressure. If you see runtime.mallocgc or runtime.gcBgMarkWorker at the top of your pprof output, you don't have a logic problem; you have an allocation problem. Reach for sync.Pool.
- Keep pprof accessible but private. Being able to profile a live production instance while it’s under load is the difference between a 20-minute fix and a 5-hour outage.
Related Reading
- Building Resilient LLM Workflows - Learn how to implement circuit breakers that prevent CPU spikes from cascading into total system failure.
- Lowering LLM API Costs - A comprehensive guide to how optimizing your data structures and API calls can lead to massive savings on GCP and OpenAI bills.
Solving this issue reminded me that as our systems become more complex—especially with the integration of LLMs and high-volume data streams—the fundamentals of performance profiling become even more critical. The tools we use might evolve, but the bottleneck is almost always exactly where we didn't think to look. My next goal is to automate this process by implementing continuous profiling in our staging environment, so we catch these Go CPU spikes before they ever hit a production instance and trigger a 3 AM wake-up call.