Continuous profiling in Go often surfaces slowdowns that never appear in benchmarks; this guide will help you get started.
This is a guest post by P99 CONF speaker Paweł Obrępalski (Staff Engineer @ ShareChat AI). Paweł will be presenting “Go Faster: Tuning the Go Runtime for Latency and Throughput” at P99 CONF 2025 (free + virtual). His session description:
Most Go services don’t need runtime tuning—until they do. At ShareChat, we run hundreds of Go services across thousands of cores, where even small improvements can lead to significant impact. We’ve seen real gains by understanding how the Go runtime behaves under production load—and tuning accordingly.
In this talk, I’ll break down two core components of Go’s runtime: the scheduler and garbage collector—explaining how they influence latency, throughput, and resource usage. Then, I’ll walk through practical tuning techniques using runtime variables (GOGC, GOMEMLIMIT, GOMAXPROCS) and Profile-Guided Optimisation (PGO). Finally, I’ll share lightweight strategies to monitor system behaviour and spot when runtime tuning is warranted—so you can optimise your services without guesswork.
Join P99 CONF 2025 (free + virtual)
This article was originally posted on Paweł’s blog.
***
When working with microservices at scale, performance optimization becomes important for maintaining both system reliability and cost efficiency. Getting visibility into how your Go services behave under load is the first step towards making them faster and more efficient. This post will guide you through setting up essential observability tools: metrics and profiling. While the examples will be specific to Go, the concepts apply to other languages as well.
If you want to follow along, you will need to have Go installed: https://go.dev/doc/install.
You can find all code examples in this site’s repository.
Metrics: Your Service’s Dashboard Gauges
Think of metrics as the dashboard gauges for your application. They provide quantifiable measurements of your service’s health and performance over time, like request rates, error counts, latency distributions, and resource utilization (CPU, memory). Starting with metrics gives you a high-level overview, helps identify trends, and allows you to set up alerts for abnormal behavior.
Runtime metrics can be easily added using Prometheus, a popular open-source monitoring and alerting toolkit. You can install the Go client library by running:
go get github.com/prometheus/client_golang
After that, adding a basic set of Go runtime metrics (like garbage collection stats, goroutine counts, and memory usage) is as simple as exposing an HTTP endpoint using the default promhttp.Handler:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// promhttp.Handler() serves the default Prometheus registry, which already
	// includes the Go runtime collector (GC stats, goroutine counts, memory usage)
	// and a process collector.
	http.Handle("/metrics", promhttp.Handler())
	log.Println("Serving metrics on localhost:8080/metrics")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Run this example with go run . and the metrics will be available at localhost:8080/metrics.
Tip: In a real setup, you wouldn’t typically access this endpoint directly in your browser. Instead, a Prometheus server would periodically scrape this endpoint, storing the time-series data. You’d then use tools like Grafana to query Prometheus and visualize the metrics. However, checking the /metrics endpoint manually is a great way to quickly verify that your application is exposing metrics as expected, especially when adding custom ones.
Adding More Detailed Metrics with Prometheus
The default handler is a good start, but often you’ll want more detailed information about your application code or the Go process itself. The collectors package provides fine-grained metrics about the Go runtime and the process.
Here’s how you can create a custom registry and add these specific collectors:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// Create a non-global registry.
	reg := prometheus.NewRegistry()
	// Add collectors for Go runtime stats and process stats.
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)
	// Expose metrics using the custom registry.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
This setup gives you deeper insights into Garbage Collector (GC) performance, memory allocation details, and process-level stats like CPU and memory usage, file descriptors, etc. Beyond these, you can easily add your own custom application-specific metrics (e.g., number of cache hits, specific business operation counters). For a complete example of runtime metrics collection, see the prometheus/client_golang example.
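As a hedged illustration of the custom-metrics point (the metric name, label, and cache name below are made up for this sketch, not part of the original example), registering your own counter alongside the runtime collectors might look like this:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)
	// Hypothetical application metric: cache hits, labelled by cache name.
	cacheHits := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myapp_cache_hits_total",
			Help: "Total number of cache hits, by cache name.",
		},
		[]string{"cache"},
	)
	reg.MustRegister(cacheHits)
	// Increment the counter wherever a cache hit happens in your code.
	cacheHits.WithLabelValues("user_profiles").Inc()
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}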
Profiling: Finding the Needles in the Haystack
While metrics give us a great overview (the “what”), they often don’t tell us the “why.” If your latency is high or CPU usage is spiking, metrics alone might not pinpoint the exact lines of code responsible. Their granularity is usually too low for deep optimization. This is where profiling comes in – it helps us look inside the application’s execution to see exactly where time is spent or memory is allocated, revealing the bottlenecks.
Profiling Fundamentals
Profiling captures detailed runtime behavior. For performance optimization, two types are most commonly used:
- CPU Profiling: Captures stack traces over time to show where your program is spending its CPU cycles. Essential for identifying “hot paths” – functions that consume significant processing time.
- Memory Profiling: Takes snapshots of the heap to show where memory is being allocated. Helps understand object lifetimes, identify memory leaks, and analyze garbage collection pressure. Understanding allocation patterns is key to performance in Go.
Go also provides specialized profiles (goroutine, block, mutex), which are useful for diagnosing concurrency issues but are generally used less frequently than CPU and memory profiling for general optimization.
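One caveat worth knowing: the block and mutex profiles collect nothing until you opt in via the runtime package. A minimal sketch (the sampling rates here are arbitrary, not recommendations):
package main
import "runtime"
func init() {
	// Report an average of one blocking event per 1,000,000 ns spent blocked.
	runtime.SetBlockProfileRate(1_000_000)
	// Report roughly 1 in 100 mutex contention events.
	runtime.SetMutexProfileFraction(100)
}
func main() {
	// ... application code; with the rates above, the block and mutex
	// profiles described below will actually contain samples.
}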
Local Profiling with pprof
Go’s built-in pprof package makes profiling straightforward. It can collect profiling data and expose it over an HTTP endpoint for analysis.
Local profiling with pprof is invaluable during the development cycle and for investigating specific, reproducible performance issues. It’s the go-to tool when you need immediate feedback on the performance impact of code changes, or when investigating behavior that continuous profiling in production can’t easily capture or isolate.
Tip: Enabling the pprof endpoint in production-like environments (perhaps on a specific instance, a canary, or behind authentication) can be invaluable for quick troubleshooting of live issues.
The easiest way to enable this is via a side-effect import of net/http/pprof.
package main
import (
	"log"
	"net/http"
	// Import pprof package for side effects: registers HTTP handlers.
	// We use the blank identifier _ because we only need the side effects (handler registration)
	// from its init() function, not any functions directly from the package.
	_ "net/http/pprof"
)
func main() {
	// Start the pprof HTTP server on a separate port and goroutine.
	// Running it in a separate goroutine ensures it doesn't block the main application logic.
	// Using a different port (e.g., 6060) is common practice to avoid interfering
	// with the main application's port (e.g., 8080).
	go func() {
		log.Println("Starting pprof server on localhost:6060")
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// Your main service logic would go here...
	// For demonstration, we'll just block forever.
	log.Println("Main application running...")
	select {}
}
This import registers several endpoints under /debug/pprof/ on the default HTTP mux, served here on port 6060:
- /debug/pprof/profile: CPU profile (collects data for a duration, 30s by default).
- /debug/pprof/heap: Memory profile (snapshot of heap allocations).
- /debug/pprof/goroutine: Goroutine profile (stack traces of all current goroutines).
- /debug/pprof/block: Block profile (stack traces that led to blocking on synchronization primitives).
- /debug/pprof/mutex: Mutex profile (stack traces of holders of contended mutexes).
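If you’d rather not expose these on the default mux (for example, to put them behind auth middleware as suggested in the tip above), a sketch along these lines registers the handlers explicitly; the mux wiring is illustrative, not part of the original example:
package main
import (
	"log"
	"net/http"
	"net/http/pprof"
)
func main() {
	// Register the pprof handlers on a dedicated mux instead of relying on the
	// side-effect import; useful when you want to wrap them in middleware or
	// bind them to an internal-only listener.
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	// The named profiles (heap, goroutine, block, mutex) are served through the
	// /debug/pprof/ prefix handled by pprof.Index.
	log.Fatal(http.ListenAndServe("localhost:6060", mux))
}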
Once your service is running, you can analyze these profiles using the go tool pprof command. The most convenient way is often the -http flag, which fetches the profile data and launches an interactive web UI:
# Analyze CPU profile (will collect data for 30s)
go tool pprof -http=:9090 localhost:6060/debug/pprof/profile
# Analyze memory profile (instantaneous snapshot)
go tool pprof -http=:9090 localhost:6060/debug/pprof/heap
Running these commands will fetch the profile data from your running application and open a web browser interface served on port 9090, allowing you to explore the data visually.
Tip: If profiling reveals significant time spent in standard library functions (like marshalling/unmarshalling or compression), consider evaluating high-performance third-party alternatives (e.g., sonic for JSON). Always benchmark to confirm improvements in your specific use case.
Understanding Flame Graphs
One of the most powerful visualizations in the pprof web UI is the flame graph. The interactive flame graph embedded in the original post was generated from a simple Go program designed to highlight CPU usage and allocation patterns; you can find the source code for this example in the /examples/flamegraph directory and play with the full interactive version in the original post.
Flame graphs visualize hierarchical data (like call stacks) effectively. Key things to understand when reading any flame graph:
- Y-axis: Represents the stack depth (function calls), with the root function (main) typically at the bottom and deeper calls stacked on top.
- X-axis: Spans the sample population. The width of a function block indicates the proportion of time (for CPU profiles) or allocated memory (for heap profiles) spent in that function or the functions it called. Wider blocks mean more time/memory consumption relative to the total profile duration or allocation size.
- Reading: Look for wide plateaus, especially near the top of the graph. These represent functions where significant time is being spent directly. Clicking a block in an interactive viewer zooms in on that part of the hierarchy.
The pprof UI also offers other views like Top (a tabular list of the most expensive functions), Graph (a call-graph visualization, which requires Graphviz), and Source (line-by-line annotation).
You need graphviz installed locally for some visualization options (like the “Graph” view) within the web UI; see the Graphviz downloads page.
Benchmarking Specific Functions with go test
While profiling helps you find bottlenecks in your running application, sometimes you want to measure the performance of a specific piece of code in isolation or compare different implementations of a function. Go has excellent built-in support for this via its testing package, which includes benchmarking capabilities.
Benchmarks live in _test.go files alongside your regular tests. They look similar to tests but follow the BenchmarkXxx naming convention and accept a *testing.B parameter.
Here’s a simple example. Suppose we have a function ConcatenateStrings:
package main
import "strings"
func ConcatenateStrings(parts []string) string {
	return strings.Join(parts, "")
}
// Slower implementation for comparison:
func ConcatenateStringsSlowly(parts []string) string {
	var result string
	for _, s := range parts {
		result += s
	}
	return result
}
We can write benchmarks for these in concat_test.go:
package main
import "testing"
var input = []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
func BenchmarkConcatenateStrings(b *testing.B) {
	// The loop runs b.N times. b.N is adjusted by the testing framework
	// until the benchmark runs for a stable, measurable duration.
	for i := 0; i < b.N; i++ {
		ConcatenateStrings(input)
	}
}
func BenchmarkConcatenateStringsSlowly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcatenateStringsSlowly(input)
	}
}
You run benchmarks using the go test command with the -bench flag. The . argument is a regular expression that matches all benchmark names in the current package:
# Run all benchmarks in the current folder
go test -bench=.
# Add memory allocation stats
go test -bench=. -benchmem
The results clearly show that ConcatenateStrings is not only ~2.5x faster (73 ns vs. 186 ns per operation) but also makes fewer allocations (1 vs. 9).
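If you always want allocation stats for a particular benchmark without remembering the -benchmem flag, you can call b.ReportAllocs inside it. A small variant of the benchmark above (the function name here is hypothetical) that could live in the same concat_test.go:
func BenchmarkConcatenateStringsSlowlyAllocs(b *testing.B) {
	b.ReportAllocs() // report allocations for this benchmark even without -benchmem
	for i := 0; i < b.N; i++ {
		ConcatenateStringsSlowly(input)
	}
}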
Benchmarking is invaluable for:
- Validating the performance impact of code changes before merging.
- Comparing performance of different algorithms/libraries.
- Micro-optimizing critical functions identified through profiling.
Beyond timing and allocations, Go also offers execution tracing specifically during tests via go test -trace=trace.out. This generates a trace file that can be visualized with go tool trace trace.out.
Tip: You can generate CPU and memory profiles specifically for your benchmark execution using flags like -cpuprofile cpu.prof and -memprofile mem.prof.
Continuous Profiling: Performance Insights from Production
While local profiling is great for development and debugging specific issues, continuous profiling captures data from your live production environment over time. This provides invaluable insights into real-world performance, helps catch regressions early, and allows comparison across deployments.
Benefits:
- Understand performance under actual production load and traffic patterns.
- Easily compare performance between versions (e.g., canary vs. stable).
- Quickly identify performance regressions introduced by new code.
- Low overhead; profiles are collected periodically across deployments.
- Available on most major cloud providers and as third-party solutions.
- Optimize resource utilization by identifying CPU and memory trends over time.
Google Cloud Profiler Example
Cloud Profiler makes it easy to get started with continuous profiling on Google Cloud. It allows version-to-version comparisons, which is perfect for analyzing the impact of a new deployment.
Ensure the service account your application runs under has the roles/cloudprofiler.agent IAM role so it can submit profiles.
Integrating the profiler is straightforward:
package main
import (
	"log"
	"os"
	"cloud.google.com/go/profiler"
)
func main() {
	// Configuration for the profiler.
	cfg := profiler.Config{
		Service:        "your-service-name",      // Replace with your service name
		ServiceVersion: os.Getenv("APP_VERSION"), // Use an env var for the version (e.g., BUILD_ID, git SHA)
		// ProjectID is optional if running on GCP infra (it is inferred).
		// ProjectID: "your-gcp-project-id",
	}
	// Start the profiler. Usually you wouldn't stop the app if the profiler
	// fails to start, so just log the error and continue.
	if err := profiler.Start(cfg); err != nil {
		log.Printf("WARN: Failed to start profiler: %v", err)
	}
	// ... rest of your application startup and logic ...
	log.Println("Application started...")
}
Tip: Using an environment variable like APP_VERSION is highly recommended. This allows Cloud Profiler (and other tools) to correlate performance data directly with specific code versions.
For details, see the Google Cloud Profiler Go setup documentation.
Other Profiling Solutions
Most major cloud providers offer profiling tools, though language support varies (e.g., AWS CodeGuru for Java/Python, Azure Monitor for .NET). There are also excellent third-party, language-agnostic observability platforms that include continuous profiling, such as Datadog, New Relic, Honeycomb, and Grafana Pyroscope.
Profile-Guided Optimization (PGO): Letting Profiles Drive Compilation
Continuous profiling gives us insights into production behavior. What if we could feed those insights back into the compiler? Since Go 1.21, the compiler includes built-in support for Profile-Guided Optimization (PGO), which is enabled by default. PGO uses CPU profiles gathered from real-world application runs to make more informed optimization decisions during the build process.
The core idea is simple: if the compiler knows which parts of your code are executed most frequently (the “hot paths” identified in a CPU profile), it can apply more aggressive optimizations to those specific areas. A primary example of such optimization is improved inlining.
Getting Started with PGO
- Obtain a Profile: Collect a representative CPU profile (pprof format) from your application running under realistic load.
- Place the Profile: Copy the profile file (e.g., cpu.pprof) into the root directory of your main package and rename it to default.pgo.
- Build: Run go build. Go 1.21+ automatically detects default.pgo and uses it to guide optimizations.
You can also explicitly specify a profile using go build -pgo=/path/to/profile.pprof, or disable PGO with go build -pgo=off.
Benefits and Considerations
- Performance Gains: Typical CPU-bound benchmarks may see 2–7% improvements; I/O-bound services may see less.
- Increased Build Times: PGO can significantly increase build times because dependencies may be rebuilt. Ensure good build caching.
- Measure Impact: Benchmark with and without PGO for your workload before committing to it.
- Evolving Feature: Expect improvements in future Go releases.
See the official Go docs on Profile-Guided Optimization.
A Brief Word on Tracing
Metrics give you the overview (the “what”), and profiling gives you the deep dive into a single service’s internals. Tracing provides a detailed view of a request or operation’s journey by breaking it down into timed steps called spans. In microservices, distributed tracing propagates context (trace IDs) across services to visualize the end-to-end path.
Implementing tracing often involves instrumenting code to create spans. OpenTelemetry is the emerging standard for observability data, including tracing.
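To make the span concept concrete, here is a minimal sketch of manual instrumentation assuming the go.opentelemetry.io/otel package; the tracer and span names are illustrative, and since no tracer provider or exporter is configured, the spans are effectively no-ops:
package main
import (
	"context"
	"log"
	"go.opentelemetry.io/otel"
)
func fetchUser(ctx context.Context, id string) error {
	// Start a span and end it when the function returns.
	ctx, span := otel.Tracer("user-service").Start(ctx, "fetchUser")
	defer span.End()
	// Do the actual work using ctx so any child spans are linked to this one.
	_ = ctx
	return nil
}
func main() {
	if err := fetchUser(context.Background(), "42"); err != nil {
		log.Fatal(err)
	}
}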
Auto-instrumentation in Go: Go is compiled to native code, making runtime auto-instrumentation harder than in languages like Java/Python. Techniques include compile-time code generation and eBPF-based observation.
Conclusion
Optimizing Go microservices effectively starts with visibility. This post focused on the foundational pillars for observing your application’s behavior:
- Begin with metrics to get a high-level, quantitative view of health and performance.
- When metrics indicate a problem, dive deeper with profiling (local pprof or continuous profilers).
- Leverage collected CPU profiles with PGO so the compiler can optimize real hot paths.
With these tools you can refactor CPU-intensive functions, optimize memory usage, remove unnecessary work, and set meaningful SLOs based on latency distributions. Up next: tuning the Go runtime itself.