Continuous profiling in Go often surfaces slowdowns that never appear in benchmarks; this guide will help you get started.
This is a guest post by P99 CONF speaker Paweł Obrępalski (Staff Engineer @ ShareChat AI). Paweł will be presenting “Go Faster: Tuning the Go Runtime for Latency and Throughput” at P99 CONF 2025 (free + virtual). His session description:
Most Go services don’t need runtime tuning—until they do. At ShareChat, we run hundreds of Go services across thousands of cores, where even small improvements can lead to significant impact. We’ve seen real gains by understanding how the Go runtime behaves under production load—and tuning accordingly.
In this talk, I’ll break down two core components of Go’s runtime: the scheduler and garbage collector—explaining how they influence latency, throughput, and resource usage. Then, I’ll walk through practical tuning techniques using runtime variables (GOGC, GOMEMLIMIT, GOMAXPROCS) and Profile-Guided Optimisation (PGO). Finally, I’ll share lightweight strategies to monitor system behaviour and spot when runtime tuning is warranted—so you can optimise your services without guesswork.
Join P99 CONF 2025 (free + virtual)
This article was originally posted on Paweł’s blog.
***
When working with microservices at scale, performance optimization becomes important for maintaining both system reliability and cost efficiency. Getting visibility into how your Go services behave under load is the first step towards making them faster and more efficient. This post will guide you through setting up essential observability tools: metrics and profiling. While the examples will be specific to Go, the concepts apply to other languages as well.
If you want to follow along, you will need to have Go installed: https://go.dev/doc/install.
You can find all code examples in this site’s repository.
Metrics: Your Service’s Dashboard Gauges
Think of metrics as the dashboard gauges for your application. They provide quantifiable measurements of your service’s health and performance over time, like request rates, error counts, latency distributions, and resource utilization (CPU, memory). Starting with metrics gives you a high-level overview, helps identify trends, and allows you to set up alerts for abnormal behavior.
Runtime metrics can be easily added using Prometheus, a popular open-source monitoring and alerting toolkit. You can install the Go client library by running:
go get github.com/prometheus/client_golang
After that, adding a basic set of Go runtime metrics (like garbage collection stats, goroutine counts, and memory usage) is as simple as exposing an HTTP endpoint using the default promhttp.Handler:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// promhttp.Handler() serves the default Prometheus registry, which already
	// includes the Go runtime collector (GC stats, goroutine counts, memory usage)
	// and a process collector.
	http.Handle("/metrics", promhttp.Handler())
	log.Println("Serving metrics on localhost:8080/metrics")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Run this example with go run . and the metrics will be available at localhost:8080/metrics.
Tip: In a real setup, you wouldn’t typically access this endpoint directly in your browser. Instead, a Prometheus server would periodically scrape this endpoint, storing the time-series data. You’d then use tools like Grafana to query Prometheus and visualize the metrics. However, checking the /metrics endpoint manually is a great way to quickly verify that your application is exposing metrics as expected, especially when adding custom ones.
Adding More Detailed Metrics with Prometheus
The default handler is a good start, but often you’ll want more detailed information about your application code or the Go process itself. The collectors package provides fine-grained metrics about the Go runtime and the process.
Here’s how you can create a custom registry and add these specific collectors:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// Create a non-global registry.
	reg := prometheus.NewRegistry()
	// Add collectors for Go runtime stats and process stats.
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)
	// Expose metrics using the custom registry.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
This setup gives you deeper insights into Garbage Collector (GC) performance, memory allocation details, and process-level stats like CPU and memory usage, file descriptors, etc. Beyond these, you can easily add your own custom application-specific metrics (e.g., number of cache hits, specific business operation counters). For a complete example of runtime metrics collection, see the prometheus/client_golang example.
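As a hedged illustration of the custom-metrics point (the metric name, label, and cache name below are made up for this sketch, not part of the original example), registering your own counter alongside the runtime collectors might look like this:
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)
	// Hypothetical application metric: cache hits, labelled by cache name.
	cacheHits := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "myapp_cache_hits_total",
			Help: "Total number of cache hits, by cache name.",
		},
		[]string{"cache"},
	)
	reg.MustRegister(cacheHits)
	// Increment the counter wherever a cache hit happens in your code.
	cacheHits.WithLabelValues("user_profiles").Inc()
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}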
Profiling: Finding the Needles in the Haystack
While metrics give us a great overview (the “what”), they often don’t tell us the “why.” If your latency is high or CPU usage is spiking, metrics alone might not pinpoint the exact lines of code responsible. Their granularity is usually too low for deep optimization. This is where profiling comes in – it helps us look inside the application’s execution to see exactly where time is spent or memory is allocated, revealing the bottlenecks.
Profiling Fundamentals
Profiling captures detailed runtime behavior. For performance optimization, two types are most commonly used:
- CPU Profiling: Captures stack traces over time to show where your program is spending its CPU cycles. Essential for identifying “hot paths” – functions that consume significant processing time.
- Memory Profiling: Takes snapshots of the heap to show where memory is being allocated. Helps understand object lifetimes, identify memory leaks, and analyze garbage collection pressure. Understanding allocation patterns is key to performance in Go.
Go also provides specialized profiles (goroutine, block, mutex), which are useful for diagnosing concurrency issues but are generally used less frequently than CPU and memory profiling for general optimization.
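One caveat worth knowing: the block and mutex profiles collect nothing until you opt in via the runtime package. A minimal sketch (the sampling rates here are arbitrary, not recommendations):
package main
import "runtime"
func init() {
	// Report an average of one blocking event per 1,000,000 ns spent blocked.
	runtime.SetBlockProfileRate(1_000_000)
	// Report roughly 1 in 100 mutex contention events.
	runtime.SetMutexProfileFraction(100)
}
func main() {
	// ... application code; with the rates above, the block and mutex
	// profiles described below will actually contain samples.
}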
Local Profiling with pprof
Go’s built-in pprof package makes profiling straightforward. It can collect profiling data and expose it over an HTTP endpoint for analysis.
Local profiling with pprof is invaluable during the development cycle and for investigating specific, reproducible performance issues. It’s the go-to tool when you need immediate feedback on the performance impact of code changes, or when investigating behavior that continuous profiling in production can’t easily capture or isolate.
Tip: Enabling the pprof endpoint in production-like environments (perhaps on a specific instance, a canary, or behind authentication) can be invaluable for quick troubleshooting of live issues.
The easiest way to enable this is via a side-effect import of net/http/pprof.
package main
import (
	"log"
	"net/http"
	// Import pprof package for side effects: registers HTTP handlers.
	// We use the blank identifier _ because we only need the side effects (handler registration)
	// from its init() function, not any functions directly from the package.
	_ "net/http/pprof"
)
func main() {
	// Start the pprof HTTP server on a separate port and goroutine.
	// Running it in a separate goroutine ensures it doesn't block the main application logic.
	// Using a different port (e.g., 6060) is common practice to avoid interfering
	// with the main application's port (e.g., 8080).
	go func() {
		log.Println("Starting pprof server on localhost:6060")
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// Your main service logic would go here...
	// For demonstration, we'll just block forever.
	log.Println("Main application running...")
	select {}
}
This import registers several endpoints under /debug/pprof/ on the default HTTP mux, served here on port 6060:
- /debug/pprof/profile: CPU profile (collects data for a duration, 30s by default).
- /debug/pprof/heap: Memory profile (snapshot of heap allocations).
- /debug/pprof/goroutine: Goroutine profile (stack traces of all current goroutines).
- /debug/pprof/block: Block profile (stack traces that led to blocking on synchronization primitives).
- /debug/pprof/mutex: Mutex profile (stack traces of holders of contended mutexes).
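If you’d rather not expose these on the default mux (for example, to put them behind auth middleware as suggested in the tip above), a sketch along these lines registers the handlers explicitly; the mux wiring is illustrative, not part of the original example:
package main
import (
	"log"
	"net/http"
	"net/http/pprof"
)
func main() {
	// Register the pprof handlers on a dedicated mux instead of relying on the
	// side-effect import; useful when you want to wrap them in middleware or
	// bind them to an internal-only listener.
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	// The named profiles (heap, goroutine, block, mutex) are served through the
	// /debug/pprof/ prefix handled by pprof.Index.
	log.Fatal(http.ListenAndServe("localhost:6060", mux))
}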
Once your service is running, you can analyze these profiles using the go tool pprof command. The most convenient way is often the -http flag, which fetches the profile data and launches an interactive web UI:
# Analyze CPU profile (will collect data for 30s)
go tool pprof -http=:9090 localhost:6060/debug/pprof/profile
# Analyze memory profile (instantaneous snapshot)
go tool pprof -http=:9090 localhost:6060/debug/pprof/heap
Running these commands will fetch the profile data from your running application and open a web browser interface served on port 9090, allowing you to explore the data visually.
Tip: If profiling reveals significant time spent in standard library functions (like marshalling/unmarshalling or compression), consider evaluating high-performance third-party alternatives (e.g., sonic for JSON). Always benchmark to confirm improvements in your specific use case.
Understanding Flame Graphs
One of the most powerful visualizations in the pprof web UI is the flame graph. The interactive flame graph embedded in the original post was generated from a simple Go program designed to highlight CPU usage and allocation patterns; you can find the source code for this example in the /examples/flamegraph directory and play with the full interactive version in the original post.
Flame graphs visualize hierarchical data (like call stacks) effectively. Key things to understand when reading any flame graph:
- Y-axis: Represents the stack depth (function calls), with the root function (main) typically at the bottom and deeper calls stacked on top.
- X-axis: Spans the sample population. The width of a function block indicates the proportion of time (for CPU profiles) or allocated memory (for heap profiles) spent in that function or the functions it called. Wider blocks mean more time/memory consumption relative to the total profile duration or allocation size.
- Reading: Look for wide plateaus, especially near the top of the graph. These represent functions where significant time is being spent directly. Clicking a block in an interactive viewer zooms in on that part of the hierarchy.
The pprof UI also offers other views like Top (a tabular list of the most expensive functions), Graph (a call-graph visualization, which requires Graphviz), and Source (line-by-line annotation).
You need graphviz installed locally for some visualization options (like the “Graph” view) within the web UI; see the Graphviz downloads page.
Benchmarking Specific Functions with go test
While profiling helps you find bottlenecks in your running application, sometimes you want to measure the performance of a specific piece of code in isolation or compare different implementations of a function. Go has excellent built-in support for this via its testing package, which includes benchmarking capabilities.
Benchmarks live in _test.go files alongside your regular tests. They look similar to tests but follow the BenchmarkXxx naming convention and accept a *testing.B parameter.
Here’s a simple example. Suppose we have a function ConcatenateStrings:
package main
import "strings"
func ConcatenateStrings(parts []string) string {
	return strings.Join(parts, "")
}
// Slower implementation for comparison:
func ConcatenateStringsSlowly(parts []string) string {
	var result string
	for _, s := range parts {
		result += s
	}
	return result
}
We can write benchmarks for these in concat_test.go:
package main
import "testing"
var input = []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
func BenchmarkConcatenateStrings(b *testing.B) {
	// The loop runs b.N times. b.N is adjusted by the testing framework
	// until the benchmark runs for a stable, measurable duration.
	for i := 0; i < b.N; i++ {
		ConcatenateStrings(input)
	}
}
func BenchmarkConcatenateStringsSlowly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcatenateStringsSlowly(input)
	}
}
You run benchmarks using the go test command with the -bench flag. The . argument is a regular expression that matches all benchmark names in the current package:
# Run all benchmarks in the current folder
go test -bench=.
# Add memory allocation stats
go test -bench=. -benchmem
The results clearly show that ConcatenateStrings is not only ~2.5x faster (73 ns vs. 186 ns per operation) but also makes fewer allocations (1 vs. 9).
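If you always want allocation stats for a particular benchmark without remembering the -benchmem flag, you can call b.ReportAllocs inside it. A small variant of the benchmark above (the function name here is hypothetical) that could live in the same concat_test.go:
func BenchmarkConcatenateStringsSlowlyAllocs(b *testing.B) {
	b.ReportAllocs() // report allocations for this benchmark even without -benchmem
	for i := 0; i < b.N; i++ {
		ConcatenateStringsSlowly(input)
	}
}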
Benchmarking is invaluable for:
- Validating the performance impact of code changes before merging.
- Comparing performance of different algorithms/libraries.
- Micro-optimizing critical functions identified through profiling.
Beyond timing and allocations, Go also offers execution tracing specifically during tests via go test -trace=trace.out. This generates a trace file that can be visualized with go tool trace trace.out.
Tip: You can generate CPU and memory profiles specifically for your benchmark execution using flags like -cpuprofile cpu.prof and -memprofile mem.prof.
Continuous Profiling: Performance Insights from Production
While local profiling is great for development and debugging specific issues, continuous profiling captures data from your live production environment over time. This provides invaluable insights into real-world performance, helps catch regressions early, and allows comparison across deployments.
Benefits:
- Understand performance under actual production load and traffic patterns.
- Easily compare performance between versions (e.g., canary vs. stable).
- Quickly identify performance regressions introduced by new code.
- Low overhead; profiles are collected periodically across deployments.
- Available on most major cloud providers and as third-party solutions.
- Optimize resource utilization by identifying CPU and memory trends over time.
Google Cloud Profiler Example
Cloud Profiler makes it easy to get started with continuous profiling on Google Cloud. It allows version-to-version comparisons, which is perfect for analyzing the impact of a new deployment.
Ensure the service account your application runs under has the roles/cloudprofiler.agent IAM role so it can submit profiles.
Integrating the profiler is straightforward:
package main
import (
	"log"
	"os"
	"cloud.google.com/go/profiler"
)
func main() {
	// Configuration for the profiler.
	cfg := profiler.Config{
		Service:        "your-service-name",      // Replace with your service name
		ServiceVersion: os.Getenv("APP_VERSION"), // Use an env var for the version (e.g., BUILD_ID, git SHA)
		// ProjectID is optional if running on GCP infra (it is inferred).
		// ProjectID: "your-gcp-project-id",
	}
	// Start the profiler. Usually you wouldn't stop the app if the profiler
	// fails to start, so just log the error and continue.
	if err := profiler.Start(cfg); err != nil {
		log.Printf("WARN: Failed to start profiler: %v", err)
	}
	// ... rest of your application startup and logic ...
	log.Println("Application started...")
}
Tip: Using an environment variable like APP_VERSION is highly recommended. This allows Cloud Profiler (and other tools) to correlate performance data directly with specific code versions.
For details, see the Google Cloud Profiler Go setup documentation.
Other Profiling Solutions
Most major cloud providers offer profiling tools, though language support varies (e.g., AWS CodeGuru for Java/Python, Azure Monitor for .NET). There are also excellent third-party, language-agnostic observability platforms that include continuous profiling, such as Datadog, New Relic, Honeycomb, and Grafana Pyroscope.
Profile-Guided Optimization (PGO): Letting Profiles Drive Compilation
Continuous profiling gives us insights into production behavior. What if we could feed those insights back into the compiler? Since Go 1.21, the compiler includes built-in support for Profile-Guided Optimization (PGO), which is enabled by default. PGO uses CPU profiles gathered from real-world application runs to make more informed optimization decisions during the build process.
The core idea is simple: if the compiler knows which parts of your code are executed most frequently (the “hot paths” identified in a CPU profile), it can apply more aggressive optimizations to those specific areas. A primary example of such optimization is improved inlining.
Getting Started with PGO
- Obtain a Profile: Collect a representative CPU profile (pprof format) from your application running under realistic load.
- Place the Profile: Copy the profile file (e.g., cpu.pprof) into the root directory of your main package and rename it to default.pgo.
- Build: Run go build. Go 1.21+ automatically detects default.pgo and uses it to guide optimizations.
You can also explicitly specify a profile using go build -pgo=/path/to/profile.pprof, or disable PGO with go build -pgo=off.
Benefits and Considerations
- Performance Gains: Typical CPU-bound benchmarks may see 2–7% improvements; I/O-bound services may see less.
- Increased Build Times: PGO can significantly increase build times because dependencies may be rebuilt. Ensure good build caching.
- Measure Impact: Benchmark with and without PGO for your workload before committing to it.
- Evolving Feature: Expect improvements in future Go releases.
See the official Go docs on Profile-Guided Optimization.
A Brief Word on Tracing
Metrics give you the overview (the “what”), and profiling gives you the deep dive into a single service’s internals. Tracing provides a detailed view of a request or operation’s journey by breaking it down into timed steps called spans. In microservices, distributed tracing propagates context (trace IDs) across services to visualize the end-to-end path.
Implementing tracing often involves instrumenting code to create spans. OpenTelemetry is the emerging standard for observability data, including tracing.
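To make the span concept concrete, here is a minimal sketch of manual instrumentation assuming the go.opentelemetry.io/otel package; the tracer and span names are illustrative, and since no tracer provider or exporter is configured, the spans are effectively no-ops:
package main
import (
	"context"
	"log"
	"go.opentelemetry.io/otel"
)
func fetchUser(ctx context.Context, id string) error {
	// Start a span and end it when the function returns.
	ctx, span := otel.Tracer("user-service").Start(ctx, "fetchUser")
	defer span.End()
	// Do the actual work using ctx so any child spans are linked to this one.
	_ = ctx
	return nil
}
func main() {
	if err := fetchUser(context.Background(), "42"); err != nil {
		log.Fatal(err)
	}
}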
Auto-instrumentation in Go: Go is compiled to native code, making runtime auto-instrumentation harder than in languages like Java/Python. Techniques include compile-time code generation and eBPF-based observation.
Conclusion
Optimizing Go microservices effectively starts with visibility. This post focused on the foundational pillars for observing your application’s behavior:
- Begin with metrics to get a high-level, quantitative view of health and performance.
- When metrics indicate a problem, dive deeper with profiling (local pprof or continuous profilers).
- Leverage collected CPU profiles with PGO so the compiler can optimize real hot paths.
With these tools you can refactor CPU-intensive functions, optimize memory usage, remove unnecessary work, and set meaningful SLOs based on latency distributions. Up next: tuning the Go runtime itself.