Speakers calling BS on the very essence of a conference’s name is typically not socially acceptable. But P99 CONF isn’t your usual conference. The community – and the hosts – loved it when three daring speakers openly challenged the value of P99 latencies at P99 CONF 2023.
The P99 shakedown started with Gil Tene (known for his “oh sh*t” talk on how NOT to measure latency) kicking off the conference with his take on “Misery Metrics & Consequences.” Alex Hidalgo picked up the baton with “Throw Away Your Nines.” Then, Charity Majors took it across the finish line with the conference’s most colorful (literally and figuratively) expression of “P99 is BS.”
So what’s the problem with P99s? And if they truly are BS, what should we look at instead?
P99 CONF 2023 is a free virtual conference on low-latency engineering strategies. Join the community in October for a no-holds-barred exploration of topics like Rust, observability, edge, performance tuning, AI/ML, Kubernetes, the Linux kernel and more – with an all-star lineup of experts, including engineers from almost any impressive tech company you can imagine.
Gil Tene: How I Learned to Stop Worrying and Love Misery
Gil Tene, CTO at Azul, launched his takedown of P99 “percentlies” (his spelling, not a typo) with a seemingly pretty chart – filled with pretty little lies:
This dashboard shows 25th, 50th, 90th and 95th percentiles over a period of two hours, and it seems to give us a pretty good idea of what’s going on. What stands out is the fact that the 95th percentile spikes up for some reason around 12:40, the 90th percentile spikes up a little bit under it, and the other percentiles seem to stay the same. You might read this to mean that there are some outliers you might want to explore, but the rest seems generally OK.
But looking at this is a pure waste of time, according to Tene. “What is not shown on this chart is the 5% of things that are worse than the 95th percentile. This chart shows the good stuff; it only shows happy results. For this chart to show even this spike, it has to be so bad that more than 5% of the things that we see have reached this level. This is the chart you show if you want to hide from reality or to hide reality from other people. It’s a great chart to show if you are not doing your job well, but you want your bonus anyway.”
From Bad to Worse
So bring on the P99s and it’s all better? Well, it does provide more insight into how bad it really is. With P99s, something like this…
turns into this…
But, even as bad as it seems now, you’re still totally overlooking the 1% of things that are worse than everything shown here. As Tene emphasizes, “The act of showing percentiles – only to some depth, to a few numbers, usually nines – is the act of hiding from reality… If you care about service levels, if you care about the behavior of your systems as seen by end users, by customers, by businesses – you need to look even deeper than that.”
When you measure the P99, what are the chances that your end user (a person or a client) would experience something worse than the 99th percentile? Pretty good, actually. P99 does NOT mean that 99% of things will be better than that. Tene takes the example of a super simple user session that involves just five page loads averaging 40 resources per page. How many users will NOT experience something worse than the 99th percentile of HTTP requests? Only about 13%. In other words, 87% of your end users experience something worse than your P99.
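Tene's 13% figure is easy to verify: with 5 page loads averaging 40 resources each, a session issues about 200 HTTP requests, and if each request independently has a 99% chance of beating the P99, the chance that all 200 do is 0.99^200. A quick sketch of the arithmetic (the independence assumption is ours, for illustration):

```python
# Probability that a user session sees something worse than the P99,
# assuming independent request latencies (Tene's example: 5 page loads
# averaging 40 resources per page).
requests_per_session = 5 * 40  # ~200 HTTP requests per session

p_all_better = 0.99 ** requests_per_session  # every request beats the P99
p_some_worse = 1 - p_all_better              # at least one request is worse

print(f"Users who never see worse than P99: {p_all_better:.1%}")   # ~13.4%
print(f"Users who see something worse:      {p_some_worse:.1%}")   # ~86.6%
```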
So. Many. Nines.
So, do you obsess over even more nines? Well, that’s problematic too: “There’s not enough data points in the period of time we’re measuring to classify them with enough nines. And we don’t tend to aggregate things across the data sets in order to actually extract a higher number of nines from larger data.” Tene created HdrHistogram to account for this, enabling people to record data and to accumulate intervals of data together to get an accurate number of nines. But he’s given up on that tool achieving broad adoption (“The chances of people actually using well-structured histograms for a larger number of nines is not very high”).
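The "not enough data points" problem has a simple back-of-envelope form: to observe even one sample beyond the Pxx percentile, a measurement window needs on the order of 10^nines data points, and far more for a stable estimate. A minimal sketch (illustrative, not from Tene's talk):

```python
# Rough rule of thumb: resolving a percentile with N nines requires
# at least 10**N samples in the window -- one reason short windows
# can't honestly report a P99.999. (Illustrative sketch only.)
def min_samples(nines: int) -> int:
    # P99 = 2 nines, P99.9 = 3 nines, P99.99 = 4 nines, ...
    return 10 ** nines

for nines in range(2, 6):
    label = "P99" + ("." + "9" * (nines - 2) if nines > 2 else "")
    print(f"{label}: at least {min_samples(nines):,} samples per window")
```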
And although it’s hard to measure all these nines, it’s even harder to do it well. That requires accounting for coordinated omission, a much larger topic than we can adequately cover here. Quick summary: a slow operation gets measured only once, and the ripple effect of all the other delays it causes doesn’t get measured at all. This skews results immensely.
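The core of the correction (the idea behind HdrHistogram's "corrected" recording mode) can be sketched in a few lines. The function name and the synthetic numbers below are illustrative: when a measured latency exceeds the intended send interval, the sends that were silently stalled behind it are filled in as extra samples.

```python
# Sketch of coordinated-omission correction: a 1-second stall at a
# 100 ms send rate doesn't just mean one slow sample -- it also hides
# nine requests that would have been issued (and delayed) during the
# stall. Backfill them so the histogram reflects what users felt.
def correct_coordinated_omission(latencies_ms, expected_interval_ms):
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)
        # Each swallowed interval hides a request that would have
        # completed latency - k * interval later than intended.
        missed = latency - expected_interval_ms
        while missed > 0:
            corrected.append(missed)
            missed -= expected_interval_ms
    return corrected

# One 1000 ms stall at a 100 ms send rate hides nine delayed requests.
raw = [100, 100, 1000, 100]
fixed = correct_coordinated_omission(raw, 100)
print(len(raw), "->", len(fixed))  # 4 -> 13 samples
```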
Misery Loves … Better User Experiences?
At this trough of despair, Tene pauses to reflect: “So if things are so broken, and we just can’t seem to get anything right here, what do we do? Do we just give up?” Fortunately, no. Some things people are doing actually do help us provide better experiences. Case in point: misery metrics.
Think about what failure looks like in your system and measure the corresponding “misery metrics.” You can reliably measure indicators like timeouts, retries, failed queries and even business-focused metrics like abandoned shopping carts. If those numbers are trending in a bad direction, who cares what your P99 is? There’s clearly a problem … go diagnose and fix it!
Tene wraps up with his take on how to monitor these misery metrics. “Plot the bad thing, watch it, and see how it reacts to the world, see how it reacts to load. Your success rate is measured by the number of things that aren’t broken. In general, you will know when things aren’t good enough. If you don’t deliver a result within 50 milliseconds, users will move on. If you don’t let them check out of their shopping cart in 5 seconds, they’ll abandon. Usually, being better than needed is not a benefit. But being even a little bit worse than needed is enough to lose business.”
This recap is really just scratching the surface. Be sure to catch all the details, straight from the mouth of the “how NOT to measure latency” master himself.
Throw Away Some Nines
For even more critical perspectives on measuring P99s, take a look at Charity Majors’ spirited P99 CONF keynote as well as Alex Hidalgo’s carefully crafted exploration of all the nines. Also see this great writeup by Jessica Wachtel.
Majors: You’re Careening Down the Freeway Without Your Glasses On
Early on, Charity Majors, CTO at Honeycomb, flat out states that P99 is BS because every user interaction counts. Even if you’re hitting all your nines, “there can still be a whole bunch of pathologies. Everybody who logged in today might have had their state saved on an unresponsive shard, payments might be failing – there’s an infinitely long, thin tail of things that almost never happen that someday will happen – and whenever they do, they will inevitably bite you.”
Rather than obsess over nines, Majors wants us to look inward and focus on gaining better and faster visibility into what’s really going on inside the system. “Observability lets you inspect cause and effect at a very granular level. It connects effort to output, it connects cause to effect and helps you to iterate and improve on what works using a magnifying glass. Without observability, you are really driving blind, you’re careening down the freeway without your glasses on.”
She continues, “Only a very, very small fraction of system problems and bugs ever actually need to be closely understood. But, that tiny percentage has an outsized effect on the success of your business and the happiness of your users. And the tricky part is that you can never predict in advance what they’re going to be.”
And here’s a summary of Majors’ colorful solution:
She explains, “Really, it’s about empowering software engineers to own their own code, and the way that you do this is by instrumenting it as you go. Never accept a pull request unless you can explain how you can tell if it breaks and what the instrumentation is… If you can do this, shipping software reliably with fast feedback loops, you can probably catch upwards of 80% of all problems before your users do.”
Note that in many respects, Majors is actually quite aligned with Gil’s emphasis on so-called misery metrics: “Instead of alerting on hundreds or thousands of symptom-based monitoring checks, alert only on a few precious SLOs that directly reflect user pain.”
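The SLO-based alerting Majors describes is often implemented as an error-budget check: page a human when the budget is exhausted, not when an individual symptom check trips. A hedged sketch (numbers and function names are illustrative, not from the talk):

```python
# Sketch of SLO-based alerting: alert on error-budget exhaustion
# rather than on hundreds of symptom-based monitoring checks.
def error_budget_remaining(slo_target: float, total: int, bad: int) -> float:
    """How many bad events the SLO still allows in this window."""
    allowed = (1 - slo_target) * total
    return allowed - bad

def should_alert(slo_target: float, total: int, bad: int) -> bool:
    # Page a human only when the budget is actually spent.
    return error_budget_remaining(slo_target, total, bad) <= 0

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures.
print(should_alert(0.999, total=1_000_000, bad=600))    # within budget
print(should_alert(0.999, total=1_000_000, bad=1_500))  # budget blown
```

Real-world systems usually alert on burn *rate* (how fast the budget is being consumed) rather than this simple threshold, but the principle is the same: a few SLOs that reflect user pain, not a wall of checks.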
Hidalgo: ~~Kill~~ Rethink Your Darlings
Spoiler alert: Alex Hidalgo, principal reliability advocate at Nobl9, isn’t a total nine hater. A key takeaway from his P99 CONF session was to look beyond the nines if you want to improve experiences for real-world users working with real-world applications run by teams with real-world budgets.
P99 latencies assume a standard long-tail distribution, but, as Hidalgo has experienced, that’s not always what you get. As a former Google site reliability engineer, he’s seen quite a variety of distributions: bimodal, left-skewed, multimodal, to name a few. If you just look at P99s because that’s what you always do, then you could be missing a lot. “Nines aren’t terrible to use, but they’re not always the right choice,” he explained. “You need to be meaningful.”
But what does it mean to be meaningful in this context? As Hidalgo wraps up: “Actually, using the number nine is just fine. There’s nothing wrong with aiming for 99.9% reliability. There’s absolutely nothing wrong with using the 99th percentile for your latency measurements. Because part of being meaningful about your choices, and meaningful about your decisions, is that you should be looking to the past. You just don’t want to copy the past. Sometimes what was done before was actually a very good idea. There’s a reason why some of these things are so common – because very often they are the right choices. You just need to make sure that they’re the right choices for you.”
Continue the Latency Conversation at P99 CONF 23
So maybe we don’t need to throw away all of our nines – and we’re keeping the conference name as is, despite the snarky banter in the chat. 😉
If you want to continue the conversation on measuring and optimizing latency, join your peers at P99 CONF, where friendly debate is not only welcomed but encouraged. Topics up for discussion this year span domains like:
- Rust – Optimizations, case studies, future use cases, Rust vs C++, Zig, Go
- Kubernetes – Database scaling, application optimization, edge, benchmarking
- Databases and event streaming – SQL, NoSQL, caching, data streaming
- AI/ML – feature stores, real-time model predictions
- Edge – Databases, unikernels, API gateways
- Observability – eBPF, tracing, OpenTelemetry