P99 CONF – an open-source-community-focused conference for engineers who obsess over low latency – kicked off with Gil Tene’s take on “misery metrics” and wrapped with a look at P99 latency reduction as a Sisyphean task. In between, engineers spoke of their many hard-fought performance battles and lessons learned – across infrastructure engineering, programming languages, and even the art and science of measuring performance.
And thousands of engineers attending live had a great time digging into all this pain and misery. After all, where else can you sharpen your own latency-slaying ax in just a few hours – learning vicariously from top engineers’ experience wrestling with the toughest performance challenges on the planet? Only at P99 CONF, which is now available for on-demand binge-watching.
Can Go, with carefully-crafted performance optimizations, rival the speed of Rust? How do WebAssembly and Rust fit into a low-latency C++ project? Who wins in a head-to-head performance comparison between the Linux kernel’s network stack and DPDK? What are the latest use cases for eBPF and io_uring? How do teams at the likes of Uber, Lyft and Square satisfy the vaunted two, three, four and even five “nines”? And should we just throw away our nines altogether? These discussions were all part of the conference.
Here’s a look at the P99 CONF keynotes, followed by a quick rundown of the most-watched 20-minute talks.
Gil Tene: Misery Metrics & Consequences
Gil Tene’s now legendary “oh sh*t” talk has prompted many engineers to rethink their approach to measuring P99 and similar latencies. But after obsessing over how to improve performance measurement for over a decade and seeing how distributed systems are evolving, Gil’s own approach to the topic has also evolved.
From Dr. Strangelove to The Matrix to sacrificial goats, Tene took the audience on a journey that ranged from high latency peaks to a trough of measurement despair. He deconstructed what we’re really looking at with “all the pretty charts,” and, more importantly, what we’re missing. This talk will leave you wondering what’s really going on behind the charts you see across our event logos – and probably thinking quite a bit about the very idea of P99. At what seems to be the bottom of the trough of despair, Tene shows a rather terrifying example of how actions that improve performance on a chart can seriously undermine the end-user experience. Misery indeed.
But Tene remains confident that it is, in fact, possible to overcome the shortcomings of our accepted performance measurement methodologies…just not in the way he originally believed we could. There’s hope in misery. A light at the end of the tunnel. Rainbows, even (perhaps because Tene was joining us from Hawaii). As it turns out, engineers can prevent misery by learning to love misery. And if that doesn’t yet make sense to you, it’s time to grab the red pill and watch his session.
Liz Rice: Using eBPF for High-Performance Networking in Cilium
Liz Rice walked attendees through how Cilium (part of the CNCF) improves throughput, frees up CPU usage and makes Kubernetes networking more efficient by using eBPF to bypass parts of the network stack.
Using XDP (eXpress Data Path), Cilium can run eBPF programs on the network interface card – enabling you to take advantage of eBPF as soon as a network packet arrives. For example, as Rice demonstrates, you could use eBPF as a very fast and efficient way to identify and discard “packets of death.” Notably, such a mitigation can be loaded dynamically, without installing a kernel patch or rebooting machines. And that’s just one case of how you can use eBPF to dynamically change the behavior of networking in a system.
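The real mitigation Rice describes is a small eBPF program written in C and attached at the XDP hook; purely as an illustration of the decision such a program makes, here is a hedged Python sketch. The verdict constants mirror XDP's return codes, and the "death" signature bytes are entirely hypothetical:

```python
# Illustrative sketch only (not real eBPF): the verdict logic a
# "packet of death" filter applies at the XDP hook. A production
# mitigation is a C eBPF program returning XDP_DROP or XDP_PASS.

XDP_DROP, XDP_PASS = 1, 2  # mirror the XDP verdict codes

# Hypothetical byte signatures of known-bad packets.
DEATH_SIGNATURES = [b"\xde\xad\xbe\xef"]

def xdp_filter(packet: bytes) -> int:
    """Return a verdict for a raw frame, as an XDP program would."""
    payload = packet[14:]  # skip the 14-byte Ethernet header
    for sig in DEATH_SIGNATURES:
        if payload.startswith(sig):
            return XDP_DROP  # discard before the kernel stack sees it
    return XDP_PASS  # hand the packet on to the normal network stack
```

Because the verdict is rendered before the kernel's networking stack runs, a matching packet costs almost nothing to discard – which is what makes this kind of filter so cheap to deploy against a flood of bad traffic.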
eBPF can also be used for manipulating packets; for example to change the source and destination addresses contained in the packets for load balancing. As a packet arrives, an eBPF XDP program can determine where to send it – on that host or to a different machine – without the packet being processed by the kernel’s networking stack. This enables impressive performance gains (Exhibit A: read how Seznam.cz achieved over two times better throughput and saved an “unbelievable amount of CPU usage” by running an XDP-based load balancer versus an IPVS-based one.)
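The core of that forwarding decision is typically a hash over the flow's addresses and ports, so every packet of a connection lands on the same backend. This Python sketch shows the idea only; the backend addresses are hypothetical and a real XDP load balancer would do this in eBPF bytecode:

```python
# Illustrative sketch of an XDP load balancer's forwarding decision:
# hash the flow 5-tuple and pick a backend deterministically, so all
# packets of one connection go to the same machine.
import zlib

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical pool

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    flow = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{proto}".encode()
    # crc32 keeps the choice stable for a given flow across packets
    return BACKENDS[zlib.crc32(flow) % len(BACKENDS)]
```

Determinism is the important property here: rehashing the same flow always yields the same backend, so no per-connection state needs to be kept on the load balancer's fast path.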
Looking beyond XDP, eBPF programs can be attached to a variety of different points in the network stack, and this is especially helpful when working with the complex networking stack of Kubernetes. As Rice’s demos, flamegraphs and benchmarks show, this yields yet more opportunities for throughput and CPU gains. Watch the video and see the performance impact for yourself.
Armin Ronacher: Overcoming Variable Payloads to Optimize for Performance
Armin Ronacher shared his experiences building Sentry.io’s event processing pipeline, which is especially interesting (read: confoundingly non-trivial) given that payload sizes vary by multiple orders of magnitude, data complexity affects processing time, and the volume of incoming data streams is impossible to predict.
As an error-tracking and monitoring product, Sentry groups error and transaction events by commonality. This requires fast and complex processing as the events come in. Multiple layers of processing might be required to produce an error report that includes a stack trace, source code, tags and an example screenshot. All this needs to be deduced, almost instantaneously, from a sea of session updates, transaction events, metrics and many, many reports.
This brings quite a few challenges. One is that users expect that reports come in as fast as possible, but an individual crash report can have a high variance of event processing times (from 1 millisecond to over 30 minutes). The delay depends on a number of factors that often aren’t immediately discernible. Another intricacy: What happens at the end of the pipeline can impact the beginning of it. Also, to ensure that the pipeline can be extended closer and closer to the user, there are two layers to the pipeline: the innermost ingestion system, plus a customer relay (an on-prem proxy) to perform actions like stripping out personally identifiable information (PII) before the data leaves the customer’s infrastructure. The data stems from many different sources, and any change to the pipeline – even a bug fix – can break a customer’s workflow.
Not surprisingly, addressing these complex challenges required a carefully crafted, multifaceted strategy. Watch the video to hear how they’ve approached it so far and how they plan to optimize it next.
Dor Laor: 8 Years of Battling P99 Latency at ScyllaDB
For eight years now ScyllaDB, led by Dor Laor, has been battling P99 latency while building an “all-weather, all-terrain, high throughput, low latency database.” Some might feel that this is a Sisyphean task, but for Laor and everyone at ScyllaDB, it’s a labor of love.
Laor began his keynote by outlining three main sources of latency. The first is speed mismatch, most commonly between disk speed and CPU/memory speed. If this mismatch occurs – and the system is not engineered to handle that mismatch – you hit a wall and the P99 latencies are markedly higher than the mean and median latencies. If you’re working on something like a file system or database, this translates to high P99 across the board. After drilling down into the various CPU and I/O differences that drive this speed mismatch, Laor outlines how ScyllaDB addresses it.
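The symptom Laor describes is easy to reproduce in a few lines: when even a small fraction of requests take the slow-device path, the mean and median stay flat while P99 jumps to the slow-path latency. The latency numbers below are hypothetical, chosen only to show the shape of the effect:

```python
# Toy illustration of the speed-mismatch effect: 2% of requests hit a
# slow device path, and P99 lands on the slow path while p50 does not.
import random

random.seed(42)
FAST_US, SLOW_US = 100, 20_000  # hypothetical CPU-path vs disk-path latencies

latencies = [SLOW_US if random.random() < 0.02 else FAST_US
             for _ in range(10_000)]
latencies.sort()
mean = sum(latencies) / len(latencies)
p50 = latencies[len(latencies) // 2]          # median: fast path
p99 = latencies[int(len(latencies) * 0.99)]   # tail: slow path
```

This is why mean and median latencies can look healthy while P99 tells a very different story – and why a system must be engineered for the mismatch rather than averaged past it.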
The second source of latency should be quite familiar to anyone with a teenager: lack of respect for limits. ScyllaDB, like many other systems, has CPU and I/O schedulers that schedule multiple tasks. Each task has a quota (ours is half a millisecond). By the time a task’s quota is up, it’s supposed to yield the CPU to other tasks. But, if you get a self-centered task that doesn’t let others have their turn, P99 latency increases. At ScyllaDB, we combat this by having a block detector that alerts us to task quota violations and even provides us details about what the task was doing at the moment it stalled. Some of the culprits we’ve found include large memory allocations, the memory allocator itself, the cache, the I/O system and sometimes even the Linux kernel.
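The idea behind such a block detector can be sketched in miniature: time each task's slice and record any task that overruns its quota. This is an illustrative sketch only (function names and the reporting format are invented, not ScyllaDB's code); a real detector would also capture a backtrace of the stall:

```python
# Minimal sketch of a task-quota "block detector": time each task's
# slice and record quota violations. Names here are illustrative.
import time

QUOTA_S = 0.0005  # half a millisecond, as in the talk

violations = []

def run_with_quota(task, name):
    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start
    if elapsed > QUOTA_S:
        # a production detector would also grab a backtrace here
        violations.append((name, elapsed))

run_with_quota(lambda: None, "well-behaved")
run_with_quota(lambda: time.sleep(0.005), "self-centered")
```

The value of the real detector is less the timing itself than the context it captures: knowing *what* the task was doing when it stalled is what turns a mysterious P99 spike into a fixable bug.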
The third core source of P99 latency is imperfect isolation. This is quite common in software. ScyllaDB takes a variety of steps to isolate the CPU and the I/O. For example, to isolate CPUs, we use a shard-per-core architecture, which relieves us from locking. Moreover, within every CPU, there’s a scheduler that provides isolation between tasks and ensures that they respect limits. There’s also an I/O scheduler that loads the disk to its sweet spot: the point at which the disk throughput is maximized but I/O isn’t queued and latency is still good. This is done with a userspace I/O scheduler that prioritizes each task: giving tasks that require low latency priority over, say, heavyweight batch tasks such as compaction that are less critical.
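The priority relationship between latency-sensitive queries and batch work like compaction can be illustrated with a simple priority queue. This is a sketch of the dispatch principle only; class names and priority values are hypothetical, and ScyllaDB's actual scheduler is far more sophisticated:

```python
# Sketch of priority-based I/O dispatch: latency-sensitive requests
# are served before heavyweight batch work such as compaction.
import heapq

QUERY, COMPACTION = 0, 10  # lower number = higher priority (illustrative)

queue = []
seq = 0  # tie-breaker preserving FIFO order within a priority class

def submit(priority, request):
    global seq
    heapq.heappush(queue, (priority, seq, request))
    seq += 1

def dispatch():
    return heapq.heappop(queue)[2]

submit(COMPACTION, "compact-sstable-1")
submit(QUERY, "read-key-A")
submit(QUERY, "read-key-B")
order = [dispatch() for _ in range(3)]
# both reads dispatch before the compaction request
```

The design point this mirrors is that compaction still makes progress – it is deprioritized, not starved – while reads never queue behind bulk I/O, which is what keeps the disk at its "sweet spot."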
But then Laor breaks it open: “What is the most notorious source of latencies? Everywhere. Pretty much everything must be perfect to have a low P99.” This includes culture, disaster recovery, OSes, architecture, disks, protocols, clients and more. Watch the video for his take on the many impacts on P99 latency, and examples of how they can be addressed.
Charity Majors: From SLOs to “Game of the Year”
Charity Majors is known for her brutally incisive insights – and her P99 CONF keynote was no exception. There’s a lot that latency-minded engineers can learn from gaming – where anything short of a flawless experience will undermine even the most imaginative design. That’s where tools and telemetry come into play. As Majors puts it, “It’s impossible to not have issues. But it is possible to find and fix issues before users notice. You really want this to be taken for granted by users.”
But how do you measure this experience? Following on the ugly P99 reality first introduced by Gil Tene, then also probed by Alex Hidalgo and Brian Taylor, Majors declares, “Aggregates are bullsh*t. Every individual experience counts. Any one player who can log in can start a sh*tstorm on the forums.” Your system might have four nines but still…
- Everyone who logged in today might have had their state saved on an unresponsive shard – and think you are 100% down.
- Latency for logins might be timing out for everybody in certain regions.
- Upserts to payment might be failing upon registration.
Effective observability is the hidden link between engineering experience and user experience.
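The first bullet above is worth making concrete: a global success rate of "four nines" can coexist with a total outage for every user pinned to one unhealthy shard. The numbers in this sketch are illustrative:

```python
# Why aggregates mislead: a 99.99% global success rate can hide a
# 100% failure rate for one shard's users. Numbers are illustrative.
requests = (
    [("shard-a", True)] * 99_990 +  # healthy shard
    [("shard-b", False)] * 10       # every user on shard-b fails
)

overall = sum(ok for _, ok in requests) / len(requests)

by_shard = {}
for shard, ok in requests:
    good, total = by_shard.get(shard, (0, 0))
    by_shard[shard] = (good + ok, total + 1)

shard_b_rate = by_shard["shard-b"][0] / by_shard["shard-b"][1]
# overall looks like "four nines"; shard-b users see a total outage
```

Which is exactly Majors' point: the aggregate says the system is fine, while ten real users – every one who touched shard-b – are 100% down.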
According to Majors, “Without observability, you are really driving blind. You are careering down the freeway without your glasses on.” Observability should help you see what’s happening under the hood – to the point where you can debug your code, reconstruct any user’s experience, and even understand new scenarios without shipping new code. It should help you move beyond “Is this broken?” to “How does this work and what is my user experiencing?” The better you get at that, the less you will experience breakages.
But how do you achieve this unicorn-like level of observability with games or other complex, highly distributed systems that are:
- Deployed across multiple clouds, data centers and regions
- Designed and developed by a multitude of teams around the world
- Used across thousands of device types
- Prone to enormous concurrency issues and “thundering herds”
That’s something Majors has been tackling for years, leading to her latest adventure as “accidental startup founder” at Honeycomb.io. Watch her keynote for a clear path forward – sharing her hard-fought lessons learned so you can deliver a user experience that’s all unicorns and rainbows.
Bryan Cantrill: Sharpening the Ax: The Primacy of Toolmaking
Tools – things used to make other things – are at the very core of our humanity. But despite a charter to make things, engineers are too often discouraged from making their own tools. Time spent on tooling is viewed as a regrettable distraction rather than an essential endeavor. Bryan Cantrill, well-known software engineer and pundit, disagrees.
“Honestly, some of the most consequential developments in software history were in fact groups seeking to develop better tools for themselves.” Want examples and engaging stories? Cantrill never disappoints.
Cantrill’s trove of impactful toolmaking examples begins with Unix, circa 1969 (not surprising given that Cantrill’s own career took off at Sun Microsystems, which his 9-year-old daughter suspected was a microbrewery). The team behind Unix saw that AT&T Bell Laboratories really needed a better way to typeset their patent applications, and they originally spun Unix as a system for word processing. But it soon became apparent that Unix had quite a bit of utility beyond that. According to Cantrill, “The tool was becoming more valuable than the artifacts it was creating. It was an early dividend from investing in tooling.”
The next – and crisper – example was born when Don Knuth was updating his book, “The Art of Computer Programming.” Typesetting went digital between the first and second editions, and Knuth despised the layout that ensued. So, Knuth took it upon himself to build the tool needed to typeset his own book: TeX. Per Bryan, “TeX and LaTeX became hugely important for writing technical documents around the globe. We should feel lucky that a computer scientist had this problem earlier and had this zeitgeist, this ethos, around writing a tool.”
Git’s predecessor NSELite, Cantrill’s baby DTrace (which greatly accelerated the development of later systems like ZFS), Jenkins (née Hudson), and even Rust were all the products of someone trying to complete his or her own job faster and more efficiently. And even if the long-term dividends were unanticipated by the toolmaker, Cantrill’s colorful stories and takeaways are certainly easy to appreciate by any modern-day tech professional.
Avi Kivity: A Window Into Database Operations From The Disk’s Perspective
Understanding what goes on inside a fast NVMe SSD is key for maximizing its performance, especially with IO-heavy workloads. But how do you peer into this traditionally black box?
That’s the challenge ScyllaDB CTO and co-founder Avi Kivity recently took on. Specifically, he wanted a new level of insight into what ScyllaDB, a NoSQL database with a unique close-to-the-hardware architecture, looked like from the perspective of the NVMe-based Nitro SSDs featured in new AWS EC2 I4i instances.
Kivity set out to visualize this and shared his approach with the mesmerized P99 CONF audience. After explaining how he created a window into the disk using open source Python and Linux tools, Kivity narrates and interprets a real-time visualization of IO: a dynamic display of color-coded reads and writes, random IO, sequential IO and the disk bandwidth consumed by each. The visualization includes:
- How ScyllaDB’s shard-per-core architecture is reflected in disk behavior
- How sequential SSTable writes differ from commitlog writes
- How compaction ramps up and down in response to read and write workloads
- How the database works with the disk to avoid lock contention
Next, Kivity shifts to a different perspective: how the same database operations manifest themselves in monitoring dashboards. Metric by metric, he sheds light on the database internals behind each graph’s interesting trends.
Watch Kivity’s talk to experience this fascinating journey in system observability – and learn how you can gain this level of insight into your own IO.
But Wait, There’s More! 50+ Tech Talks Across Core Low-Latency Themes
Beyond the keynotes, the conference featured a deep lineup of sessions that could keep any latency-minded engineer binge-watching for hours. Here’s the tip of the iceberg:
Nobl9’s Alex Hidalgo explained why you might encounter problems if you only think about “nines” when setting service reliability targets – and explored better numbers to use instead. [Read about it: When 99% Service Level Objectives Are Overrated (and Too Expensive)]
Optimizely’s Brian Taylor covered how they applied lessons of Gil Tene’s “oh sh*t” talk to understand the surprising sources of latency found in their data pipeline.
Square Engineering’s Omar Elgabry shared how they reduced DynamoDB latencies from more than 10 seconds to around 500 milliseconds after employing three techniques: timeouts, backoff and jitter.
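The three techniques Elgabry names combine naturally: cap each attempt with a timeout, grow the wait between retries exponentially, and randomize that wait so retrying clients don't stampede in lockstep. A minimal sketch of the backoff-with-jitter part (parameter values are hypothetical, not Square's):

```python
# Minimal sketch of exponential backoff with "full jitter": the wait
# before each retry grows exponentially (capped), then is randomized
# so retrying clients spread out instead of retrying in lockstep.
import random

def backoff_delays(attempts, base=0.1, cap=5.0):
    """Yield the randomized sleep (seconds) before each retry."""
    for attempt in range(attempts):
        exp = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield random.uniform(0, exp)           # full jitter

delays = list(backoff_delays(5))
```

In practice each attempt would also run under a timeout, so a hung call fails fast and enters this retry loop instead of holding a request thread open.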
Lyft’s Pavlo Stavytsky revealed how they measure CPU load to improve app performance – what metrics they collect, plus how they retrieve and calculate them.
Uber’s Cristian Velazquez detailed how they reduced the cost of compute capacity by implementing a large-scale, semi-automated Go garbage collection tuning mechanism.
Microsoft’s Daniele Salvatore Albano presented cachegrand, a SIMD-accelerated hashtable without locks or busy-wait loops using fibers, io_uring and much more. [Read about it: Cachegrand, a Fast, Scalable Keystore for Data-Oriented Development]
Oracle’s Ron Pressler looked at how Java virtual threads allow higher throughput for the thread-per-request style by drastically increasing the request capacity of the server.
AWS’ Marc Richards used an HTTP benchmark to compare the performance of the Linux kernel networking stack with userspace networking doing kernel bypass via DPDK.
Google’s Steve Rostedt shared his insider insight on using tracing to analyze the overhead from a Linux host running KVM.
Red Hat’s Bartłomiej Płotka explained how to find and uncover efficiency problems effectively using the power of modern cloud native observability and tooling.
ScyllaDB’s Piotr Sarna explained how he integrated WebAssembly and Wasmtime into a C++ project in a latency-friendly manner, including async Rust.
Percona’s Peter Zaitsev explored query optimizations such as parallel queries, asynchronous queries and queueing complex work to get the most out of your database.
Vercel’s Malte Ubl discussed the trade-offs of the new paradigm of rendering web pages in the edge and looked into end-to-end architectures that make it shine.
Akita’s Mark Gritter shed light on why his team went against the current trends, sticking with Golang instead of rewriting their application in Rust. [Read about it: Is a Rust Rewrite Really Worth It?]
Hudson River Trading’s Sabina Smajla demonstrated how to take advantage of programming languages’ asynchronous libraries with a few minor tweaks to existing code. [Read about it: 3 Types of Asynchronous Programming]