Editor’s Note: P99 CONF 2022 is now a wrap! You can (re)watch all videos and access the decks now.
P99 CONF 2022 is now just hours away. We hope you’ll be joining the community of over 10K performance-obsessed engineers for two deeply technical and highly engaging days. Expect 50+ talks about performance from the perspective of Rust, Go, Java, K8s, Wasm, Linux kernel, observability, event streaming, and of course databases (both SQL and NoSQL), lively virtual lounges, book bundle giveaways, and more – all for free.
The conference kicks off October 19 with Gil Tene’s take on “Misery Metrics & Consequences” and Liz Rice’s deep dive into performance networking with eBPF. From there, prepare for some difficult decisions as you forge your own path across three tracks of talks, including an Instant Access track that makes binge-watching all too easy.
To help you get into that low-latency mindset, here’s a sampler of the top sessions from P99 CONF 2021.
Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance
Marc Richards, Performance Engineer at Talawah Solutions
In this talk Marc walks you through the performance tuning steps that he took to serve 1.2M JSON requests per second from a 4 vCPU c5 instance, using a simple API server written in C. At the start of the journey, the server was capable of a very respectable 224k req/s with the default configuration. Along the way, he made extensive use of tools like FlameGraph and bpftrace to measure, analyze, and optimize the entire stack, from the application framework, to the network driver, all the way down to the kernel. Marc began this wild adventure without any prior low-level performance optimization experience, but once he started going down the performance tuning rabbit hole, there was no turning back. Fueled by his curiosity, willingness to learn, and relentless persistence, he was able to boost performance by over 400% and reduce p99 latency by almost 80%.
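As a quick check on the headline numbers above, the throughput gain from 224k to 1.2M req/s can be worked out directly. The short Python sketch below does only that arithmetic. The function name is illustrative and does not come from the talk.

```python
def percent_increase(before: float, after: float) -> float:
    """Return the relative gain from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


baseline_rps = 224_000   # default configuration, per the abstract
tuned_rps = 1_200_000    # after tuning, per the abstract

gain = percent_increase(baseline_rps, tuned_rps)
print(f"throughput gain: {gain:.0f}%")
```

The result is roughly 436%, which is consistent with the "over 400%" figure quoted in the abstract.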
Whoops! I Rewrote It in Rust
Brian Martin, Software Engineer at Twitter
Pelikan is Twitter’s open source and modular framework for in-memory caching, allowing them to replace Memcached and Redis forks with a single codebase and achieve better performance. Twitter operates hundreds of cache clusters storing hundreds of terabytes of small objects in memory. In-memory caching is critical and demands performance, reliability, and efficiency. In this talk, Brian shares his adventures in working on Pelikan and rewriting it from C to Rust.
Rust, Wright’s Law, and the Future of Low-Latency Systems
Bryan Cantrill, CTO of Oxide Computer Company
The coming decade will see two important changes with profound ramifications for low-latency systems: the rise of Rust-based systems, and the ceding of Moore’s Law to Wright’s Law. In this talk, Bryan discusses these two trends, and (especially) their confluence – and explains why he believes that the future of low-latency systems will include Rust programs in some surprising places.
I/O Rings and You — Optimizing I/O on Windows
Yarden Shafir, Software Engineer at Crowdstrike
Windows recently followed the same path as Linux and implemented I/O rings – a way to queue multiple I/O operations at a time instead of one by one. This change is expected to have a major impact on the performance and efficiency of high-I/O applications, thus keeping Windows servers competitive. This talk presents this new feature and its implementation. Yarden demonstrates how it should be used and discusses potential future additions to it to further improve the handling of I/O by complex systems.
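To make the "queue many, submit once" idea concrete, here is a minimal Python sketch of a toy ring. It is a conceptual model only: it does not use or mirror the Windows I/O ring API, and `ToyIoRing`, `ReadRequest`, and the in-memory "files" are all hypothetical names invented for illustration. The `submit_calls` counter stands in for the expensive user/kernel transitions that batching is meant to reduce.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ReadRequest:
    buffer_id: str   # which in-memory "file" to read (hypothetical)
    offset: int
    length: int
    user_data: int   # caller-chosen tag used to match completions to requests


class ToyIoRing:
    """Toy model: queue requests cheaply, then hand over the whole batch at once."""

    def __init__(self, files: dict[str, bytes]):
        self.files = files
        self.submission_queue = deque()
        self.completion_queue = deque()
        self.submit_calls = 0  # stands in for user/kernel transitions

    def queue_read(self, request: ReadRequest) -> None:
        # Queuing only appends to the ring; no I/O work happens yet.
        self.submission_queue.append(request)

    def submit(self) -> int:
        # A single call drains every queued request and posts its completion.
        self.submit_calls += 1
        processed = 0
        while self.submission_queue:
            req = self.submission_queue.popleft()
            data = self.files[req.buffer_id][req.offset:req.offset + req.length]
            self.completion_queue.append((req.user_data, data))
            processed += 1
        return processed


ring = ToyIoRing({"log": b"0123456789abcdef"})
for i in range(4):
    ring.queue_read(ReadRequest("log", offset=i * 4, length=4, user_data=i))

processed = ring.submit()
completions = list(ring.completion_queue)
print(processed, ring.submit_calls, completions)
```

Four reads are queued, but only one `submit()` call is made. In a one-by-one model, each read would have cost its own call.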
Keeping Latency Low and Throughput High with Application-level Priority Management
Avi Kivity, CTO and Co-Founder at ScyllaDB
Throughput and latency are in constant tension. ScyllaDB CTO and co-founder Avi Kivity shows how high throughput and low latency can both be achieved in a single application by using application-level priority scheduling.
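One common way to approach application-level priority scheduling is to give each class of work a number of shares and always run the class that has consumed the least time relative to its shares. The Python sketch below illustrates that general idea. It is not ScyllaDB's or Seastar's implementation, and the class name, the task class names, and the share values are assumptions chosen for the example.

```python
from collections import Counter, deque


class ShareScheduler:
    """Toy cooperative scheduler: run time is divided in proportion to shares."""

    def __init__(self):
        self.queues = {}     # class name -> deque of (task, cost)
        self.shares = {}     # class name -> relative weight
        self.vruntime = {}   # class name -> consumed cost divided by shares

    def add_class(self, name, shares):
        self.queues[name] = deque()
        self.shares[name] = shares
        self.vruntime[name] = 0.0

    def submit(self, name, task, cost=1.0):
        self.queues[name].append((task, cost))

    def run_one(self):
        # Only classes with pending work compete for the next slot.
        ready = [name for name, queue in self.queues.items() if queue]
        if not ready:
            return None
        # Pick the class that is furthest behind relative to its shares.
        name = min(ready, key=lambda n: self.vruntime[n])
        task, cost = self.queues[name].popleft()
        task()
        self.vruntime[name] += cost / self.shares[name]
        return name


scheduler = ShareScheduler()
scheduler.add_class("queries", shares=800)      # latency-sensitive work
scheduler.add_class("compaction", shares=200)   # background throughput work
for _ in range(100):
    scheduler.submit("queries", lambda: None)
    scheduler.submit("compaction", lambda: None)

counts = Counter(scheduler.run_one() for _ in range(50))
print(counts)
```

With both queues backlogged, roughly four out of every five slots go to the latency-sensitive class, while background work still makes steady progress instead of being starved.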
Crimson: Ceph for the Age of NVMe and Persistent Memory
Orit Wasserman, Architect at Red Hat
Ceph is a mature open source software-defined storage solution that was created over a decade ago. During that time, new faster storage technologies have emerged, including NVMe and persistent memory. The Crimson project aims to create a better Ceph OSD that is better suited to those faster devices. The Crimson OSD is built on the Seastar C++ framework and can leverage these devices by minimizing latency, CPU overhead, and cross-core communication. This talk discusses the project’s design, current status, and future plans.
Rust Is Safe. But Is It Fast?
Glauber Costa, Staff Engineer at DataDog
Rust promises developers the execution speed of non-managed languages like C++, with the safety guarantees of managed languages like Go. Its fast rise in popularity shows this promise has been largely upheld. However, the situation is a bit muddier for the newer asynchronous extensions. This talk explores some of the pitfalls that users may face while developing asynchronous Rust applications, pitfalls that have direct consequences for their ability to hit that sweet low p99. Glauber shares how the Glommio asynchronous executor tries to deal with some of those problems, and what the future holds.
Continuous Performance Regression Testing with JfrUnit
Gunnar Morling, Principal Software Engineer at Red Hat
Functional unit and integration tests are a common practice to detect and prevent regressions within a software component or application’s behavior. Things look different, though, when it comes to performance-related aspects: How do you identify that an application is slower than it used to be? How do you spot higher memory consumption than before? How do you find out about sub-optimal SQL queries that sneaked in? Performance tests based on metrics like wall-clock time or throughput are not portable. They depend on a specific execution environment such as a developer laptop, CI, or production-like environment. Welcome JfrUnit: based on the JDK Flight Recorder (JFR), it allows you to implement assertions based on all kinds of JFR events emitted by the JVM or your application. JfrUnit makes it very easy to identify potential performance issues by asserting metrics that may impact your application’s performance, like an increased object allocation rate, retrieval of redundant data from the database, loading of unneeded classes, and much more. This code-centric session explores:
- Using JDK Flight Recorder and JfrUnit for implementing performance regression tests
- Emitting JFR events from 3rd party libraries using JMC Agent
- Analyzing performance regressions in JDK Mission Control
New Ways to Find Latency in Linux Using Tracing
Steven Rostedt, Software Engineer at Google
Ftrace is the official tracer of the Linux kernel. It originated from the real-time patch (now known as PREEMPT_RT), as developing an operating system for real-time use requires deep insight into, and transparency of, what happens in the kernel. Not only was tracing useful for debugging, but it was critical for finding areas in the kernel that were causing unbounded latency. It’s no wonder that the ftrace infrastructure has a lot of tooling for seeking out latency. Ftrace was introduced into mainline Linux in 2008, and several talks have been given on how to utilize its tracing features. But a lot has happened in the past few years that makes the tooling for finding latency much simpler. Other talks at P99 discussed the new ftrace tracers “osnoise” and “timerlat”, but this talk focuses more on the new flexible and dynamic aspects of ftrace that facilitate finding latency issues which are more specific to your needs.