lines masked in circle
purple circle at half opacity

Low-Latency Distributed Data Strategies at P99 CONF: SQL, NoSQL & Event Streaming

grid of red dots
white dots in concentric circle

Share This Post

Editor’s Note: P99 CONF 2022 is now a wrap! You can (re)watch all videos and access the decks now.


P99 CONF is a highly-technical open source industry conference for engineers who obsess over P99 percentiles and long-tail latencies. It’s hosted by ScyllaDB. And since database performance is near and dear to us, we quite eagerly reached out to our friends and colleagues across the community to ensure a wide spectrum of distributed data systems, approaches, and challenges would be represented at P99 CONF. SQL and NoSQL, in-memory and persistent stores, transactional workloads and analytical data lakehouses — we’ll be covering all this and more.

If you share our obsession with high-performance low-latency data systems, here’s a rundown of sessions to consider joining at P99 CONF 2022.

Note: Speakers are listed in reverse alphabetical order


A Deep Dive into Query Performance

Peter Zaitsev, Percona

If you look at data store as just another service, the things applications care about are successfully establishing a connection and getting results to the queries promptly and with correct results.

In this presentation, we will explore this seemingly simple aspect of working with a database in detail. We will talk about why you want to go beyond the averages, and how to group queries together in a meaningful way so you’re not overwhelmed with amount of details but find the right queries to focus on.

We will answer the question on when you should focus on tuning specific queries or when it is better to focus on tuning the database (or just getting a bigger box).

We will also look at other ways to minimize user facing response time, such as parallel queries, asynchronous queries, queueing complex work, as well as often misunderstood response time killers such as overloaded network, stolen CPU, and even limits imposed by this pesky speed of light.


Testing Persistent Storage Performance in Kubernetes with Sherlock

Sagy Volkov, Lightbits Labs

Getting to understand your Kubernetes storage capabilities is important in order to run a proper cluster in production. In this session I will demonstrate how to use Sherlock, an open source platform written to test persistent NVMe/TCP storage in Kubernetes, either via synthetic workload or via variety of databases, all easily done and summarized to give you an estimate of what your IOPS, Latency and Throughput your storage can provide to the Kubernetes cluster.


Aggregator Leaf Tailer: Bringing Data to Your Users with Ultra Low Latency

Jeffery Utter, theScore

theScore is one of the leading sports news and media platforms in North America. With the recent growth of sports betting, theScore is diving in head-first. Province and state regulations often enforce strict data-locality requirements often prohibiting the use of centralized hosted data storage solutions.

In this talk, we’ll learn how theScore built Datadex, a geographically distributed system for low-latency queries and realtime updates. We’ll touch on the architecture of the system (Aggregator Leaf Tailer), the underlying technologies (RocksDB, Kafka and the JVM), as well as some nitty-gritty tips for optimizing the JVM and RocksDB for high-throughput.


Properly Understanding Latency is Hard — What We Learned When We Did it Correctly

Brian Taylor, Optimizely

We applied Gil Tene’s “oh shit” talk on understanding latency to our data pipeline. This talk will describe the measurement challenges we faced, what we learned, and how we improved with those learnings. This was vital for our capacity planning to scale up our data pipeline multiple times. The story that a latency spectrum told us allowed us to identify surprising sources of latency.


How We Reduced Performance Tuning Time by Orders of Magnitude with Database Observability

Yuying Song, PingCAP

Doing performance tuning on a massively distributed database is never an easy task. This is especially true for TiDB, an open-source, cloud-native NewSQL database for elastic scale and real-time analytics, because it consists of multiple components and each component has plenty of metrics.

Like many distributed systems, TiDB uses Prometheus to store the monitoring and performance metrics and Grafana to visualize these metrics. Thanks to these two open source projects, it is easy for TiDB developers to add monitoring and performance metrics. However, as the metrics increase, the learning curve becomes steeper for TiDB users to gain performance insights. In this talk, we will share how we measure latency in a distributed system using a top-down (holistic) approach, and why we introduced “tuning by database time” and “tuning by color” into TiDB. The new methodologies and Grafana dashboard help reduce the time and the requirement of expertise in performance tuning by orders of magnitude.


Keeping Latency Low for User-Defined Functions with WebAssembly

Piotr Sarna,  ScyllaDB

WebAssembly (WASM) is a great choice for user-defined functions, due to the fact that it was designed to be easily embeddable, with a focus on security and speed. Still, executing functions provided by users should not cause latency spikes – it’s important for individual database clusters, and absolutely crucial for multi-tenancy. In order to keep latency low, one can utilize a WebAssembly runtime with async support. One such runtime is Wasmtime, a Rust project perfectly capable of running WebAssembly functions cooperatively and asynchronously. This talk briefly describes WebAssembly and Wasmtime, and shows how to integrate them into a C++ project in a latency-friendly manner, while implementing the core runtime for user-defined functions in async Rust.


Overcoming Variable Payloads to Optimize for Performance

Armin Ronacher, Sentry

When you have a significant amount of events coming in from individual customers, but do not want to spend the majority of your time on latency issues, how do you optimize for performance? This becomes increasingly difficult when you are dealing with payload sizes that are multiple orders of magnitude difference, have complex data that impacts processing, and the stream of data is impossible to predict. In this session, you’ll hear from Armin Ronacher, Principal Architect at Sentry and creator of the Flask web framework for Python on how to build ingestion and processing pipelines to accommodate for complex events, helping to ensure your teams are reaching a throughput of hundreds of thousands of events per second.


Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry

Antón Rodríguez, New Relic

While there are numerous benefits to Event-Driven Architecture, like improved productivity, flexibility, and scalability, they also pose a few disadvantages, such as the complexity of measuring end-to-end latency and identifying bottlenecks in specific services.

This talk shows you how to produce telemetry from your services using an open standard to retain control of data. OpenTelemetry allows you to instrument your application code through vendor-neutral APIs, libraries, and tools. It provides the tools necessary for you to gain visibility into the performance of your services and overall latency.

Anton will share his experience building high-throughput services and strategies to use distributed tracing in an optimal way and without affecting the overall performance of the services.


High-Speed Database Throughput using Apache Arrow Flight SQL

Kyle Porter, Bit Quill Technologies

Flight SQL is a revolutionary new open database protocol designed for modern architectures. Key features in Flight SQL include a columnar-oriented design and native support for parallel processing of data partitions. This talk will go over how these new features can push SQL query throughput beyond existing standards such as ODBC.


End-To-End Performance Testing, Profiling, and Analysis at Redis

Filipe Oliveira RedisLabs

High-performance (as measured by sub-millisecond response time for queries) is a key characteristic of the Redis database, and it is one of the main reasons why Redis is the most popular key-value database in the world.

In order to continue improving performance across all of the different Redis components, we’ve developed a framework for automatically triggering performance tests, telemetry gathering, profiling, and data visualization upon code commit.

In this talk, we describe how this type of automation and “zero-touch” profiling scaled our ability to pursue performance regressions and to find opportunities to improve the efficiency of our code, helping us (as a company) to start shifting from a reactive to a more proactive performance mindset.


Apache Iceberg: An Architectural Look Under the Covers

Alex Merced, Dremio

Data Lakes have been built with a desire to democratize data – to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems, but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.

Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.

Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.

You will learn:

  • The issues that arise when using the Hive table format at scale, and why we need a new table format
  • How a straightforward, elegant change in table format structure has enormous positive effects
  • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
  • The resulting benefits of this architectural design

Dor Laor

P99 Pursuit

Dor Laor, ScyllaDB

Performance engineering is a Sisyphean hill climb for perfection. Those who climb the hill are hardly ever satisfied with the results. You should always ask yourself where the bottleneck is today and what’s holding you back. Great performance improves your software. It enables you to run fewer layers, manage 10x less machines, simplifies your stack, and more. In this keynote session, ScyllaDB CEO Dor Laor will cover the principles for successful creation of projects like ScyllaDB, KVM, the Linux kernel and explain why they spurred his vision for the P99 CONF.


Portrait of Avi Kivity

How a Database Looks from a Disk’s Perspective

Avi Kivity, ScyllaDB

In a world of cloud infrastructure, serverless, and layers of virtualization, software architects can be tempted to abstract away hardware infrastructure considerations when they are designing their applications. Yet when you are designing systems that need to operate at scale, such presumptions and lapses in curiosity can lead to intrinsic systemic inefficiencies that result in millions of dollars of costs that just do not need to be incurred. In this talk Avi Kivity, CTO of ScyllaDB, will dive into how high performance distributed systems such as modern databases can make best, most efficient use of fast NVMe storage.


Implementing Highly Performant Distributed Aggregates

Michał Jadwiszczak, ScyllaDB

Designing how to implement aggregates in a distributed database is a non-trivial task. When dealing with aggregates that will be polling the entire cluster, it is important to consider performance impacts. If done poorly, full table scans can bring production systems to their knees. So how can you implement aggregate functions without hammering real-time availability and performance for other read/write operations? Learn how distributed aggregates were implemented in ScyllaDB to balance performance across large NoSQL distributed database clusters.


Building Efficient Multi-Threaded Filters for Faster SQL Queries

Vlad Ilyushchenko,  QuestDB

QuestDB is an open-source time series database written in Java and C++.Until recently, QuestDB’s SQL engine was using a single thread to execute full-scan queries based on a JIT-compiled filter function. While it was quite fast, we knew that our storage model leaves a lot of room for optimizations, such as parallel filter execution. In this talk, we share how we built efficient zero-GC multi-threaded query processing. Be prepared for performance gains that exceed the number of cores (sic!). We aim for practicality, so we are going to speak of important low-level implementation details, our approach to concurrent code testing, as well as our pain points.


How Dashtable Helps Dragonfly Maintain Low Latency

Roman Gershman, DragonflyDB

Dashtable is a hashtable implementation inside Dragonfly. It supports incremental resizes and fast, cache-friendly operations. In this talk, we will learn how Dashtable helps Dragonfly to keep its tail latency in check. In Dashtable, long-tail latencies have been reduced by a factor of 1000x, but P999 are 7x longer. Find out why we still think this is a good tradeoff.


Removing Implicit Deadlocks on a Thread-Per-Core Architecture with 2-Phase Processing

Alex Gallego, Redpanda

An illustrated discussion of this topic.


Pavel EmelyanovNew IO Scheduler Algorithm for Mixed Read/Write Workloads

Pavel Emelyanov, ScyllaDB

Modern SSDs are believed to be random IO friendly devices, unlike HDDs. ScyllaDB’s low-level library called Seastar is believed to exploit share-nothing design to achieve peak performance on any hardware. IO scheduling in ScyllaDB is where both beliefs meet together … and are both refuted.


Square Engineering’s “Fail Fast, Retry Soon” Performance Optimization Technique

Omar Elgabry, Square

Optimizing for performance and reducing latency is a hard problem. Examples could be: choosing a different algorithm and data structures, improving SQL queries, adding a cache, serving requests asynchronously, or some low-level optimization that requires a deep understanding of the OS, kernel, compiler, or the network stack. The engineering effort is usually nontrivial, and only if you’re lucky, you’ll see some tangible results.

That being said, there are some performance optimization techniques, with a few lines of code — even exist in the built-in library — it can lead to noticeable surprising results. One of these techniques is to “fail fast, retry soon”. These techniques are often neglected or taken for granted.

In distributed systems, a service or a database consists of a fleet of nodes that functions as one unit. It is not uncommon for some nodes to go down, usually, for a short time. When this occurs, failures can happen on the client-side and can lead to an outage. To build resilient systems, and reduce the probability of failure, we’re going to explore these topics: timeouts, backoff, and jitter. We’ll talk about timeouts, what timeout to set, pitfalls of retries, how backoff improves resource utilization, and jitters reduce congestion. Furthermore, we’re going to see an adaptive mechanism to dynamically adjust these configurations.

This is inspired by a real-production use case where DynamoDB latency p99 & max went down from > 10s to ~500ms after employing these three techniques: timeouts, backoff, and jitter.


C# as a System Language

Oren Eini, RavenDB

When you think about C#, you’ll usually think about a high-level language, one that is utilized to build websites, APIs, and desktop applications. However, from its inception, C# had the foundation to be used as a system language, with facilities that allow you direct memory access and fine-grained control over memory and execution.

In the last five years, there has been a huge emphasis on making C# a more capable language for system development. Oren Eini, the founder of RavenDB, has used C# as the base language to build a distributed document database for over a decade.

In this talk, Oren will discuss the features that make C# a viable system language for building high-end systems. Learn how you can mix and match, in a single project, both high-level concepts and intimate control over every single thing that is happening in your system.


Retaining Goodput with Query Rate Limiting

Piotr Dulikowski, ScyllaDB

Distributed systems are usually optimized with particular workloads in mind. At the same time, the system should still behave in a sane way when the assumptions about workload do not hold – notably, one user shouldn’t be able to ruin the whole system’s performance. Buggy parts of the system can be a source of the overload as well, so it is worth considering overload protection on a per-component basis. For example, ScyllaDB’s shared-nothing architecture gives it great scalability, but at the same time makes it prone to a “hot partition” problem: a single partition accessed with disproportionate frequency can ruin performance for other requests handled by the same shards. This talk will describe how we implemented rate limiting on a per-partition basis which reduces the performance impact in such a case, and how we reduced the CPU cost of handling failed requests such as timeouts (spoiler: it’s about C++ exceptions).


Outrageous Performance: RageDB’s Experience with the Seastar Framework

Max De Marzi Jr.,  RageDB

Learn how RageDB leveraged the Seastar framework to build an outrageously fast graph database. Understand the right way to embrace the triple digit multi-core future by scaling up and not out. Sacrifice everything for speed and get out of the way of your users. No drivers, no custom protocols, no query languages, no GraphQL, just code in and JSON out. Exploit the built in Seastar HTTP server to tie it all together.


Three Perspectives on Measuring Latency

Geoffrey Beausire, Criteo

Latency is one of the most common Service Level Indicators (SLI), but where should it be measured from?

There are three main ways to measure latency:

  • Server side latency: Precise and high cardinality but missing the big picture
  • Client side latency: Big picture but noisy
  • Blackbox monitoring latency: Good trade-off between the other two

In this talk, we will dive deeper into each perspective and how all of them can be leveraged. We will use Criteo’s large scale key/value infrastructure as a case study.


More To Explore

Intel CPU Die Topology

Jason Rahman looks at a pair of Intel CPUs and observes a few key details on their on-die interconnect topology and impact on cross-core communication patterns.

P99 CONF OCT. 23 + 24, 2024

Register for Your Free Ticket