
The Observability and eBPF Buzz at P99 CONF 2022



Editor’s Note: P99 CONF 2022 is now a wrap! You can (re)watch all videos and access the decks now.


P99 CONF has expanded in a number of ways since its debut last year. The community size, measured in terms of event registrants, more than doubled. We’ve also extended the length of each day to suit more time zones, and we’re offering 3 parallel tracks – including one with on-demand “instant access” content. This enabled us to expand our focus to additional performance-critical topics, like observability and eBPF, all while offering the community free access to the insights of luminaries like Liz Rice, Charity Majors, and Alex Hidalgo.

For now, let’s let the speakers “speak” for themselves. Here’s a look at what they’ll be sharing, in their own words.

Charity Majors, Honeycomb

Performance Tuning with Precision

When you hear people talk about performance tuning, they usually mean the big stuff. Your largest customers, your busiest write locks; heavy workloads that strain the throughput or burst capacity of your network devices, message buses, or an entire fleet of CPUs.

But large workloads in aggregate aren’t the only things that matter. Every platform actually consists of an infinitely long, thin tail of performance issues as uniquely experienced by every application, user, and build ID (or combination thereof). When we lack the tooling to pick apart these workloads, we have to resort to performance tuning in gross aggregates; but when we can take a scalpel to them, our options for optimization multiply.

In this talk, I will demonstrate how to emit and gather telemetry data for observability and analysis, and showcase several real-world examples that illustrate the transformative power of debugging with wide events and high-cardinality, high-dimensionality data.
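The “wide events” idea above can be sketched in a few lines: emit one richly annotated, structured event per unit of work, including high-cardinality fields you might later want to slice by. This is a minimal illustration, not Honeycomb’s actual SDK; the field names are hypothetical.

```python
import json
import time

def handle_request(user_id, build_id, endpoint):
    """Toy request handler that emits one wide event per request.

    Field names (user_id, build_id, etc.) are illustrative, not a fixed
    schema -- the point is to attach every dimension you might later
    want to slice by, even unique-per-user or unique-per-build ones.
    """
    start = time.monotonic()
    status = 200  # pretend we did the work here
    event = {
        "timestamp": time.time(),
        "duration_ms": (time.monotonic() - start) * 1000,
        "endpoint": endpoint,
        "status": status,
        # High-cardinality fields: near-unique per user / build / request.
        "user_id": user_id,
        "build_id": build_id,
    }
    return json.dumps(event)

print(handle_request("user-8675309", "build-42", "/api/export"))
```

With events shaped like this, “which build ID regressed latency for which customer?” becomes a group-by query rather than a guessing game.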

Liz Rice, Isovalent

Using eBPF for High-Performance Networking in Cilium

The Cilium project is a popular networking solution for Kubernetes, based on eBPF. This talk uses eBPF code and demos to explore the basics of how Cilium makes network connections, and manipulates packets so that they can avoid traversing the kernel’s built-in networking stack. You’ll see how eBPF enables high-performance networking as well as deep network observability and security.

Alex Hidalgo, Nobl9

Throw Away Your Nines

People want their computer services to operate well, and consequently they often set targets for how reliable they hope to be. In addition to target percentages, percentiles are often used to make it easier to examine the data for most user experiences, while ignoring the outliers. But problems often arise because of a tendency to only think about “nines”. All services and user experiences are unique, and it is all too common to ignore important data or stress your systems and humans by trying to be more performant than you need to be. Throw away your nines, and let’s find some better numbers to use instead.
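To make the “nines” discussion concrete, here is a quick worked calculation of how much downtime a given availability target actually permits. The targets below are examples only; the point of the talk is that the right number for your service may not be a round string of nines at all.

```python
def allowed_downtime_minutes(target_pct, window_days=30):
    """Minutes of downtime a reliability target permits per window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target_pct / 100)

# Example targets -- note how each extra nine shrinks the budget 10x.
for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):7.1f} min / 30 days")
```

A 99.9% target leaves roughly 43 minutes per month; 99.99% leaves about 4. If your users are perfectly happy at 99.5%, chasing the extra nines stresses systems and humans for no perceptible benefit.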

Dmitrii Dolgov, Red Hat

Performance Insights Into eBPF, Step by Step

Having full visibility throughout the system is a well-established best practice. Usually one knows which metrics to collect and what to profile or instrument to understand why the system exhibits a given level of performance. All of this becomes more challenging as soon as an eBPF layer is involved.

In this talk, I will shed some light on those bits of your service that use eBPF, step by step, covering topics such as:

  • How to collect execution metrics of eBPF programs?
  • How can we profile these eBPF programs?
  • What are the common pitfalls to avoid?
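One answer to the first question: with kernel BPF statistics enabled (`sysctl kernel.bpf_stats_enabled=1`), `bpftool prog show -j` reports cumulative `run_time_ns` and `run_cnt` per loaded program. The sketch below parses output of that shape to get average per-invocation runtime; the sample data is made up, and on a real host you would feed in the actual `bpftool` JSON.

```python
import json

# Sample data in the shape of `bpftool prog show -j` output with
# kernel.bpf_stats_enabled=1 (program names and values are made up).
SAMPLE = json.dumps([
    {"id": 17, "name": "xdp_filter", "run_time_ns": 5_200_000, "run_cnt": 40_000},
    {"id": 23, "name": "trace_open", "run_time_ns": 900_000, "run_cnt": 1_500},
])

def avg_runtime_ns(bpftool_json):
    """Average per-invocation runtime for each loaded eBPF program."""
    out = {}
    for prog in json.loads(bpftool_json):
        if prog.get("run_cnt"):  # skip programs that never ran
            out[prog["name"]] = prog["run_time_ns"] / prog["run_cnt"]
    return out

print(avg_runtime_ns(SAMPLE))
```

A program averaging hundreds of nanoseconds per run on a hot path (say, per packet) is exactly the kind of overhead this accounting makes visible.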

The talk will provide the attendees with an approach to analyze and reason about eBPF programs’ performance. This experience was drawn from building Red Hat Advanced Cluster Security (RHACS) but is applicable to other domains.

Yuying Song, PingCAP

How We Reduced Performance Tuning Time by Orders of Magnitude with Database Observability

Doing performance tuning on a massively distributed database is never an easy task. This is especially true for TiDB, an open-source, cloud-native NewSQL database for elastic scale and real-time analytics, because it consists of multiple components and each component has plenty of metrics.

Like many distributed systems, TiDB uses Prometheus to store the monitoring and performance metrics and Grafana to visualize these metrics. Thanks to these two open source projects, it is easy for TiDB developers to add monitoring and performance metrics. However, as the metrics increase, the learning curve becomes steeper for TiDB users to gain performance insights. In this talk, I will share how we measure latency in a distributed system using a top-down (holistic) approach, and why we introduced “tuning by database time” and “tuning by color” into TiDB. The new methodologies and Grafana dashboard help reduce the time and the requirement of expertise in performance tuning by orders of magnitude.
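The core of a “tuning by database time” approach can be sketched simply: instead of eyeballing hundreds of per-component metrics, aggregate the time each component contributes and rank by share of the total. The component names and timings below are purely illustrative, not TiDB’s actual metric schema.

```python
def database_time_breakdown(component_seconds):
    """Rank components by their share of total database time.

    component_seconds: {component: total seconds spent} over a window,
    e.g. aggregated from Prometheus metrics. The top entries are where
    tuning effort pays off first.
    """
    total = sum(component_seconds.values())
    shares = {c: s / total for c, s in component_seconds.items()}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-component time over a 1-minute window.
sample = {"parse": 2.0, "optimize": 3.0, "kv_read": 30.0,
          "kv_write": 10.0, "commit": 5.0}
for component, share in database_time_breakdown(sample):
    print(f"{component:>9}: {share:6.1%}")
```

The top-down framing means a user never has to know every metric: the breakdown itself says where to look next.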

Bartłomiej Płotka, Red Hat

Clouds are Not Free: Guide to Observability-Driven Efficiency Optimizations

Over the recent decade, cloud computing and cloud-native platforms emerged and promised lower costs, less effort, and more flexibility. Unfortunately, in practice, it is not so easy. Knowledge gaps, pitfalls, the complexity of cloud APIs, and misunderstandings of services and systems cause companies to lose millions of dollars every year. On top of that, the big data world demands more software and computing power every day.

Fortunately, there is a method to the madness! Whether you build your own software or use open-source or paid systems, there are many efficiency gains that will save tons of money. Sometimes it’s a code optimization, sometimes an algorithm adjustment, sometimes a system-level operation!

In this talk, I will explain how to notice and uncover efficiency problems effectively thanks to the power of modern cloud-native observability and tooling. The audience will learn pragmatic practices they can do to ensure effective and sustainable optimizations and how to avoid regressions in the future. All to increase the business operational margin and enable more opportunities with more efficient software and systems.

Roni Dover, Digma

The Observant Developer — OpenTelemetry from Code to Prod

Observability can be about more than pretty dashboards; it can be a powerful tool for designing better code and a part of the modern developer’s stack. Just like tests, metrics and traces can be leveraged to challenge code assumptions and develop for real-world requirements in an evidence-based manner.

Developer observability can help us write better code, improve the dev process and roll out more scalable and mature system capabilities. This session will explore practical ways in which OpenTelemetry combined with open-source tools such as Jaeger, Prometheus, and others can be integrated into the modern development stack.

This talk is relevant to any Developer or DevOps practitioner seeking to understand how to use current observability technologies effectively.

Matthew Lenhard, ContainIQ

Tracking Syscall and Function Latency in Your k8s Cluster with eBPF

During this talk, I will walk the audience through a real-life performance tuning exercise, where we hunt down slow system calls and functions. I’ll start by launching a k8s cluster with some known bottlenecks. I’ll then walk the audience through using the mountinfo directory in order to correlate pod names to their PIDs. Next, I will put together a series of eBPF-based scripts that can attach to these PIDs and print out the latency of function and system calls. Finally, I’ll walk through interpreting this information, as well as some of the performance issues related to running these tests in production.
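The mountinfo step hinges on the fact that on a kubelet node, cgroup mount paths in `/proc/<pid>/mountinfo` embed the pod UID. A minimal sketch of that extraction is below; the sample line and regex are illustrative, since the exact path layout varies with the container runtime and cgroup driver.

```python
import re

# One line in the shape of /proc/<pid>/mountinfo for a containerized
# process on a kubelet node (sample data; real paths vary by runtime).
SAMPLE_LINE = (
    "2116 1711 0:27 /kubepods/burstable/pod8f2b6a9e-1d2c-4e5f-9a7b-0c1d2e3f4a5b/"
    "a1b2c3 /sys/fs/cgroup/memory ro,nosuid - cgroup cgroup rw,memory"
)

# Matches the ".../kubepods/.../pod<uid>/..." segment of a cgroupfs path.
POD_UID_RE = re.compile(r"kubepods[^ ]*?/pod([0-9a-f-]+)")

def pod_uid_from_mountinfo(text):
    """Extract the Kubernetes pod UID embedded in a mountinfo entry."""
    m = POD_UID_RE.search(text)
    return m.group(1) if m else None

print(pod_uid_from_mountinfo(SAMPLE_LINE))
```

Once you can map a PID to a pod UID (and the UID to a pod name via the API server), per-PID eBPF latency data becomes per-pod latency data.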

Antón Rodríguez, New Relic

Measuring P99 Latency in Event-Driven Architectures with OpenTelemetry

While event-driven architectures offer numerous benefits, like improved productivity, flexibility, and scalability, they also pose a few disadvantages, such as the complexity of measuring end-to-end latency and identifying bottlenecks in specific services.

This talk shows you how to produce telemetry from your services using an open standard to retain control of data. OpenTelemetry allows you to instrument your application code through vendor-neutral APIs, libraries, and tools. It provides the tools necessary for you to gain visibility into the performance of your services and overall latency.

I will share my experience building high-throughput services and strategies to use distributed tracing in an optimal way and without affecting the overall performance of the services.
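The end-to-end measurement problem reduces to this: once every hop (producer, broker consume, processing) emits spans sharing a trace ID, the end-to-end latency of a trace is the extent from its earliest span start to its latest span end, and p99 is computed over those per-trace latencies. A minimal sketch, with made-up span data rather than a real OpenTelemetry exporter:

```python
import math

def end_to_end_latencies(spans):
    """Per-trace end-to-end latency from span timestamps.

    spans: iterable of (trace_id, start_ts, end_ts) covering every hop
    of the pipeline. Latency = last span end minus first span start.
    """
    bounds = {}
    for trace_id, start, end in spans:
        lo, hi = bounds.get(trace_id, (start, end))
        bounds[trace_id] = (min(lo, start), max(hi, end))
    return [hi - lo for lo, hi in bounds.values()]

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical spans: producer send and consumer processing per trace.
spans = [("t1", 0.000, 0.010), ("t1", 0.012, 0.030),
         ("t2", 0.000, 0.005), ("t2", 0.006, 0.009)]
print(p99(end_to_end_latencies(spans)))
```

Note that this only works if trace context is propagated through the message broker (e.g., in message headers), which is exactly what OpenTelemetry’s vendor-neutral propagation APIs provide.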

Henrik Rexed, Dynatrace

Why Kubernetes Freedom Requires Chaos Engineering to Shine in Production

Like any other technology transformation, k8s adoption typically starts with small “pet projects”: one k8s cluster here, another one over there. If you don’t pay attention, it spreads like wildfire, and you may end up like many organizations these days: hundreds or thousands of k8s clusters, owned by different teams, spread across on-premises and the cloud, some shared, some very isolated. When we start building applications for k8s, we often lose sight of the larger picture of where they will be deployed and, moreover, what the technical constraints of our target environment are. Sometimes we even think that k8s is a magician that will make all our hardware constraints disappear.

In reality, Kubernetes requires you to define quotas on nodes and namespaces, and resource limits on your pods, to make sure that your workload will be reliable. Under heavy pressure, k8s will evict pods to relieve pressure on your nodes, but eviction can have a significant impact on your end users.

How can we proactively test our settings and measure the impact of k8s events to our users? The simple answer to this question is Chaos Engineering.

During this presentation, I will use real production stories to explain:

  • The various Kubernetes settings that we could implement to avoid major production outages
  • How to define the chaos experiments that will help us to validate our settings
  • The importance of combining load testing and chaos engineering
  • The observability pillars that will help us validate our experiments

These are just a few of the many speakers you will find in our 2022 program.

Make sure you register today to be able to attend the full event, including access to the live Speaker’s Lounge during the conference.


