
Kerry Osborne on the Complexity of Solving Performance Problems


Google’s Database Black Belt leader Kerry Osborne explores the complexity of solving performance problems related to databases and IO systems

When Google’s Database Black Belt leader talks, people pay attention. Kerry Osborne’s P99 CONF 2023 talk (How to Improve Your Ability to Solve Complex Performance Problems) is still getting considerable attention, even months later.

Note:  P99 CONF is a technical conference on performance and low-latency engineering. It’s virtual, free, and highly interactive. This year’s agenda spans Rust, Zig, Go, C++, compute/infrastructure, Linux, Kubernetes, and of course, databases.


In a conference that typically features 15-20 minute TED Talk-like sessions, Kerry’s came in at just over 40 minutes. But even that wasn’t enough. Kerry has MUCH more expertise to share, and the P99 CONF community is eager to hear it. So, we’re pleased to announce that Kerry will return for P99 CONF 24 to deliver a follow-up talk.

In his own words:

“This is the second part of a talk I did at P99 CONF 2023. The first half focused on the basics of how our brains solve complex problems in the performance space along with the characteristics we need to have/develop in order to be good at it. And finally, I covered some of the basic approaches used, both from experience and from a research standpoint. This talk (Part 2) will focus on specific things we can do to get better at it – including an almost foolproof method to reach a successful outcome.”

Since we’re already ramping up for P99 CONF 24, we thought it was a good time to look back at some highlights from Kerry’s original talk. This blog is the first of three.

Here’s a lightly edited transcript from part of Kerry’s P99 CONF 23 talk.

***

Let’s talk about the problem characteristics of what we’re trying to solve in the performance space. They’re complex, right? There are almost always multiple solutions to any given problem.

If somebody comes to me and says, “Hey, my database is slow,” there’s a million different things that could cause it to be slow – or just cause the perception that it’s slow. There could also be a lot of different solutions that would speed that system up. If somebody has a more specific issue – like “My IO system is slow,” there are often many solutions that I could apply to make the IO system work better.

Those solutions vary in a number of different ways: benefits, cost, and risk.

Benefits

Most people focus on the benefits. For example, if the IO system is slow, I could try to improve the speed of the IO system by putting in solid state disks, or by trying to separate the large multi-block reads from the single block reads – so that my single block reads are not delayed by having to wait in a queue behind large multi-block reads that take longer.

There are a number of ways that I could go after that, and the benefit of each of those potential solutions could be different. The solid-state disk could speed it up 10x or 100x. Changing the queuing mechanism might speed it up 5x. Different solutions deliver different amounts of benefit. That makes the problem space complex.
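To make the queuing point concrete, here is a minimal Python sketch of single-block reads waiting behind multi-block reads. It is an illustration under stated assumptions, not a model of any real IO scheduler: the service times (1 ms and 20 ms) are made-up numbers, all requests are assumed to arrive at the same moment, and the dedicated queue is assumed to be served independently of the multi-block reads.

```python
# Hypothetical service times in milliseconds (illustrative, not measured).
SERVICE_MS = {"single": 1, "multi": 20}


def mean_single_block_latency(queue):
    """Serve requests in FIFO order and return the mean completion time
    (in ms) of the single-block reads, assuming all arrive at t=0."""
    clock = 0
    latencies = []
    for kind in queue:
        clock += SERVICE_MS[kind]
        if kind == "single":
            latencies.append(clock)
    return sum(latencies) / len(latencies)


# One shared queue: every single-block read sits behind a multi-block read.
shared_queue = ["multi", "single"] * 4

# Assumption: single-block reads get their own, independently served queue.
dedicated_queue = ["single"] * 4

shared_ms = mean_single_block_latency(shared_queue)
dedicated_ms = mean_single_block_latency(dedicated_queue)

print(f"shared queue:    {shared_ms} ms mean single-block latency")
print(f"dedicated queue: {dedicated_ms} ms mean single-block latency")
```

The size of the gap here comes entirely from the invented service times. The point is only that the single-block reads did not get any faster themselves; they simply stopped waiting behind slower requests.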

Costs (monetary, time to implement, opportunity cost)

Another thing that varies is the cost of implementing those solutions. It’s probably a lot more expensive to go buy a whole solid-state disk system than it is to work on how the IOs are queued or something else related to software. And there are multiple types of costs to consider: monetary costs, time to implement, and also the opportunity cost. What’s lost by taking longer to implement a solution can far outweigh the monetary cost. We have all these different spaces to keep in mind, just in the cost area.

Risk

Risk is also a factor. When we’re talking about implementing changes on production systems, anything that you do introduces some amount of risk. We’re often trying to minimize the risk of any changes that we make, while still maximizing the benefit. It’s a trade-off between at least these three areas – and often many others as well.

Proposing performance solutions with these characteristics in mind

When working with stakeholders, we often have to explain: “Hey, we’ve got three or four potential solutions here – we’re going to recommend this one, and here’s why.” In my team’s reports, we often include a sentence like this:

“The following list of recommendations is a balance between urgency and ease of implementation, slightly leaning towards urgency.”

In this case, the cost of “time to implement” was really important. It was urgent to get it resolved quickly. Other times, risk or value might take precedence. When proposing solutions to performance problems, it’s important to consider the various priorities and explain why you recommend one solution over the other. It’s not always about maximum benefit.
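One way to picture this kind of prioritization is a simple weighted score across benefit, cost, and risk. The Python sketch below is purely illustrative and is not the method Kerry’s team uses: the `Candidate` class, the 0-to-1 ratings, and the weights are all hypothetical values chosen to show how the recommended option can change when priorities shift.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    benefit: float  # 0..1, higher means a bigger expected improvement
    cost: float     # 0..1, higher means more money, time, or opportunity cost
    risk: float     # 0..1, higher means a riskier production change


def score(candidate, weights):
    """Weighted benefit minus weighted cost and risk (hypothetical formula)."""
    return (weights["benefit"] * candidate.benefit
            - weights["cost"] * candidate.cost
            - weights["risk"] * candidate.risk)


def rank(candidates, weights):
    """Return candidates ordered from most to least attractive."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Made-up ratings for the two IO examples mentioned in the talk.
candidates = [
    Candidate("solid-state disks", benefit=0.9, cost=0.8, risk=0.4),
    Candidate("separate read queues", benefit=0.5, cost=0.3, risk=0.3),
]

# Leaning towards urgency: time and money to implement weigh heavily.
urgent = {"benefit": 1.0, "cost": 1.5, "risk": 1.0}
# Maximum benefit only: cost and risk are ignored.
benefit_only = {"benefit": 1.0, "cost": 0.0, "risk": 0.0}

print([c.name for c in rank(candidates, urgent)])
print([c.name for c in rank(candidates, benefit_only)])
```

With the urgency-leaning weights, the cheaper queuing change ranks first. With benefit as the only criterion, the solid-state disks do. Real recommendations rest on judgment rather than a formula, but writing the weights down makes the stated priorities explicit for stakeholders.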


P99 CONF OCT. 23 + 24, 2024
