Google’s Database Black Belt leader, Kerry Osborne, explores what goes on in your brain as you attempt to solve performance problems
Kerry Osborne – Google’s Database Black Belt leader – delivered a P99 CONF 2023 talk (How to Improve Your Ability to Solve Complex Performance Problems) that’s still getting considerable attention, even months later. We’re pleased to announce that Kerry will return for P99 CONF 24 to deliver a follow-up talk.
Note: P99 CONF is a technical conference on performance and low-latency engineering. It’s virtual, free, and highly interactive. This year’s agenda spans Rust, Zig, Go, C++, compute/infrastructure, Linux, Kubernetes, and of course, databases.
As we’re already ramping up for P99 CONF 24, we thought it was a good time to look back at some highlights from Kerry’s original talk. This blog is the second of three:
- The Complexity of Solving Performance Problems
- Your Brain Solving Performance Problems (this blog)
- 3 Approaches for Solving Complex Performance Problems
Here’s a lightly edited transcript of part of his previous talk.
***
Our brains work in two different modes, particularly as we’re solving problems: intuitive and analytical. Intuitive is the mode that our brain is in when we’re not actively and effortfully thinking about something. It’s automatic. Analytical is the mode where we’re actually applying effort and working diligently in a focused manner in our brains.
For example, if you see the very simple equation
1+1 =
you’re not going to apply any effort to solving it. You already know from your childhood that 1 + 1 = 2. The answer is immediate. That’s something that your brain does in automatic mode.
However, if you see
19 X 35 =
you’re probably going to have to do some mental gymnastics to reach the answer. We can simplify this with a technique called substitution.
I might notice that 19 X 35 is very close to 20 X 35, which I can pretty much do in automatic mode – that’s 700. If I take that a step further, I could say that it’s exactly 35 less than 700. I can also do that in my head pretty easily: that’s 665. That’s an example of a substitution. It’s one of those things that our brain will do – sometimes we do it on purpose, and sometimes our brain can do it to us when we’re unaware of it.
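Written out step by step, the substitution described above is just the distributive property applied to a nearby round number:

```latex
\begin{align*}
19 \times 35 &= (20 - 1) \times 35 \\
             &= 20 \times 35 - 35 \\
             &= 700 - 35 \\
             &= 665
\end{align*}
```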
Consider another example. Most of us have gotten in the car, driven to work, pulled into the parking lot, and not had any idea what happened between the time we backed out of the driveway and the time we got into the parking lot. That’s because our brains are very good at running in automatic mode. We can do some fairly complex things in this automatic mode – driving a car is not simple, after all. But we can do it and not be consciously thinking about the journey or the turns. We can do it all while concentrating or thinking about something else.
Our brains run in this automatic, intuitive mode all the time throughout our day-to-day activities. Unfortunately, a lot of biases can creep in when we’re in that mode – and often, we don’t even know that they’re happening.
For example, when I see a set of symptoms on a computer system or a database system, I have a tendency to think that I’ve got enough information at this point – that I know everything there is to know about the problem set. Often, that’s not the case at all. I have to constantly remind myself: “Hey, there may be other facts that you need to be looking at before you start trying to solve this problem.”
So what really happens in our brains when we start trying to solve a performance problem?
Step 1: Defining the problem
With a traditional problem-solving approach, the first step is defining the problem. That sounds super simple, but it’s actually more important than you might think. How we define the problem can significantly narrow the solution space, so we want to be very careful about the wording we choose.
For example, assume our problem is that our user contacted us and said “Hey, we’ve got this batch job that runs every night, from 1 am to 2 am. We have to deliver the end result to the next system in the pipeline by 2 am, but this system is no longer completing the job in that one-hour timeline. And it seems like the IO system is the main problem: it’s running slow, it’s not behaving the way we’re used to seeing it behave.”
How do we define that problem?
- “The batch process is not finishing in an hour” is a pretty good broad description of the problem.
- “We need to speed up the IO system, the IO system is slow” is a very narrow definition of the problem.
Step 2: Gathering data
Why is this problem definition so important? Because our next step is gathering data to understand what’s causing the problem.
Assume our definition is that the IO system is slow. Thanks to our biases, our brain will automatically want to discard any data that doesn’t have a bearing on what’s causing the IO system to be slow. That’s just the way our brains work. They’re going to discount that data – maybe not ignore it altogether, but certainly discount it and not pay much attention to it. That’s why how we define the problem is super important. We need to gather data in a methodical fashion.
Step 3: Postulating a reason for the problem
In the traditional way of solving problems, it’s very important that we have the mental discipline to defer this step until we’ve got a good definition and a good set of data. If we start postulating about the reason as soon as we get an idea in our head about what’s causing it, our brains start tricking us again. We’ll start evaluating the importance of any data that we see or any symptoms that we see based on that premature postulation – so we have to be careful about that.
Step 4: Listing the possible solutions
This is where we use the creative part of our brains to try to come up with opportunities for fixing the problem. Often, this is where we gather in a room, brainstorming with multiple people and drawing things on a whiteboard, etc. This is an important step! We want to give ourselves adequate time to do this and also have an environment where nothing is out of bounds in terms of suggestions for ideas. This is not the place to say “Well, that’s a stupid idea.” This is the place to throw everything out there and make a very comprehensive list of everything we can think of.
Step 5: Ordering the possible solutions
The fifth step is ordering the possible solutions. This is where we start potentially eliminating things based on cost or risk (as discussed in the previous blog). The goal is to create an ordered list of what we think will provide the greatest benefit with the least risk and cost. But remember that priorities vary and we need to weigh those things against one another. Once we have that list and stakeholder buy-in, we attempt to implement the possible solutions in the agreed-upon order.
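One way to picture this ordering step is as a simple scoring exercise. The sketch below is only an illustration, not a method from the talk: the `Candidate` fields, the 0-10 scales, the scoring rule (benefit minus weighted risk and cost), and the example candidates are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    benefit: float  # subjective estimate, 0-10 (assumed scale)
    risk: float     # subjective estimate, 0-10 (assumed scale)
    cost: float     # subjective estimate, 0-10 (assumed scale)


def order_candidates(candidates, risk_weight=1.0, cost_weight=1.0):
    """Sort candidates best-first using an assumed linear score."""
    def score(c):
        return c.benefit - risk_weight * c.risk - cost_weight * c.cost
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates for the slow batch job example.
candidates = [
    Candidate("Rewrite batch job", benefit=9, risk=5, cost=8),
    Candidate("Upgrade storage", benefit=6, risk=3, cost=6),
    Candidate("Add missing index", benefit=8, risk=2, cost=1),
]

default_order = [c.name for c in order_candidates(candidates)]
# Same candidates, but for a team that treats cost as a non-issue.
cost_blind_order = [c.name for c in order_candidates(candidates, cost_weight=0.0)]

print(default_order)
print(cost_blind_order)
```

Changing the weights changes the order, which is the point about priorities varying: the numbers are only a way to make the trade-offs explicit before asking stakeholders to agree on the list.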
Digression: More Thoughts on Gathering Data
On the subject of gathering data, I’m going to postulate that all modern computer systems are basically made up of many smaller, discrete components that we can monitor and measure. It then follows that making a computer system run faster depends on being able to identify which of those components are consuming most of the time. It doesn’t help us to fix the thing that takes 1% or 2% of the overall time when something else is taking up 80% or 90% of the time. We need to focus on the area where most of the time is being consumed and see if we can attack that, because that’s where we have the most opportunity to improve the performance of the overall system.
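To make the "where is the time going" idea concrete, here is a minimal sketch that ranks components by their share of total elapsed time. The component names and timings are hypothetical, invented for illustration; in practice they would come from whatever instrumentation the system provides.

```python
def rank_by_time_share(timings):
    """Return (component, seconds, share_of_total) tuples, largest first."""
    total = sum(timings.values())
    if total <= 0:
        return []
    ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    return [(name, secs, secs / total) for name, secs in ranked]


# Hypothetical per-component timings (seconds) for one batch run.
sample_timings = {
    "sort": 540.0,
    "full_table_scan": 2880.0,
    "logging": 72.0,
    "network": 108.0,
}

ranked = rank_by_time_share(sample_timings)
for name, secs, share in ranked:
    print(f"{name:<16}{secs:>8.0f}s {share:7.1%}")
```

With these made-up numbers, one component accounts for 80% of the run, so shaving the 2% spent in logging could not meaningfully change the overall result.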
This also implies something else that’s super critical: that our system is well instrumented. Most database systems today are very, very well-instrumented. We can get down to very small things and know exactly how many times they were executed in a process, exactly how long they took, how long they took historically, etc. We have all that information at our disposal.
Many years ago, in the early days of relational databases, none of this was instrumented, so we had very little idea as to where the time was going. There was a lot of guesswork going on. Now, most things are instrumented. It’s essential that we pay attention to that, and that we instrument our software well.
People argue that instrumentation is time-consuming, it slows down processing, etc. There is a cost associated with that, but it’s super critical to have that information available. Even if it costs a few percentage points on performance, it’s usually worth it when you’re supporting a large production system.
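For application code that is not already instrumented, a lightweight starting point might look like the sketch below. It is an assumption-laden illustration rather than a description of how any particular database does it: the `timed` and `report` helpers and the step names are hypothetical, and it records only call counts and total elapsed time per named step.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# step name -> [call count, total elapsed seconds]
_stats = defaultdict(lambda: [0, 0.0])


@contextmanager
def timed(name):
    """Record one call and its elapsed wall-clock time under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = _stats[name]
        entry[0] += 1
        entry[1] += time.perf_counter() - start


def report():
    """Return a snapshot of the counts and total seconds per step."""
    return {name: {"calls": calls, "seconds": secs}
            for name, (calls, secs) in _stats.items()}


# Hypothetical steps standing in for real work.
for _ in range(3):
    with timed("load_rows"):
        sum(range(10_000))

with timed("write_output"):
    time.sleep(0.01)

for name, row in report().items():
    print(f"{name:<14}calls={row['calls']} seconds={row['seconds']:.4f}")
```

The overhead here is two clock reads and a dictionary update per step, which is the kind of small, predictable cost the paragraph above argues is usually worth paying.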