Using flamegraphs to optimize the CPU usage of Maestro kernel, a Unix-like operating system written in Rust.
Editor’s note: This post is by P99 CONF speaker Luc Lenôtre, Site Reliability Engineer at Clever Cloud. He’ll be talking about “Writing a Kernel in Rust: Code Quality and Performance” at P99 CONF 24. Maestro kernel began as a C-based school project and transitioned to Rust for better code quality. Now, it’s in a clean-up and performance enhancement phase. Luc’s tech talk will share key lessons learned along the way.
This post was originally published on Luc’s blog.
Background
Maestro is a Unix-like kernel and operating system written from scratch in Rust. It's meant to be lightweight and compatible enough with Linux to be usable in everyday life.
The first commit of the kernel dates back to December 22nd, 2018, at 3:18 in the morning (the best time to write code, of course). It started as a school project.
It was originally implemented using the C language and continued to be for roughly a year and a half, until the codebase became too hard to keep clean.
At that moment, I decided to switch to Rust (my first project in this language), which offered several advantages:
- Restart the project from the beginning, using lessons learned from previous mistakes
- Be a bit more innovative than just writing a Linux-like kernel in C; after all, one might as well just use Linux at that point
- Use the safety of the Rust language to alleviate some of the difficulty of kernel programming: Rust's type system shifts part of the responsibility for memory safety from the programmer to the compiler
In kernel development, debugging is very hard for several reasons:
- Documentation is often hard to find, and BIOS implementations may be flawed (more often than you would think)
- On boot, the kernel has full access to the memory and is allowed to write where it should not (its own code, for example)
- Troubleshooting memory leaks is not easy. Tools such as valgrind cannot be used
- gdb can be used with QEMU and VMware, but the kernel may behave differently when running on a different emulator or virtual machine. Also, some emulators do not support gdb at all (VirtualBox, for example)
- gdb support in QEMU and VMware is missing some features, and gdb itself sometimes crashes
All these issues are good reasons to use a memory-safe language, so as to run into them as rarely as possible.
Overall, the use of Rust in the kernel allowed for the implementation of a lot of safeguards. And I believe that it is, to this day, the best decision I have made for this project.
Optimisation: CPU profiling
In a previous blog post introducing the Maestro project, I mentioned that it needed cleaning up and optimization.
Indeed, anyone who tested the OS with the provided .iso file probably observed that the system is very slow.
It turns out that humans are bad at predicting how a computer program behaves. This makes debugging hard, but it also makes it difficult to predict which parts of a program should be optimized first.
To make this task easier, it is possible to use flamegraphs, by Brendan Gregg.
Usually, you need tools such as the perf command (Linux perf) to profile the execution of a program. However, this command is not available for observing the kernel itself (at least not in Maestro's current state, and not at every stage of execution). So we need to find another solution.
Attempt 1: Poor man’s profiler
I am not going to explain what the poor man’s profiler is since there is a dedicated website for it.
The simplicity of this profiler, along with the presence of a gdb stub in QEMU (and also VMware) made it an interesting choice.
When running a kernel in QEMU (with the -s option), one can connect to the gdb stub using this command:
(gdb) target remote :1234
And then debug the kernel as if it were a userspace program.
The command connects gdb to port 1234 of the local machine using the gdb remote protocol. For VMware, the default port is 8832. By the way, I discovered how to use this feature by reading spooky secrets of the CIA on WikiLeaks.
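For reference, launching the kernel with the gdb stub enabled might look roughly like this (the ISO file name is just an example):
qemu-system-x86_64 -cdrom maestro.iso -s -S
Here, -s opens the gdb stub on TCP port 1234, and -S pauses the virtual CPU at startup until gdb tells it to continue.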
I tried to write a program that used the gdb protocol to profile the kernel, but it turns out there is a delay between each command that slows down the execution to the point where the guest OS becomes unusable.
So I had to find another solution.
Attempt 2: QEMU plugin
Spoiler: This attempt was successful. Yay!
QEMU TCG (Tiny Code Generator) provides an API for plugins, which allows observing the execution of the guest.
The API provides direct access to a lot of information but also lacks a lot of features. In particular:
- Access to the current state of CPU registers
- Access to the guest’s memory
A workaround for these missing features is to call internal QEMU functions by re-declaring their prototypes.
The plugin works by pausing the execution of the program at a given sampling frequency. For each sample, the plugin dumps the current callstack to an output file. All the collected callstacks can then be aggregated by another program.
This tool relies on the assumption that the more time is spent in a function, the more likely it is to appear in the callstack when a sample is taken. It is not an exact science, though: different runs might produce slightly different flamegraphs.
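To give an idea of the overall shape of such a plugin, here is a heavily simplified sketch using the standard qemu-plugin.h API. The real profiler walks the guest callstack through re-declared internal QEMU functions, which is omitted here; this sketch merely records the program counter of every Nth executed translation block (samples.txt and SAMPLE_INTERVAL are arbitrary names):
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static FILE *out;
static uint64_t exec_count;
#define SAMPLE_INTERVAL 10000 /* arbitrary: take one sample every N blocks */

/* Called every time a translated block is executed by the guest */
static void tb_exec(unsigned int vcpu_index, void *userdata)
{
    if (++exec_count % SAMPLE_INTERVAL == 0) {
        /* `userdata` carries the virtual address of the block's first instruction */
        fprintf(out, "%#" PRIx64 "\n", (uint64_t)(uintptr_t)userdata);
    }
}

/* Called when QEMU translates a new block: attach an execution callback to it */
static void tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    uint64_t pc = qemu_plugin_tb_vaddr(tb);
    qemu_plugin_register_vcpu_tb_exec_cb(tb, tb_exec, QEMU_PLUGIN_CB_NO_REGS,
                                         (void *)(uintptr_t)pc);
}

static void plugin_exit(qemu_plugin_id_t id, void *userdata)
{
    fclose(out);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    out = fopen("samples.txt", "w");
    qemu_plugin_register_vcpu_tb_trans_cb(id, tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}
The collected samples (full callstacks, in the real plugin) can then be folded into the stack-collapsed text format that Brendan Gregg's flamegraph.pl script turns into an interactive SVG.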
The guest OS is still slowed down, but it is a lot better than the previous attempt as it remains usable.
For more details, the source code is available here.
Visualize the boot sequence
The following flamegraph (created from this commit) shows the boot sequence, from the entry into the kernel’s code, until bash has started.
The kernel has been compiled in debug mode because compiling in release mode provides too few samples.
In this interactive flamegraph (open it in a new tab), you can click on each block to zoom in on it. You can also use the upper-right button to search for a function name.
The bottom blocks are separated because they work on different stacks:
- kernel_main represents the execution before entering userspace (starting the first process)
- event_handler (maybe I should rename this to interrupt_handler, by the way) represents the execution of code that is triggered when an interruption occurs
- syscall_handler represents the execution of system calls
- The rest are miscellaneous detached stacks that are often required for context switching
Note: An interruption is an event on the system, either internal to the CPU (example: invalid access to memory), or external (examples: pressing a key on the keyboard, the disk is ready for I/O, …).
When the CPU receives an interruption, it suspends the execution of the code to handle the interruption instead. Then, the execution is resumed.
Each interruption has an associated ID that identifies it (although things are more complicated nowadays, as several components can share the same interruption ID).
System calls also work by triggering an interruption under x86 (32 bits), with the int 0x80 instruction. Under x86_64, the syscall instruction is used instead.
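As an illustration of the 32-bit convention (this is not code from Maestro), a userspace program can invoke write through int 0x80 like this, following the Linux i386 ABI that Maestro aims to be compatible with:
static long sys_write(int fd, const void *buf, unsigned long len)
{
    long ret;

    /* eax = syscall number (4 = write on i386), ebx/ecx/edx = arguments */
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"(4), "b"(fd), "c"(buf), "d"(len)
                      : "memory");
    return ret;
}
The int 0x80 instruction traps into the kernel's interrupt handler, which dispatches to the requested system call and places the return value back in eax.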
Analysis
Now that we have a flamegraph, let’s use it to find which parts of the kernel slow everything down.
A consequence of profiling in debug mode is that a lot of bounds checks and overflow checks (which are not present in release mode) are accounted for. The same goes for certain assertions, notably in the memory allocators, which inflates their footprint in the flamegraph.
The ones I expected
Disk accesses
The kernel spends most of the time reading or writing files. This is due to:
- The absence of a file cache, which forces the kernel to access the disk for every operation on filesystems
- Disk accesses being slow, mainly (but not only) because of the use of Parallel ATA
Also, to get notified when the controller is ready, the driver uses polling (checking the status register in a loop, which is very CPU-intensive) rather than idling until an interruption occurs.
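To illustrate the difference (a rough sketch, not Maestro's actual driver code): with Parallel ATA, the status register of the primary bus is read from I/O port 0x1F7, and bit 7 (BSY) stays set while the drive is busy. Polling means spinning on that register:
#define ATA_PRIMARY_STATUS 0x1F7
#define ATA_STATUS_BSY     0x80

/* Read one byte from an I/O port */
static inline unsigned char inb(unsigned short port)
{
    unsigned char val;

    __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* CPU-intensive: spins until the controller clears the BSY bit */
static void ata_wait_polling(void)
{
    while (inb(ATA_PRIMARY_STATUS) & ATA_STATUS_BSY)
        ;
}
The alternative is to put the CPU to sleep (or schedule another process) and let the controller signal completion through IRQ 14, the interruption assigned to the primary ATA bus.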
ELF parser
The ELF parser is slow because it uses inefficient algorithms to traverse data structures.
For example, finding a symbol by name is done in linear time (O(n)), by traversing all symbols until finding the one that matches.
It gets even worse when loading a kernel module: for every "undefined" symbol in the module, the kernel needs to find the corresponding symbol in its own symbol table, which raises the time complexity to quadratic (O(n^2)).
A way to improve this would be to use the hash table already embedded in ELF files and, when it is not present, to cache the symbols in a hashmap ourselves.
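For reference, the hash table embedded in ELF files (the .hash section) buckets symbols using the standard SysV hash function, so a lookup only has to scan a single bucket chain instead of the entire symbol table. A sketch of that hash function:
#include <stdint.h>

static uint32_t elf_hash(const char *name)
{
    uint32_t h = 0;

    while (*name) {
        h = (h << 4) + (uint8_t)*name++;
        uint32_t g = h & 0xf0000000;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}
The lookup then becomes bucket[elf_hash(name) % nbucket], followed by a walk of the chain array until the names match, rather than a scan of the whole symbol table.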
The unexpected ones
This is where the flamegraph becomes really useful.
I was expecting the TTY to take a significant amount of time, but it turns out it barely shows up in the flamegraph.
strlen
The kernel seems to be spending quite some time (6.6% of the total) in the strlen function. Let's look at the code (in C):
size_t strlen(const char *s)
{
size_t n = 0;
// TODO optimize
while (s[n])
++n;
return n;
}
Well, yeah… That could be optimized.
In the past, glibc used special masks to check 4 bytes at a time (or 8, on 64-bit platforms). It also made sure to respect memory alignment. musl has a similar, although simpler, implementation.
Now, glibc seems to be using compiler internals.
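For the sake of illustration, here is a sketch of the word-at-a-time idea (not glibc's or musl's actual code): align to a 4-byte boundary first, then test a whole 32-bit word per iteration using the classic "has a zero byte" bit trick:
#include <stddef.h>
#include <stdint.h>

size_t strlen_fast(const char *s)
{
    const char *p = s;

    /* Advance byte by byte until `p` is aligned on a 4-byte boundary */
    while ((uintptr_t)p % sizeof(uint32_t)) {
        if (!*p)
            return p - s;
        ++p;
    }
    /* (w - 0x01010101) & ~w & 0x80808080 is non-zero iff one byte of `w` is zero */
    const uint32_t *w = (const uint32_t *)p;
    while (!(((*w - 0x01010101u) & ~*w) & 0x80808080u))
        ++w;
    /* A zero byte lies somewhere in this word: finish byte by byte */
    p = (const char *)w;
    while (*p)
        ++p;
    return p - s;
}
The initial alignment also guarantees that the word loads never cross a page boundary past the end of the string, so they cannot fault on unmapped memory.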
Feeding the entropy pool
The entropy pool is a buffer that stores noise to be used by random number generators (for files such as /dev/random or /dev/urandom, or the getrandom system call).
The kernel uses interruptions as noise to feed the pool. Because this can leak data about the system (which is a security issue), I pass it through the ChaCha20 algorithm to make that data unusable to an attacker.
The problem is that each time an interruption occurs, the kernel recomputes a ChaCha20 block, which is slow. The ring buffer implementation also seems to be slow.
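One possible fix (a sketch, not Maestro's code, with chacha20_block as a hypothetical helper producing one 64-byte keystream block) is to keep the last block around and only recompute it once its 64 bytes have been consumed, so the common path in the interrupt handler is just a XOR and an index increment:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper producing one 64-byte ChaCha20 keystream block */
extern void chacha20_block(uint8_t out[64]);

static uint8_t keystream[64];
static size_t keystream_off = sizeof(keystream); /* start exhausted to force a refill */

/* Called from the interrupt handler with one byte of raw noise */
void entropy_feed_byte(uint8_t noise)
{
    if (keystream_off == sizeof(keystream)) {
        chacha20_block(keystream); /* slow path, taken once every 64 bytes */
        keystream_off = 0;
    }
    uint8_t masked = noise ^ keystream[keystream_off++];
    /* push `masked` into the entropy ring buffer (omitted) */
    (void)masked;
}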
Conclusion
This article describes only the first pass of optimizations. Once these issues are fixed, the system will speed up, but other parts of the kernel will then stand out as expensive relative to the rest in the flamegraph.
The more iterations are made, the more optimized the kernel becomes.
While this flamegraph allows you to optimize the usage of the CPU, it does not address issues related to memory usage (and in particular, memory leaks). Exploring memory flamegraphs could be the subject of another article.