Bryan Cantrill delivered an impassioned whirlwind of insight at the first-ever P99 CONF (2021) – the open source-focused conference designed for (and by) engineers who obsess over P99 latencies. P99 CONF is free and virtual: Register now to learn from luminaries like Cantrill, Liz Rice, Charity Majors, Alex Hidalgo, Gil Tene, Armin Ronacher and Malte Ubl.
In eager anticipation of Cantrill’s return for P99 CONF 2022, let’s take a look back at his talk on “Rust, Wright’s Law and The Future of Low-Latency Systems.” He predicted the coming decade will see two important changes with profound ramifications for low-latency systems: the rise of Rust-based systems, and the ceding of Moore’s Law to Wright’s Law. In this talk, he discusses these two trends, and especially their confluence — explaining why he believes the future of low-latency systems will include Rust programs in some surprising places.
Trust us: You’ll want to watch this one in its entirety. Here’s a peek into what it’s all about.
The Arbiter of Systems Performance
The talk begins with a look at the inextricable tangle of hardware/software co-design and performance: “Hardware is the ultimate arbiter of our systems’ performance. Hardware is the thing that actually runs our systems. And even though hardware is not the beginning and end of all systems performance, it will ultimately limit our performance.”
Cantrill believes three different kinds of revolutions have driven advances in systems performance, with economic feasibility looming over all of them:
- Process revolutions that propelled all designs
- Architectural revolutions that optimized for certain use cases
- Software revolutions that allowed us to better utilize hardware
In his words: “Those software revolutions really boil down to ‘How can we better utilize hardware?’ We don’t always phrase it that way. But when we advance performance with a software revolution, we are advancing performance because we’re using hardware better.”
What’s commonly become known as “Moore’s Law” – the observation that transistor density, speed and cost-efficiency double every two years or less – is the understandable starting point for discussions on process revolutions. (The full video sets the record straight on what Gordon Moore actually said.)
The 1980s and early 1990s were halcyon days in this respect: “You’d go get a new computer, and the one you bought just 18 months ago was obsolete. In fact, it was getting so fast, so quickly, that we actually had to slow computers down in order to play games that were written for the old machine. You would buy a game that was written for a PC/AT, and then it would be unplayable on a 386SX. Sorry if I sound like ‘Antiques Roadshow’ here, but if you’re of my vintage, you know what I’m talking about. You could actually unpress the Turbo button to make your applications go slower.”
But delve a little deeper and you’ll discover that the “good old days” of Moore’s Law were not golden to the core. DRAM wasn’t invited to the party: Although memory was becoming denser, it was not becoming faster. We were increasingly hitting the memory wall in the late 1980s – and especially the 1990s – to the point where caching became a necessity rather than an aid. Moore’s Law was continuing, as was Dennard scaling (as transistors get smaller, their power density stays constant). Yet more and more applications were bumping up against the memory wall. The only solution was a major architectural revolution.
By the mid-1990s, symmetric multiprocessing seemed poised to lead such an architectural revolution, putting us on the path to deliver throughput on multithreaded workloads. However, it necessitated a software revolution – writing code for multithreaded systems – and didn’t really help single-threaded latency. The 1990s also brought experiments like deep pipelining and VLIW (very long instruction word), neither of which really panned out. Moore’s Law was resulting in more transistors, but it was challenging to take advantage of them.
In the past decade or so, Moore’s Law has been very clearly decelerating, as has the broader process revolution. Dennard scaling stopped in 2006, the clock rate is practically limited to be under 5 gigahertz, and companies are playing with different ways to spend that silicon with multicores – getting many, many cores on a die. But that process revolution has indisputably decelerated, and company after company has been struggling to shrink their process nodes. (Watch the video for Cantrill’s take that talking about process nodes in terms of nanometers is “total fiction.”)
The bottom line for Cantrill is that Moore’s 1965 observation that processors are getting cheaper over time as their density and performance increase no longer holds true. These 3nm and 5nm transistors cost a lot because huge fabrication systems must be built to deliver them. But Gordon Moore isn’t the only person to look at the history of technology and wonder about its economics.
In 1936, Theodore Wright (no relation to the Wright brothers) was looking at the cost of an airplane over time. He discovered that the more of something you manufacture, the lower its unit cost becomes. More specifically, each time the cumulative volume roughly doubled, the unit cost dropped by 10 to 15%. And it’s not just economies of scale.
Cantrill explained, “As we make more of something, we get better at it. We’re finding process improvements, a bunch of little iterative stuff that adds up. It adds up to this really significant law: Wright’s Law. In 2013, Jessika Trancik and her team at the Santa Fe Institute observed that Wright’s Law is actually better than Moore’s Law at explaining what happened to transistor economics over time. Moore’s Law is ending, but Wright’s Law seems to be holding, especially for older process nodes. That is to say, the amount that we are spending per transistor does seem to be dropping, and we are making more and more and more transistors.”
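To make the arithmetic behind Wright’s Law concrete, here is a minimal sketch in Rust. It assumes the simplest reading of the rule: every doubling of cumulative volume multiplies unit cost by the same factor. The function name and the sample figures are illustrative only and do not come from the talk.

```rust
/// Unit cost once `cumulative_units` have been produced, under a simple
/// Wright's Law model (hypothetical helper, not from the talk).
///
/// * `first_unit_cost`: cost of the very first unit
/// * `drop_per_doubling`: fractional cost drop per doubling, e.g. 0.15 for 15%
/// * `cumulative_units`: total units produced so far (must be >= 1.0)
fn wright_unit_cost(first_unit_cost: f64, drop_per_doubling: f64, cumulative_units: f64) -> f64 {
    // Number of times cumulative volume has doubled since the first unit.
    let doublings = cumulative_units.log2();
    // Each doubling multiplies the unit cost by (1 - drop_per_doubling).
    first_unit_cost * (1.0 - drop_per_doubling).powf(doublings)
}
```

Under this model, a 15% drop per doubling compounds quickly: after ten doublings (1,024 cumulative units), `wright_unit_cost(1.0, 0.15, 1024.0)` leaves the unit cost at roughly 20% of the first unit’s cost.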
So what are the implications of processors becoming cheaper to produce?
- If Wright’s Law continues to hold, compute will be economically viable in more places that were previously confined to hard logic – for example, we’re already seeing CPUs on the network interface card (NIC), CPUs next to flash, and on the spindle.
- This is true even on die, where chiplets have made it easier than ever to build a heterogeneous system — and where mixed process nodes have demanded more sophistication
- Having more compute in many more places is particularly germane to system performance. More compute close to data (SmartNICs, open-channel SSDs, on-spindle compute) lowers latency; bringing data to special-purpose compute (general-purpose graphics processing units, field-programmable gate arrays) increases throughput.
Now’s a great time to revisit the point that Bryan made at the very beginning of his talk: “Hardware is the ultimate arbiter of our systems’ performance.” Given the hardware developments set off by Moore’s Law ceding to Wright’s Law, it’s time to rethink our system software so that it hydrates all the cores, including the many hidden cores (watch Timothy Roscoe’s USENIX keynote for some X-ray vision into the cores that even researchers overlook).
The Needed Software Revolution
As Cantrill puts it, “This coming revolution, where we’ve got many compute elements in many different places, is at some very deep level special purpose. Our general-purpose compute is still going to be that big core complex running on raw CPUs. That’s going to draw the most power, that’s going to have our highest performance, and that’s going to be connected to our fastest DRAM. But we are also going to have these compute elements that are special purpose. And, because they are special purpose, they’re going to be weird. In particular, they are going to have more compute; they’re going to be out of balance with respect to memory and with respect to storage. They’re not going to have much memory. They’re going to have very little non-volatile storage, especially non-volatile storage that they can write dynamically. In these very confined worlds, dynamic environments like Java and Go and Python, etc., etc., etc., are non-starters.”
C and C++ – the languages traditionally used in this domain – could certainly work here. However, they come with the burden of well-known challenges around safety and composability. So what’s the path forward?
Enter Rust, with its killer feature for special-purpose compute: no_std. Rust doesn’t need to depend on its own standard library. In fact, many of the most valuable aspects of Rust are not in the standard library (for instance, sum types, algebraic types, the ownership model, traits and hygienic macros). You can tap the vast majority of Rust’s power with no_std crates.
“A no_std crate cannot perform a heap allocation. It’s enforced at compile time. This is amazing to have something where I can truly have composability at a layer of the stack that has never had this kind of composability. This is incredible to be able to pull in different no_std crates and actually pull together a system – and the binaries are tight,” he said.
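As a rough illustration of what allocation-free Rust can look like, here is a minimal sketch of a fixed-capacity queue that relies only on language features (a sum type, const generics, ownership) and never touches the standard library or a heap. The names `FixedQueue` and `QueueError` are hypothetical and are not taken from the talk or from any Oxide code. In a real embedded crate the file would start with the crate-root attribute `#![no_std]`; it is shown as a comment here so the sketch also builds on an ordinary host.

```rust
// Crate-root attribute for a real no_std build (left commented so that the
// sketch also compiles as part of a normal host program):
// #![no_std]

/// Sum type describing why a push was rejected (hypothetical name).
#[derive(Debug, PartialEq)]
pub enum QueueError {
    Full,
}

/// Fixed-capacity FIFO queue. Storage is an inline array, so the queue can
/// live on the stack or in static memory; no allocator is involved.
pub struct FixedQueue<T: Copy, const N: usize> {
    slots: [Option<T>; N],
    head: usize, // index of the oldest element
    len: usize,  // number of elements currently stored
}

impl<T: Copy, const N: usize> FixedQueue<T, N> {
    /// Creates an empty queue with capacity `N`.
    pub fn new() -> Self {
        Self { slots: [None; N], head: 0, len: 0 }
    }

    /// Appends `value`, or reports `QueueError::Full` instead of growing.
    pub fn push(&mut self, value: T) -> Result<(), QueueError> {
        if self.len == N {
            return Err(QueueError::Full);
        }
        let tail = (self.head + self.len) % N;
        self.slots[tail] = Some(value);
        self.len += 1;
        Ok(())
    }

    /// Removes and returns the oldest element, if any.
    pub fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let value = self.slots[self.head].take();
        self.head = (self.head + 1) % N;
        self.len -= 1;
        value
    }
}
```

The design choice worth noticing is that running out of space becomes an explicit `Result` the caller must handle, rather than a hidden allocation. A caller might write `let mut q: FixedQueue<u8, 4> = FixedQueue::new();` and then check the result of each `q.push(..)`.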
You can put these tight Rust-based programs – without a runtime – in all sorts of interesting places, with all the power you need for the tasks you want to run. Plus, you’re doing it in a way that’s memory-protected and safe.
Case in point: Oxide Computer Company is developing a new (now open source) operating system appropriately called Hubris. It’s a message-passing microkernel-based system, it’s memory protected, and it’s built entirely in Rust: microkernel and tasks. The minimal Hubris system is 30k, and entirely realistic systems are at 200k – and that’s before focused optimization attempts.
Rust, Wright’s Law, and the Future
Cantrill sums it up:
“no_std is really without precedent. And Rust is actually the first language since C to meaningfully exist at the boundary of hardware and software. And this is what points us to the future. Wright’s Law means we’re going to have compute in more places – we are already seeing that. Those compute elements are going to be special purpose. Don’t wait for your general-purpose CPU to be shoved down to a smart NIC. It’s going to draw too much power. We can’t have memory that fast there. But what we can put down there is Rust.
“What we are going to see is many more exciting de novo hardware-facing Rust systems that – thanks to no_std – will be able to build on one another. It’s a very exciting time to be developing high-performance low-latency systems, and the Rust revolution is very much here.”
Again, you’ll want to watch Bryan’s full presentation on this topic, even if you’ve read this far. And remember that you can catch him live at P99 CONF 2022 – free and virtual.