The Cloud Storage Triad: Latency, Cost, Durability


Editor’s note: This post is by P99 CONF 24 speaker Chris Riccomini. Chris will be co-presenting with Rohan Desai; their talk is titled “Building a Cloud Native LSM on Object Storage.”

This talk discusses the design and implementation of SlateDB, an open source cloud native storage engine built as a log-structured merge-tree (LSM) on top of an object store like S3, Google Cloud Storage (GCS), or Azure Blob Storage (ABS). LSMs are traditionally built assuming data will reside on local storage. Building an LSM on object storage allows SlateDB to benefit from object storage replication and durability guarantees while presenting unique latency and cost challenges. During the talk, Chris and Rohan will discuss the design decisions and tradeoffs they faced when building SlateDB.

Get more details + access a free P99 CONF 24 pass

To read more from Chris, follow his Substack newsletter, Materialized View. To learn more about SlateDB, visit the project’s site.

***

A new theorem for primary persistence on object stores.

I believe that the future of database persistence is object storage—S3, Google Cloud Storage, and so on. New systems like Neon, WarpStream [$], and Turbopuffer persist data in object storage to offer infinite retention, durability, replication, data warehouse integration, and so on.

Object stores present some challenges, though. Historically, they have had higher latency, have lacked atomic writes, and have charged a lot for API requests. For these reasons, many systems write to a cache in front of the object store. Data is then flushed to object storage asynchronously.

Write caches come with their own set of challenges. If durability and consistency are important, caches must be replicated and serve consistent reads. Consensus protocols enter the picture. Neon, for example, uses its own brand of Paxos with Safekeepers and Pageservers.

I don’t think we’ll need these write caches in the future. Object storage is changing fast. Latency on S3 has improved dramatically with S3 Express. Google Cloud Storage is going to offer similar latencies soon. As for atomic writes, nearly every object store except S3 already offers compare-and-swap (CAS), called preconditions. I believe S3 Express will adopt preconditions soon. And systems like Tigris already offer both preconditions and low-latency reads and writes.

Given these trends, it’s reasonable to expect object stores to converge on low-latency reads and writes with atomicity. Such systems would allow us to move write-ahead logs into the object storage layer and eliminate write caches.
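
As a concrete illustration, here is a minimal sketch, assuming a hypothetical conditional-PUT interface (not SlateDB’s actual API), of how a CAS-style precondition lets writers append WAL objects directly in the object store, with no replicated write cache or consensus service in front of it:

```rust
use std::collections::HashMap;
use std::io;
use std::sync::Mutex;

/// Hypothetical interface (not SlateDB's actual API): a conditional PUT that
/// succeeds only if the key does not yet exist, i.e. a CAS-style precondition.
trait ConditionalStore {
    /// Returns Ok(true) if this writer created the object, Ok(false) if it lost the race.
    fn put_if_absent(&self, key: &str, bytes: &[u8]) -> io::Result<bool>;
}

/// Toy in-memory stand-in for an object store, just to make the sketch runnable.
#[derive(Default)]
struct MemStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl ConditionalStore for MemStore {
    fn put_if_absent(&self, key: &str, bytes: &[u8]) -> io::Result<bool> {
        let mut objects = self.objects.lock().unwrap();
        if objects.contains_key(key) {
            return Ok(false); // another writer already claimed this key
        }
        objects.insert(key.to_string(), bytes.to_vec());
        Ok(true)
    }
}

/// Claim the next WAL object by retrying successive sequence numbers until a
/// conditional PUT wins. The object store's precondition is the only arbiter;
/// no replicated write cache or consensus service is needed.
fn append_wal_entry<S: ConditionalStore>(store: &S, mut seq: u64, entry: &[u8]) -> io::Result<u64> {
    loop {
        let key = format!("wal/{seq:020}.wal");
        if store.put_if_absent(&key, entry)? {
            return Ok(seq);
        }
        seq += 1; // lost the race; try the next WAL slot
    }
}

fn main() -> io::Result<()> {
    let store = MemStore::default();
    let seq = append_wal_entry(&store, 0, b"put k1=v1")?;
    println!("WAL entry persisted as wal/{seq:020}.wal");
    Ok(())
}
```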

I recently began hacking on a project to test this theory out. The project—dubbed SlateDB—is a cloud-native log-structured merge tree (LSM) embedded key-value database. My goal is to answer the question: what does an LSM tree look like if all writes are persisted directly to object storage? I don’t think even RocksDB-Cloud does this; it has a pluggable WAL that can write to disk, Kafka, or Kinesis.

Recall that an LSM tree normally has an in-memory sorted list of key-value pairs called a MemTable. Writes are inserted into this MemTable. To keep writes durable, writes are also sent to a write-ahead log (WAL). When MemTables get large, they’re frozen (made immutable) and flushed to disk as sorted-string tables (SSTs).
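
For readers who want that write path spelled out, here is a minimal, self-contained sketch of the flow just described; the types and the tiny flush threshold are illustrative stand-ins, not SlateDB’s actual code:

```rust
use std::collections::BTreeMap;

/// Illustrative types for the classic LSM write path (not SlateDB's code).
struct Lsm {
    memtable: BTreeMap<Vec<u8>, Vec<u8>>, // in-memory sorted key-value pairs
    wal: Vec<(Vec<u8>, Vec<u8>)>,         // stand-in for the write-ahead log
    ssts: Vec<Vec<(Vec<u8>, Vec<u8>)>>,   // flushed sorted-string tables (SSTs)
    memtable_limit: usize,                // freeze-and-flush threshold, in entries
}

impl Lsm {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        // 1. Append to the WAL so the write is durable before we acknowledge it.
        self.wal.push((key.to_vec(), value.to_vec()));
        // 2. Insert into the sorted, in-memory MemTable.
        self.memtable.insert(key.to_vec(), value.to_vec());
        // 3. When the MemTable gets large, freeze it and flush it as an SST.
        if self.memtable.len() >= self.memtable_limit {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // The MemTable is already sorted, so the SST is just a sequential dump.
        // Once the SST is durable, the WAL entries it covers can be dropped.
        let frozen = std::mem::take(&mut self.memtable);
        self.ssts.push(frozen.into_iter().collect());
        self.wal.clear();
    }
}

fn main() {
    let mut db = Lsm {
        memtable: BTreeMap::new(),
        wal: Vec::new(),
        ssts: Vec::new(),
        memtable_limit: 2,
    };
    db.put(b"a", b"1");
    db.put(b"b", b"2"); // crosses the limit and triggers a flush
    assert_eq!(db.ssts.len(), 1);
    assert!(db.memtable.is_empty());
}
```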

The most naive implementation of a cloud-native LSM might simply send all WAL writes directly to object storage. This works and is reasonably low latency with S3 Express. Unfortunately, it’s expensive when you have a lot of writes. PUTs are $0.0025 per 1,000 requests. A high-volume service that sustains 10,000 writes per second would cost 2.5¢ per second, or roughly $65,000 per month.

Object storage pricing isn’t changing anytime soon. If cost remains fixed and SlateDB is constrained to object storage, there’s no choice but to batch writes. Batching decouples the client write rate from the PUT calls sent to object storage. SlateDB can be configured to send writes to object storage every N milliseconds. All writes accrued during that window are sent as a single PUT. A 10ms window means a maximum of 100 PUTs per second are sent to object storage. A 100 PUT-per-second upper bound implies a maximum cost of $0.00025 per second, or about $650 per month—much more reasonable. WarpStream [$] works this way (agents default to 50ms windows), as does Turbopuffer (which defaults to a 100ms window).
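
For concreteness, here is the arithmetic behind both figures, using the PUT price quoted above and assuming a 30-day month:

```rust
fn main() {
    const PUT_PRICE_PER_REQUEST: f64 = 0.0025 / 1000.0; // $0.0025 per 1,000 PUTs
    const SECONDS_PER_MONTH: f64 = 30.0 * 24.0 * 3600.0; // assume a 30-day month

    // Unbatched: every client write becomes its own PUT.
    let unbatched_puts_per_sec = 10_000.0;
    let unbatched_monthly = unbatched_puts_per_sec * PUT_PRICE_PER_REQUEST * SECONDS_PER_MONTH;

    // Batched: a 10ms window caps PUTs at 100 per second, regardless of write rate.
    let batched_puts_per_sec = 1_000.0 / 10.0;
    let batched_monthly = batched_puts_per_sec * PUT_PRICE_PER_REQUEST * SECONDS_PER_MONTH;

    println!("unbatched: ~${unbatched_monthly:.0} per month"); // ~$64,800
    println!("batched:   ~${batched_monthly:.0} per month");   // ~$648
}
```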

Batched writes leave clients with a decision: they can either wait for the batch to complete or they can let writes happen asynchronously. If a client waits for a successful batch write, their latency goes up. If the client chooses not to wait, their latency drops but they lose durability.

This triad of latency, cost, and durability presents an easy mental model: something I’m facetiously calling the latency, cost, durability (LCD) theorem. The LCD theorem holds high API costs fixed and forces clients to choose between money, durability, and latency.

Clients that want low latency and high durability have to pay. Clients that care less about latency or durability can lower cost at the expense of latency (waiting for writes to finish) or durability (treating writes as asynchronous).

Thinking about CAP for obj stores writes: cost/latency/durability.

⬆️ cost ⬇️ latency ⬆️ durability = sync writes
⬇️ cost ⬆️ latency ⬆️ durability = sync batch writes
⬇️ cost ⬇️ latency ⬇️ durability = async batch writes

Reduce API costs by batching. Clients can wait (durability) or not (latency).

— Chris Riccomini (@chris.blue) Apr 19, 2024 at 9:17 AM

LCD-style systems have a rather elegant design. Clients can pick batch windows and maximum batch sizes to control latency—similar to Apache Kafka’s linger.ms and batch.size settings. And a simple async/await or future-based write API allows clients to choose whether to wait for writes to successfully persist to object storage or not.
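
One way such an API could look is sketched below. The names are hypothetical, and for the sake of a self-contained example a plain thread and channel stand in for a real async runtime and futures: put returns a handle callers can block on for durability or drop for a faster fire-and-forget write, while window and max_batch play the roles of linger.ms and batch.size.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// One queued write plus a channel to notify the caller once its batch is durable.
struct WriteRequest {
    key: Vec<u8>,
    value: Vec<u8>,
    done: Sender<()>,
}

struct Db {
    tx: Sender<WriteRequest>,
}

impl Db {
    /// `window` plays the role of Kafka's linger.ms; `max_batch` of batch.size.
    fn open(window: Duration, max_batch: usize) -> Db {
        let (tx, rx) = channel::<WriteRequest>();
        thread::spawn(move || run_flusher(rx, window, max_batch));
        Db { tx }
    }

    /// Returns a handle the caller can block on for durability, or simply drop.
    fn put(&self, key: &[u8], value: &[u8]) -> Receiver<()> {
        let (done, wait) = channel();
        let req = WriteRequest { key: key.to_vec(), value: value.to_vec(), done };
        self.tx.send(req).expect("flusher thread is gone");
        wait
    }
}

/// Accumulates writes for up to `window`, then flushes them as one object-store
/// PUT (stubbed out here) and signals every caller that was waiting on the batch.
fn run_flusher(rx: Receiver<WriteRequest>, window: Duration, max_batch: usize) {
    let mut batch: Vec<WriteRequest> = Vec::new();
    let mut deadline = Instant::now() + window;
    loop {
        let timeout = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(timeout) {
            Ok(req) => batch.push(req),
            Err(RecvTimeoutError::Disconnected) if batch.is_empty() => return,
            Err(_) => {} // window elapsed; fall through and flush what we have
        }
        if batch.len() >= max_batch || Instant::now() >= deadline {
            if !batch.is_empty() {
                // A real implementation would issue one object-store PUT for the
                // whole batch here, instead of one PUT per client write.
                for req in batch.drain(..) {
                    let _ = req.done.send(()); // wake callers awaiting durability
                }
            }
            deadline = Instant::now() + window;
        }
    }
}

fn main() {
    let db = Db::open(Duration::from_millis(10), 1024);

    // Durable write: block until the batch containing it has been flushed.
    db.put(b"k1", b"v1").recv().expect("flush failed");

    // Fire-and-forget write: lower latency, but no durability guarantee yet.
    let _ = db.put(b"k2", b"v2");
}
```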

I’m still experimenting with SlateDB, but I think it could be a very useful library (Yes, it’s written in Rust). The library isn’t yet open source (it doesn’t have compaction and reads only exist in a PR). I would love to get some more contributors, though. Shoot me a DM if you want to help! It’s already surfaced a number of interesting questions:

  • Can we move the control plane into object storage (using preconditions)?
  • Can we achieve multi-region durability on S3 Express using quorum bucket writes?
  • Which compaction strategies work well for object storage (tiered looks promising)?
  • Can we flush MemTables directly to level-0 SSTs, thereby eliminating the WAL altogether?

I plan to answer these questions in future posts.

Book

Support the Materialized View newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.

Disclaimer

I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.

Hear more from Chris (and Rohan) at P99 CONF 24
