SESSION ON-DEMAND

All Things P99

The event for developers who care about P99 percentiles and high-performance, low-latency applications

KV Caching Strategies for Latency-Critical LLM Applications

LLM inference is split into two phases: prefill and decode. The prefill phase populates the KV cache with the prompt's context, while the decode phase autoregressively generates output tokens one at a time. Maximizing the KV cache hit rate is critical for minimizing time-to-first-token latency. In this talk, I'll share techniques NVIDIA's TensorRT-LLM uses to maximize KV cache hit rates for structured LLM workloads.
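To illustrate the idea, here is a minimal Python sketch of block-level prefix reuse, the general technique behind KV cache reuse: token sequences are split into fixed-size blocks, each block is keyed by a hash chained through its prefix, and a block's KV entries can only be reused when its entire prefix also hit. This is a toy model under assumed names (`PrefixKVCache`, `BLOCK_SIZE`), not TensorRT-LLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (toy value; real systems use e.g. 32 or 64)

class PrefixKVCache:
    """Toy block-level prefix cache: a block hits only if every block
    before it also hit, since its key is chained through the prefix."""

    def __init__(self):
        # chained block hash -> cached block (stands in for KV tensors)
        self.blocks = {}

    @staticmethod
    def _chain_hash(parent_hash, block_tokens):
        # Key depends on the block's tokens AND the hash of its prefix,
        # so identical blocks under different prefixes never collide.
        h = hashlib.sha256()
        h.update(parent_hash.encode())
        h.update(",".join(map(str, block_tokens)).encode())
        return h.hexdigest()

    def lookup_and_insert(self, tokens):
        """Return the number of prompt tokens whose KV entries were reused."""
        reused = 0
        parent = "root"
        # Only full blocks are cacheable; a trailing partial block is recomputed.
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            parent = self._chain_hash(parent, block)
            if parent in self.blocks:
                reused += BLOCK_SIZE          # prefix still matches: cache hit
            else:
                self.blocks[parent] = block   # miss: compute KV, then cache it
        return reused

cache = PrefixKVCache()
system_prompt = list(range(8))                # shared system-prompt tokens
req1 = system_prompt + [100, 101, 102, 103]   # first request: all misses
req2 = system_prompt + [200, 201, 202, 203]   # second request: shared prefix hits

print(cache.lookup_and_insert(req1))  # → 0 (cold cache)
print(cache.lookup_and_insert(req2))  # → 8 (the 8 system-prompt tokens reused)
```

The example shows why structured workloads benefit: requests that share a long common prefix (a system prompt, few-shot examples) skip prefill for the shared blocks, which directly reduces time to first token.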

22 minutes

John Thomson, Deep Learning Algorithms Engineer at the University of Waterloo


P99 CONF OCT. 21 + 22, 2026

Register for Your Free Ticket

Registration includes free 30-day access to O’Reilly’s ebook library.