SESSION ON-DEMAND

All Things P99

The event for developers who care about P99 percentiles and high-performance, low-latency applications

KV Caching Strategies for Latency-Critical LLM Applications

LLM inference is split into two phases: Prefill and Decode. The Prefill phase populates the KV Cache with the prompt context, while the Decode phase autoregressively generates output tokens one at a time. Maximizing the KV Cache hit rate is critical to minimizing time-to-first-token latency. In this talk, I’ll share the techniques NVIDIA’s TensorRT-LLM uses to maximize KV Cache hit rates for structured LLM workloads.
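To make the hit-rate idea concrete, here is a minimal, hypothetical Python sketch of prefix-keyed KV Cache reuse. It illustrates the general technique rather than TensorRT-LLM's actual implementation; the names PrefixKVCache and BLOCK_SIZE and the token values are assumptions for the example. Requests that share a prompt prefix, such as a common system prompt, only need Prefill over their new suffix.

# Toy prefix-keyed KV cache: real engines store per-block key/value tensors on
# the GPU, but the hit-rate logic is the same -- reuse the longest cached
# prefix so Prefill only runs over the new suffix of the prompt.

from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; engines often use larger blocks)

class PrefixKVCache:
    def __init__(self) -> None:
        # Map a tuple of prefix tokens -> placeholder for that block's KV tensors.
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        matched = 0
        for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            prefix = tuple(tokens[: start + BLOCK_SIZE])
            if prefix not in self.blocks:
                break
            matched = start + BLOCK_SIZE
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record full blocks of this prompt so later requests can reuse them."""
        for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            prefix = tuple(tokens[: start + BLOCK_SIZE])
            self.blocks.setdefault(prefix, f"kv-for-{len(prefix)}-tokens")

cache = PrefixKVCache()
system_prompt = list(range(16))               # shared instruction tokens
request_a = system_prompt + [100, 101, 102, 103]
request_b = system_prompt + [200, 201, 202, 203]

cache.insert(request_a)                       # first request warms the cache
hit = cache.lookup(request_b)                 # second request reuses the shared prefix
print(f"request_b: {hit}/{len(request_b)} tokens served from cache, "
      f"Prefill only needs the remaining {len(request_b) - hit}")

Running this reports that 16 of request_b's 20 tokens are served from the cache, so Prefill only has to process the 4 new tokens, which is the same effect that cuts time-to-first-token in a real serving engine.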

22 minutes

John Thomson, Deep Learning Algorithms Engineer at the University of Waterloo

John Thomson is a Deep Learning Algorithms Engineer at the University of Waterloo.