LLM inference is split into two phases: Prefill and Decode. The Prefill phase processes the prompt and populates the KV Cache with the context, while the Decode phase generates output tokens autoregressively, one at a time. Maximizing the KV Cache hit rate is critical to minimizing time-to-first-token latency, because every cached prefix the engine can reuse is context it does not have to recompute during Prefill. In this talk, I’ll share techniques NVIDIA’s TensorRT-LLM uses to maximize KV Cache hit rates for structured LLM workloads.
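
As background, here is a minimal, self-contained sketch of block-level prefix caching, the general mechanism behind KV Cache hits. This is not TensorRT-LLM’s API: the class name `PrefixKVCache`, the block size, and the hash-chain keying are illustrative assumptions. The point it demonstrates is that requests sharing a prompt prefix can reuse already-computed blocks, so Prefill only has to process the new suffix.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real engines use larger blocks)

class PrefixKVCache:
    """Toy block-level prefix cache: blocks are keyed by a hash chain over
    token-ID blocks, so a block only hits when its entire prefix also matches."""

    def __init__(self):
        self.blocks = {}  # chain hash -> stand-in for the block's KV tensors

    def _block_hashes(self, token_ids):
        hashes, running = [], sha256()
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            block = token_ids[start:start + BLOCK_SIZE]
            running.update(",".join(map(str, block)).encode())
            hashes.append(running.hexdigest())
        return hashes

    def prefill(self, token_ids):
        """Return (tokens reused from cache, tokens that must be recomputed)."""
        hashes = self._block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h in self.blocks:
                hit_blocks += 1
            else:
                break  # prefix match ends here; the rest is a cache miss
        for h in hashes[hit_blocks:]:
            self.blocks[h] = "kv-tensors"  # placeholder for real KV data
        reused = hit_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused

# Two requests sharing a 16-token system prompt: the second reuses the whole
# shared prefix, so Prefill only touches its 3 new tokens.
shared_prefix = list(range(100, 116))
cache = PrefixKVCache()
print(cache.prefill(shared_prefix + [1, 2, 3]))  # cold start: (0, 19)
print(cache.prefill(shared_prefix + [7, 8, 9]))  # prefix hit: (16, 3)
```

A real engine stores actual key/value tensors per block and evicts them under memory pressure; the sketch only shows why higher prefix reuse directly shrinks the Prefill work that sits on the time-to-first-token critical path.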
