LLM inference is split into two phases: Prefill and Decode. The Prefill phase processes the prompt and populates the KV Cache with the context, while the Decode phase generates output tokens autoregressively, one at a time. Maximizing the KV Cache hit rate is critical to minimizing time-to-first-token latency, because every cached prefix the engine can reuse is context it does not have to recompute during Prefill. In this talk, I’ll share techniques NVIDIA’s TensorRT-LLM uses to maximize KV Cache hit rates for structured LLM workloads.
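
As background, here is a minimal, self-contained sketch of block-level prefix caching, the general mechanism behind KV Cache hits. This is not TensorRT-LLM’s API: the class name `PrefixKVCache`, the block size, and the hash-chain keying are illustrative assumptions. The point it demonstrates is that requests sharing a prompt prefix can reuse already-computed blocks, so Prefill only has to process the new suffix.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real engines use larger blocks)

class PrefixKVCache:
    """Toy block-level prefix cache: blocks are keyed by a hash chain over
    token-ID blocks, so a block only hits when its entire prefix also matches."""

    def __init__(self):
        self.blocks = {}  # chain hash -> stand-in for the block's KV tensors

    def _block_hashes(self, token_ids):
        hashes, running = [], sha256()
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            block = token_ids[start:start + BLOCK_SIZE]
            running.update(",".join(map(str, block)).encode())
            hashes.append(running.hexdigest())
        return hashes

    def prefill(self, token_ids):
        """Return (tokens reused from cache, tokens that must be recomputed)."""
        hashes = self._block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h in self.blocks:
                hit_blocks += 1
            else:
                break  # prefix match ends here; the rest is a cache miss
        for h in hashes[hit_blocks:]:
            self.blocks[h] = "kv-tensors"  # placeholder for real KV data
        reused = hit_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused

# Two requests sharing a 16-token system prompt: the second reuses the whole
# shared prefix, so Prefill only touches its 3 new tokens.
shared_prefix = list(range(100, 116))
cache = PrefixKVCache()
print(cache.prefill(shared_prefix + [1, 2, 3]))  # cold start: (0, 19)
print(cache.prefill(shared_prefix + [7, 8, 9]))  # prefix hit: (16, 3)
```

A real engine stores actual key/value tensors per block and evicts them under memory pressure; the sketch only shows why higher prefix reuse directly shrinks the Prefill work that sits on the time-to-first-token critical path.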
