SESSION ON-DEMAND

All Things P99

The event for developers who care about P99 percentiles and high-performance, low-latency applications

LLM Inference Optimization

This talk discusses why LLM inference is slow and the key latency metrics used to measure it. It covers techniques that make LLM inference fast, including batching strategies, parallelism, and prompt caching (see the sketch below). Not all latency problems are engineering problems, though: the talk also covers interesting tricks for hiding latency at the application level.
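As one illustration of the prompt-caching idea the talk touches on, here is a minimal, hypothetical Python sketch. The names `encode_prefix` and `generate` are made-up stand-ins, not a real inference API: the point is only that a repeated prompt prefix can pay the expensive prefill cost once and be reused across requests.

```python
# Minimal sketch of prompt (prefix) caching. All names are hypothetical
# illustrations: encode_prefix stands in for the expensive prefill step
# that builds a KV cache, and generate for decoding that reuses it.
from functools import lru_cache


@lru_cache(maxsize=128)
def encode_prefix(prefix: str) -> tuple:
    # Stand-in for the prefill pass: a real engine would run the model
    # over the prompt once and keep the resulting KV cache.
    print(f"prefill: computing KV cache for {len(prefix)} chars")
    return tuple(ord(c) for c in prefix)  # toy "KV cache"


def generate(prefix: str, question: str) -> str:
    kv_cache = encode_prefix(prefix)  # a cache hit skips prefill entirely
    # Stand-in for decoding: only the new tokens (the question) add latency.
    return f"answer({question!r}) using cached prefix of {len(kv_cache)} tokens"


system_prompt = "You are a helpful assistant. Always answer concisely."
print(generate(system_prompt, "What is P99 latency?"))      # prefill runs once
print(generate(system_prompt, "Why is decoding slow?"))     # prefix cache hit
```

Real inference servers apply the same idea at the KV-cache level, so requests sharing a long system prompt skip most of the prefill work.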

31 minutes
Register for access to all 60+ sessions available on demand.
Fill out the form to watch this session from the P99 CONF 2025 livestream. You’ll also get access to all available recordings.

Chip Huyen, LLM Inference Optimization

I'm Chip Huyen, a writer and computer scientist. I'm building infrastructure for real-time ML. I also teach Machine Learning Systems Design at Stanford. Previously, I was with Snorkel AI, NVIDIA, Netflix, Primer, Baomoi.com (acquired by VNG). I helped launch Coc Coc - Vietnam's second most popular web browser with 20+ million monthly active users. In my free time, I travel and write. After high school, I went to Brunei for a 3-day vacation that turned into a 3-year trip through Asia, Africa, and South America. During my trip, I worked as a Bollywood extra, a casino hostess, and a street performer. I'm the author of four bestselling Vietnamese books. I'm working on an English book on machine learning interviews.