This talk discusses why LLM inference is slow and the key metrics used to measure latency. It then covers techniques that make inference fast, including batching strategies, parallelism, and prompt caching. Not all latency problems are engineering problems, though: the talk closes with tricks for hiding latency at the application level.
