LLM deployments are driving massive GPU demand and cost. This talk presents a generic architecture for offloading KV-cache tensors to a disaggregated shared store, enabling GPU-initiated I/O for efficient storage and retrieval. We’ll cover the system requirements for offloading and retrieval, their impact on platform design and performance, and a mathematical model for predicting the performance gains across LLMs and hardware configurations, supported by initial results.
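
To give a flavor of the sizing math such a predictive model rests on, here is a minimal back-of-envelope sketch: it estimates the per-token KV-cache footprint of a transformer and compares the time to fetch a cached prefix from a store against recomputing it during prefill. All model shapes and hardware numbers below are illustrative assumptions, not the talk's actual model or measurements.

```python
# Back-of-envelope sketch (assumed numbers, not the talk's model): estimate
# the KV-cache footprint and the prefill compute a cache hit avoids.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache per token: K and V tensors across every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def prefill_flops(n_params: float, n_tokens: int) -> float:
    """Standard ~2*N FLOPs-per-token approximation for a forward pass."""
    return 2 * n_params * n_tokens

# Illustrative 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 32, 128)   # 512 KiB per token
context = 4096
cache_bytes = per_token * context             # ~2 GiB per sequence

# Gain from fetching a cached prefix instead of recomputing it: compare
# transfer time at an assumed effective storage bandwidth with prefill
# time at an assumed sustained GPU throughput.
store_bw = 25e9     # assumed storage bandwidth, bytes/s
gpu_flops = 400e12  # assumed prefill throughput, FLOP/s
t_fetch = cache_bytes / store_bw
t_prefill = prefill_flops(7e9, context) / gpu_flops

print(f"KV cache: {cache_bytes / 2**30:.2f} GiB for {context} tokens")
print(f"fetch {t_fetch * 1e3:.1f} ms vs recompute {t_prefill * 1e3:.1f} ms "
      f"-> speedup ~{t_prefill / t_fetch:.1f}x")
```

The crossover point where fetching beats recomputing shifts with model size, context length, and storage bandwidth, which is exactly the kind of trade-off a predictive model across LLMs and hardware has to capture.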
