PagedAttention
PagedAttention is the memory-management algorithm at the heart of vLLM. Introduced in a 2023 paper by UC Berkeley researchers, it addresses one of the most painful problems in LLM serving: KV cache memory fragmentation.

The technique organizes the KV cache (the running state of self-attention during autoregressive generation) into fixed-size blocks, typically 16 tokens each, that are allocated on demand and tracked through a virtual-memory-style indirection layer. Because blocks are allocated only as a sequence actually grows, the wasted memory of static pre-allocation disappears, enabling roughly 2-4x more concurrent requests on the same hardware. The same indirection also enables copy-on-write for parallel sampling, prefix caching across requests that share a prompt, and efficient handling of dynamic batches.
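To make the indirection concrete, the sketch below models a block table with reference counting and copy-on-write forking in Python. It is a minimal illustration under stated assumptions, not vLLM's actual code: the names (BLOCK_SIZE, BlockAllocator, Sequence) are hypothetical, and a real implementation would also manage the GPU key/value tensors that each block backs.

```python
BLOCK_SIZE = 16  # tokens per KV cache block; the paper's default granularity


class BlockAllocator:
    """A fixed pool of physical KV cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()  # raises IndexError if the pool is exhausted
        self.ref_counts[block] = 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)

    def share(self, block: int) -> None:
        self.ref_counts[block] += 1


class Sequence:
    """One request's logical token stream mapped onto physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or the sequence is new): allocate one
            # more block on demand. This is what removes the waste of
            # statically pre-allocating the maximum sequence length.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_counts[last] > 1:
                # Copy-on-write: the last block is shared with a forked
                # sequence, so take a private copy before writing new KV.
                # (Real code would also copy the block's KV tensors here.)
                fresh = self.allocator.allocate()
                self.allocator.free(last)
                self.block_table[-1] = fresh
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Parallel sampling: the child shares every physical block with the
        # parent; blocks are copied lazily, only when one branch writes.
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in self.block_table:
            self.allocator.share(block)
        return child
```

At attention time, the kernel walks each sequence's block table to gather keys and values from non-contiguous physical memory, which is where the analogy to paging in operating-system virtual memory comes from.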
The algorithm has spread well beyond vLLM: it has been ported to TensorRT-LLM (which implements it as a paged KV cache), Text Generation Inference (TGI), and other inference frameworks, and it is widely credited as one of the most impactful inference-systems innovations of 2023. AI governance teams encounter PagedAttention as a transparent infrastructure optimization: it changes how KV cache memory is managed, not the quality of the model's output.
PagedAttention-backed serving in Centralpoint: Centralpoint sits above vLLM and other PagedAttention-enabled inference stacks, alongside cloud LLM APIs, in one model-agnostic platform. Tokens are metered per skill, prompts stay local, and chatbots deploy through one line of JavaScript with audit-ready governance.
Related Keywords:
PagedAttention