
Continuous Batching

Continuous batching, sometimes called dynamic batching or in-flight batching, is an LLM serving technique in which new incoming requests join a running batch immediately rather than waiting for the current batch to complete. Standard static batching wastes GPU cycles when requests in a batch finish at different times: the GPU sits partially idle waiting for the longest-generating request to complete. Continuous batching fills these gaps by inserting new requests into the freed slots, dramatically improving GPU utilization and aggregate throughput. The technique was popularized by vLLM in 2023 and is now standard in production LLM serving stacks, including TensorRT-LLM (where it is called in-flight batching), Text Generation Inference (TGI), Triton Inference Server, and most managed inference platforms. Continuous batching enables 2x-10x throughput improvements over static batching on the same hardware, with minimal latency impact. AI governance teams using continuous-batching infrastructure document the configuration as part of their inference architecture lineage. The technique is the dominant production pattern for high-volume LLM serving in 2024-2025.
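To make the mechanism concrete, the following is a minimal sketch of a continuous-batching scheduler loop in Python. It is not vLLM's or TensorRT-LLM's actual API: the `Request`, `decode_step`, and `serve` names are hypothetical, and the decode step is a stub standing in for one batched forward pass of the model. The point is the admission logic: after every decode step, finished requests leave the batch and waiting requests fill the freed slots immediately, instead of the scheduler waiting for the entire batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_finished(self) -> bool:
        # Finished when the token budget is spent (a real server also checks EOS).
        return len(self.generated) >= self.max_new_tokens

def decode_step(batch: list) -> None:
    # Stub for one batched forward pass that appends one token per active request.
    for req in batch:
        req.generated.append("<tok>")

def serve(incoming: deque, max_batch_size: int = 8) -> list:
    """Toy continuous-batching loop: refill freed slots before every decode step."""
    active: list = []
    completed: list = []
    while incoming or active:
        # Admit waiting requests into any free slots (this is the "continuous" part).
        while incoming and len(active) < max_batch_size:
            active.append(incoming.popleft())
        decode_step(active)
        # Retire finished requests immediately; their slots free up next iteration.
        still_running = []
        for req in active:
            (completed if req.is_finished() else still_running).append(req)
        active = still_running
    return completed

if __name__ == "__main__":
    queue = deque(Request(f"prompt {i}", max_new_tokens=(i % 4) + 1) for i in range(20))
    done = serve(queue)
    print(f"completed {len(done)} requests")
```

With static batching, the same queue would be processed in fixed groups of eight, and every group would take as long as its slowest member; here, short requests exit after one or two steps and their slots are reused at once, which is the source of the throughput gain.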

Continuous-batching infrastructure through Centralpoint: Centralpoint sits above continuous-batching inference stacks such as vLLM and TensorRT-LLM, metering usage consistently regardless of backend. The model-agnostic platform routes to any LLM, keeps prompts local, supports generative and embedded models, and deploys chatbots through one line of JavaScript on any portal.


Related Keywords:
Continuous Batching