LLM Infrastructure

Continuous Batching

Continuous batching is an inference serving technique that dynamically adds and removes requests from a running batch at each generation step, maximizing GPU utilization without waiting for all requests to complete.

What is Continuous Batching?

Continuous batching eliminates the central inefficiency of static batching: short requests waste GPU cycles waiting for longer requests in the same batch to finish.

In static batching, all requests in a batch start and end together. If one request generates 10 tokens and another generates 500, the GPU sits idle for the shorter request's slot during 490 generation steps. Continuous batching (also called iteration-level scheduling) evicts completed requests immediately and inserts waiting requests into freed slots at every iteration.
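The core loop can be sketched in a few lines of Python. This is a minimal illustration of iteration-level scheduling, not any particular engine's scheduler: the Request class, the step_fn callback (standing in for one batched forward pass that yields one token per running request), and the single waiting queue are all simplifying assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

    def is_finished(self) -> bool:
        # Done when the request emits an end-of-sequence token or hits its limit.
        return (self.tokens and self.tokens[-1] == "<eos>") \
            or len(self.tokens) >= self.max_new_tokens

def serve(waiting: deque, max_batch_size: int, step_fn):
    """Iteration-level scheduling: admit and evict requests at every step."""
    running: list[Request] = []
    while running or waiting:
        # Refill freed slots from the waiting queue before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One forward pass produces one token per running request.
        for req, token in zip(running, step_fn(running)):
            req.tokens.append(token)
        # Evict completed requests immediately; their slots free up on the
        # next iteration instead of waiting for the whole batch to finish.
        running = [r for r in running if not r.is_finished()]
```

The essential property is the last line: a request leaves the batch the moment it finishes, and the next iteration's admission loop refills its slot.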

This approach increases throughput by 2-8x under realistic workloads where output lengths vary significantly. The technique requires careful KV cache management, since new requests entering mid-batch need their own cache allocation while departing requests release theirs. Modern serving engines such as vLLM, TensorRT-LLM, and SGLang all implement continuous batching by default.
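A toy allocator makes the cache-management constraint concrete. The sketch below assumes one fixed-size cache slot per request; production engines manage the KV cache at finer granularity (vLLM's PagedAttention, for instance, allocates fixed-size blocks on demand), but the admit/release discipline is the same.

```python
class KVCachePool:
    """Toy fixed-slot KV cache pool. Real engines allocate cache at block
    granularity rather than one monolithic slot per request."""

    def __init__(self, num_slots: int):
        self.free_slots = list(range(num_slots))
        self.owner = {}  # request id -> slot

    def allocate(self, request_id: str) -> int | None:
        # A new request can only join the running batch if a slot is free.
        if not self.free_slots:
            return None
        slot = self.free_slots.pop()
        self.owner[request_id] = slot
        return slot

    def release(self, request_id: str) -> None:
        # An evicted or completed request returns its slot for the next admission.
        self.free_slots.append(self.owner.pop(request_id))
```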

Why does Continuous Batching matter?

Continuous batching is essential for cost-efficient LLM serving at scale. Without it, GPU utilization drops to 20-40% under variable-length workloads, meaning organizations pay for 2-5x more hardware than actually needed to handle their request volume.

How is Continuous Batching used in practice?

Every major LLM API provider uses continuous batching internally. A production deployment serving a coding assistant — where responses range from 20 tokens for short answers to 2,000 tokens for code generation — achieves 4x higher throughput with continuous batching compared to static batching on identical hardware.
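A back-of-envelope calculation shows where a gain of that magnitude comes from. The numbers below are hypothetical, chosen to resemble the workload described above; they are not measurements.

```python
# Illustrative comparison of static vs. continuous batching (not a benchmark).
batch_size = 8
longest = 2000   # tokens generated by the longest request in a static batch
mean_len = 400   # assumed mean response length across the batch

static_slot_steps = batch_size * longest         # every slot runs until the longest finishes
useful_tokens = batch_size * mean_len            # tokens callers actually requested
utilization = useful_tokens / static_slot_steps  # ~0.20 for these numbers

# Continuous batching refills freed slots, so slot utilization approaches 1.0,
# giving roughly 1 / utilization (about 5x here) on this toy workload.
print(f"static slot utilization: {utilization:.0%}")
```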

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.