ScalarLM vLLM Optimization with Virtual Channels
Today we are adding an optimization to vLLM workers in ScalarLM that can increase throughput by 8x by borrowing an old networking technique called virtual channels, used for flow control.
Reference: Virtual Channels Paper
The Problem
The problem is that GPUs need to batch together many requests when generating output tokens to get good utilization. LLMs are autoregressive, meaning they generate output tokens sequentially, one at a time. As a result, one of the dimensions of the matrix multiplies performed during output token generation is the batch size, i.e. the number of requests being decoded together. From our previous work benchmarking matrix multiplies on MI300X, we showed that the optimal size for this dimension is around 1000 or higher.
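To make the batch-size dimension concrete, here is a minimal sketch of a single decode-step matrix multiply, where the leading dimension is the number of requests being decoded together. The hidden size and dtype below are illustrative assumptions, not measurements from our benchmark.

```python
import numpy as np

# During autoregressive decode, each request contributes exactly one token
# per step, so the leading (M) dimension of the GEMM equals the batch size.
batch_size = 13      # e.g. the number of 8K-token requests that fit in the kv cache
hidden_size = 8192   # illustrative; depends on the model

activations = np.random.randn(batch_size, hidden_size).astype(np.float16)
weights = np.random.randn(hidden_size, hidden_size).astype(np.float16)

output = activations @ weights
print(output.shape)  # (13, 8192) -- an M of 13 leaves the matrix units
                     # mostly idle; an M near 1000 keeps them busy
```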
Reference: ScalarLM MI300X Benchmarking
However, the maximum batch size is limited by the total amount of GPU memory. vLLM allocates GPU memory for the kv cache when it boots up. For Llama 3.3 70B on an MI300X, the kv cache can fit about 108 thousand tokens, or 13 requests that each have a sequence length of 8 thousand tokens.
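As a quick back-of-the-envelope check on those numbers, worst-case admission control simply divides the cache capacity by the per-request reservation; the variable names below are only for illustration.

```python
# Capacity math from the numbers above: a kv cache holding ~108K tokens,
# with every request reserved at its full 8K-token sequence length.
kv_cache_tokens = 108_000
max_seq_len = 8_000

max_concurrent_requests = kv_cache_tokens // max_seq_len
print(max_concurrent_requests)  # 13 -- far below the ~1000 batch size
                                # that keeps the GPU busy
```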
If a request comes in that doesn't fit into the cache, it fails: vLLM responds with the dreaded rate limit error, which the client must handle, typically with backoff. Backoffs leave the GPUs underutilized.
Existing ScalarLM Optimization
ScalarLM already has an optimization to eliminate rate limits. Instead of pushing requests through a load balancer to vLLM workers, ScalarLM pushes all inference requests into a unified persistent queue. Before they allocate space in the kv cache, requests are just strings, which easily fit into a disk-backed persistent queue. vLLM workers then compute the maximum number of requests that can fit into the kv cache, and pull exactly that many requests out of the queue. This keeps the batch size relatively high, and makes it impossible to run out of space and throw an error.
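A minimal sketch of that pull-based loop is below. The queue and engine methods (pull, free_kv_cache_tokens, submit) are hypothetical stand-ins, not ScalarLM's or vLLM's actual APIs; they only show the shape of the worst-case admission logic.

```python
import time

def worker_loop(queue, llm_engine, max_seq_len):
    """Pull exactly as many requests as fit in the kv cache, assuming the worst case."""
    while True:
        # Worst-case reservation: every request may need max_seq_len tokens.
        free_slots = llm_engine.free_kv_cache_tokens() // max_seq_len

        if free_slots == 0:
            time.sleep(0.01)  # wait for running requests to finish
            continue

        # Pull at most free_slots requests; the queue never rejects a request,
        # it simply holds it until some worker has room, so there is no rate
        # limit error to propagate back to the client.
        for request in queue.pull(max_items=free_slots):
            llm_engine.submit(request)
```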
Virtual Channels and Flow Control
It turns out that we can do even better than this. vLLM workers don't know exactly how many tokens a request will take until it finishes. vLLM puts the tokenizer in a different process from the API server to decouple the two so they don't block each other, but this means the REST API server doesn't know how many input tokens a request will need until it has already sent it. So the API server must be conservative and plan for the worst case, assuming each request will take the maximum number of input tokens. This is highly inefficient when the maximum is, say, 32K tokens but requests only use 4K tokens on average: each request reserves 8x more kv cache space than it actually needs.
We introduce virtual channels and flow control to handle this. When the vLLM API process boots up, it starts with one credit for each slot in the kv cache and distributes them among virtual channels. When it pulls a request from the queue and submits it to the LLM engine, it deducts the worst case, e.g. 32K credits. The LLM engine receives the request and runs the tokenizer, so it now knows how many kv cache slots the input tokens need, and it sends an ACK back to the API server with any remaining credits. For example, if the request actually used only 4K input tokens, the ACK returns 28K credits to the API server, which can use them to pull more requests from the queue. The tokenizer runs much faster than the LLM, so virtual channels and credits quickly converge on exactly how many requests fit into the kv cache without wasting any space.
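The sketch below shows this credit accounting under the numbers used in this post (a 32K worst-case input reservation). The ApiServer and Ack types are hypothetical illustrations of the protocol, not ScalarLM's implementation.

```python
from dataclasses import dataclass
from queue import Queue

MAX_INPUT_TOKENS = 32_000  # worst-case reservation per request


@dataclass
class Ack:
    request_id: int
    returned_credits: int  # unused portion of the worst-case reservation


class ApiServer:
    """Credit-based flow control between the API process and the LLM engine."""

    def __init__(self, kv_cache_tokens: int, ack_queue: Queue):
        self.credits = kv_cache_tokens  # one credit per kv cache token slot
        self.ack_queue = ack_queue

    def try_submit(self, request, engine) -> bool:
        # Reserve the worst case up front; if there aren't enough credits,
        # leave the request in the persistent queue for later.
        if self.credits < MAX_INPUT_TOKENS:
            return False
        self.credits -= MAX_INPUT_TOKENS
        engine.submit(request)
        return True

    def drain_acks(self):
        # The engine tokenizes the request, learns its true input length, and
        # ACKs back the difference, e.g. 32K - 4K = 28K credits.
        while not self.ack_queue.empty():
            ack = self.ack_queue.get_nowait()
            self.credits += ack.returned_credits
```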
Additional Optimization
Virtual channel credits also allow us to introduce one more optimization. Just as the vLLM LLM engine doesn't know how many input tokens a request uses until it runs the tokenizer, it doesn't know how many output tokens a request will produce until the request hits a stopping criterion. So again, each request must assume the worst case, e.g. max_tokens=1024. As soon as a request hits its stopping criterion and frees its kv cache space, the LLM engine sends back an ACK with the number of newly available kv cache tokens. This mitigates load imbalance when one request generates many output tokens while most requests finish quickly.
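Continuing the hypothetical sketch above, a finished request ACKs back both its unused output-token reservation and the kv cache slots it is freeing. The request fields used here are assumptions for illustration.

```python
def credits_returned_on_finish(request) -> int:
    """Credits ACKed back to the API server when a request hits its stopping criterion."""
    # Output slots reserved up front but never generated, e.g. a request with
    # max_tokens=1024 that stopped after 100 tokens returns 924 credits here.
    unused_output = request.max_tokens - request.num_generated_tokens

    # kv cache slots the finished request occupied and is now freeing.
    freed_kv = request.num_prompt_tokens + request.num_generated_tokens

    return unused_output + freed_kv
```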
Results
Both of these optimizations are now in ScalarLM. Running Qwen3 32B on a TensorWave pod of 512 MI325X GPUs, a single ScalarLM client can submit up to 147,456 concurrent requests to fill the kv cache. Virtual channel credits ensure that almost no kv cache slots are wasted, and the pull-based architecture with a persistent queue ensures that there are never any rate limit errors.