Speculative Decoding and Disaggregated Serving
The New Tricks That Make LLMs Feel Fast
The best serving breakthroughs in AI right now are not new foundation models. They are computer architecture moves applied to inference, because modern decoding is often memory-bound. Once prefill is done, every new token is dominated by moving and updating attention state, and that state grows with context length. That is why the industry has become obsessed with the KV cache, not as an implementation detail, but as the object that decides latency, throughput, and cost.
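To see why the KV cache dominates, it helps to do the arithmetic. A minimal sketch, using illustrative Llama-3-8B-like numbers (32 layers, 8 KV heads with GQA, head dimension 128, FP16 values) rather than any specific vendor's published figures:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes of KV cache: two tensors (K and V) per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative Llama-3-8B-like config: 32 layers, 8 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)            # 131072 bytes = 128 KiB
full_128k = kv_cache_bytes(32, 8, 128, seq_len=128_000, batch=1) / 2**30  # 15.625 GiB
print(per_token, full_128k)
```

At 128 KiB per token, a single 128k-token context eats roughly 15.6 GiB of HBM before you serve a second request, which is why every token of decode is a memory-traffic problem, not a FLOPs problem.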
The first big lever is speculative decoding. Instead of generating one token at a time with the expensive model, you let a smaller draft model propose multiple tokens, then you verify them with the large model in a single batched forward pass and accept the ones that match, so the final output is the same as what the large model would have produced on its own. If it sounds like branch prediction, it should. vLLM treats this as a first-class feature specifically because it can reduce inter-token latency in memory-bound regimes, which is exactly where long-context systems live. https://docs.vllm.ai/en/latest/features/spec_decode/
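The accept-or-correct loop can be sketched in a few lines. This is a toy of the greedy variant only, with `target_next` and `draft_next` as hypothetical stand-in functions for the two models; a real system verifies all draft positions in one batched forward pass rather than a loop:

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One round of greedy speculative decoding (toy sketch).

    target_next / draft_next: functions mapping a token sequence to the
    next token, standing in for real models. Returns the tokens produced
    this round: the accepted draft prefix plus one target token, so the
    step always makes >= 1 token of progress per target-model call.
    """
    # 1. Draft k tokens cheaply, one at a time.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verify with the target model (batched in real systems).
    accepted, ctx = [], list(prefix)
    for t in draft:
        correct = target_next(ctx)
        if t != correct:
            accepted.append(correct)  # target overrides the first mismatch
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all k matched: bonus token
    return accepted

# Toy models over integer tokens: target counts up mod 10; draft agrees
# with it until it sees a 3, where it guesses wrong.
target = lambda seq: (seq[-1] + 1) % 10
draft  = lambda seq: (seq[-1] + 1) % 10 if seq[-1] != 3 else 7
print(speculative_step(target, draft, [0]))  # [1, 2, 3, 4]
```

Three draft tokens are accepted and the mismatch at position four is replaced by the target's token, so one expensive verification pass yielded four tokens instead of one.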
What changed recently is that speculative decoding is becoming an ecosystem, not a trick. The SGLang team published SpecForge as a framework for training draft models that port cleanly into their serving stack, which is a tell that serious operators want repeatable workflows, not one-off hacks. https://github.com/sgl-project/SpecForge AMD’s developer hub goes further and documents reproducible speculative decoding performance work in a real serving setup, including concrete speedup numbers for its tutorial configuration. https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/speculative_decoding_deep_dive.html
The second big lever is disaggregated inference, splitting prefill and decode onto different resources so each phase can scale independently. The hard part is obvious the moment you do this: you must transfer the KV cache efficiently from the prefill side to the decode side, and that turns your serving system into a distributed memory system. The DistServe retrospective summarizes how the field evolved after the initial push, and it lists a whole family of follow-on systems that focus specifically on KV cache transfer, scheduling, and network constraints. https://hao-ai-lab.github.io/blogs/distserve-retro/ A separate recent overview frames this evolution as eras of KV cache handling, with disaggregation as the key inflection point because it forces explicit cache movement and cache economics. https://www.modular.com/blog/the-five-eras-of-kvcache
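The shape of the handoff is worth internalizing. A minimal sketch, with hypothetical `PrefillWorker` / `DecodeWorker` classes and `pickle` standing in for the NVLink/RDMA transfer a real system would use (real systems also ship raw per-layer GPU tensors, not Python objects):

```python
import pickle  # stand-in for an RDMA / NVLink transfer in real systems

class PrefillWorker:
    def prefill(self, request_id, prompt_tokens):
        # Toy "KV cache": in a real system this is per-layer K/V tensors in HBM.
        kv = {"tokens": list(prompt_tokens), "layers": 32}
        # Serialize for transfer; real deployments move raw GPU buffers
        # instead of round-tripping through host memory like this.
        return request_id, pickle.dumps(kv)

class DecodeWorker:
    def __init__(self):
        self.caches = {}  # request_id -> reconstructed KV state

    def load_cache(self, request_id, blob):
        self.caches[request_id] = pickle.loads(blob)

    def decode(self, request_id, n=3):
        kv, out = self.caches[request_id], []
        for _ in range(n):
            tok = len(kv["tokens"])  # toy "model": next token = position
            kv["tokens"].append(tok)
            out.append(tok)
        return out

p, d = PrefillWorker(), DecodeWorker()
rid, blob = p.prefill("req-1", [10, 11, 12])
d.load_cache(rid, blob)   # the expensive, bandwidth-bound step in practice
print(d.decode(rid))      # [3, 4, 5]
```

Everything interesting in the disaggregation literature lives in that `load_cache` step: how big the blob is, what link it crosses, and whether it overlaps with compute.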
The third lever is kernel-level efficiency, especially attention. If attention is where you spend your memory bandwidth and your time, you want kernels that overlap data movement with compute and exploit hardware features like Tensor Memory Accelerator paths and low-precision math. FlashAttention-3 is a good illustration of the direction, focusing on asynchrony and hardware-aware scheduling on Hopper-class GPUs. https://pytorch.org/blog/flashattention-3/
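The core algorithmic idea behind the FlashAttention family, online softmax over tiles, fits in a few lines. This is a pure-Python sketch of one query row, not the kernel itself; the real win comes from doing this per tile in SRAM so the full score vector never touches HBM:

```python
import math

def attention_row_tiled(q, keys, values, block=2):
    """One query row of attention, computed block-by-block with the
    online-softmax trick: keep a running max and running normalizer so
    no full score vector is ever materialized."""
    m = float("-inf")            # running max of scores
    l = 0.0                      # running softmax normalizer
    acc = [0.0] * len(values[0])  # running weighted sum of values
    for start in range(0, len(keys), block):
        for k, v in zip(keys[start:start + block], values[start:start + block]):
            s = sum(qi * ki for qi, ki in zip(q, k))  # dot(q, k)
            m_new = max(m, s)
            scale = math.exp(m - m_new)               # rescale prior state
            w = math.exp(s - m_new)
            l = l * scale + w
            acc = [a * scale + w * vi for a, vi in zip(acc, v)]
            m = m_new
    return [a / l for a in acc]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
vals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = attention_row_tiled(q, keys, vals)
```

The rescale-by-`exp(m - m_new)` step is what lets the kernel stream tiles in any order and still produce an exact softmax, which is precisely the kind of compute-for-bandwidth trade the paragraph above describes.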
These levers converge on one uncomfortable conclusion. Serving is now a memory hierarchy problem. GPU HBM is the hottest tier, but long context pushes you to decide what spills to host memory, what spills to SSD, and what can be reused across requests. That is why you are seeing discussion of extending context storage beyond HBM, and why new memory ideas get framed explicitly for inference. Even the idea of high-bandwidth flash is marketed in terms of augmenting HBM for inference workloads, which tells you where operators feel the pain. https://www.tomshardware.com/pc-components/dram/sandisks-new-hbf-memory-enables-up-to-4tb-of-vram-on-gpus-matches-hbm-bandwidth-at-higher-capacity
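The spill-and-promote policy is the crux of that hierarchy. A toy two-tier sketch, with a hypothetical `TieredKVCache` class, a bounded "HBM" tier and an LRU spill to a larger "host" tier; real systems move paged KV blocks rather than opaque blobs:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV store: small 'HBM' tier with LRU spill to 'host'.
    Illustrative only; real offloading systems page fixed-size KV blocks."""
    def __init__(self, hbm_slots=2):
        self.hbm = OrderedDict()   # hot tier, bounded, LRU-ordered
        self.host = {}             # cold tier, effectively unbounded
        self.hbm_slots = hbm_slots

    def put(self, key, blob):
        self.hbm[key] = blob
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_slots:
            victim, vblob = self.hbm.popitem(last=False)  # evict LRU entry
            self.host[victim] = vblob                     # spill, don't drop

    def get(self, key):
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key], "hbm"
        blob = self.host.pop(key)   # slow path: promote back to hot tier
        self.put(key, blob)
        return blob, "host"

c = TieredKVCache(hbm_slots=2)
c.put("a", b"kv-a"); c.put("b", b"kv-b"); c.put("c", b"kv-c")  # "a" spills
print(c.get("a"))  # (b'kv-a', 'host') -- a hit, but from the slow tier
```

The uncomfortable part is visible in the last line: a "cache hit" from the wrong tier still costs you a PCIe or SSD round trip, so eviction policy shows up directly in tail latency.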
Low precision is the final accelerant because it reduces memory footprint and increases throughput, but it only works if you manage accuracy. NVIDIA positions FP8 as a supported datatype for higher throughput on H100-class hardware and documents how to use FP8 and FP4-style formats through Transformer Engine, which is another signal that inference efficiency is becoming standardized, not experimental. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html
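The accuracy management boils down to scaling-factor bookkeeping. A sketch of the amax/scale pattern FP8 recipes use, simplified onto a uniform 8-bit grid (real FP8 formats like E4M3 have a non-uniform floating-point grid, and Transformer Engine tracks amax history per tensor):

```python
def quantize(tensor, levels=256):
    """Per-tensor scaled quantization: the amax/scale bookkeeping used by
    FP8 recipes, shown on a uniform 8-bit grid for simplicity."""
    amax = max(abs(x) for x in tensor) or 1.0
    scale = (levels / 2 - 1) / amax          # map amax -> largest code (127)
    q = [round(x * scale) for x in tensor]   # these small ints get stored
    return q, scale

def dequantize(q, scale):
    return [code / scale for code in q]

x = [0.013, -1.7, 0.42, 0.0]
q, s = quantize(x)
x_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(x, x_hat))
# Halving (or quartering) dtype bytes shrinks KV-cache footprint and
# memory traffic proportionally, at the cost of this bounded rounding error.
```

Notice that the error bound is set by amax: one outlier value stretches the grid for the whole tensor, which is exactly why production recipes fuss over per-tensor (or finer-grained) scaling.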
If you are building serious systems, the implication is that you should benchmark like an infrastructure engineer, not like a model demo. Long context, multi-turn sessions, and agentic workloads stress KV cache, cache movement, and concurrency regimes. The wins you will feel in production are increasingly coming from speculative decoding plus smarter cache handling plus kernels that are built around the real bottleneck, which is memory traffic, not raw FLOPs.
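Concretely, that means measuring what users feel: time-to-first-token for the prefill-bound phase and per-token latency percentiles for the decode-bound phase, not a single tokens-per-second number. A minimal harness sketch, with a hypothetical `fake_model` generator standing in for a real streaming client:

```python
import statistics
import time

def benchmark_stream(generate, prompt, n_tokens=32):
    """Measure TTFT (prefill-bound) and inter-token latency percentiles
    (decode/memory-bound) from any iterator of streamed tokens."""
    t0 = time.perf_counter()
    gaps, last, ttft = [], t0, None
    for _i, _tok in zip(range(n_tokens), generate(prompt)):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - t0          # first token: includes prefill
        else:
            gaps.append(now - last)  # steady-state decode gap
        last = now
    return {
        "ttft_s": ttft,
        "itl_p50_s": statistics.median(gaps),
        "itl_p99_s": statistics.quantiles(gaps, n=100)[98],
    }

def fake_model(prompt):  # hypothetical stand-in: slow prefill, fast decode
    time.sleep(0.02)
    while True:
        time.sleep(0.001)
        yield "tok"

stats = benchmark_stream(fake_model, "hello", n_tokens=32)
# For this toy model, ttft_s dwarfs itl_p50_s, as it does in real serving.
```

Run the same harness across concurrency levels and context lengths and you will see the regimes this article describes: TTFT tracks prefill compute, while the inter-token percentiles track KV-cache size and memory traffic.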