Inference and Serving

Quantization, KV cache, batching, and the engineering behind every API call

You have spent five articles building a model from scratch. You understand how tokens enter the network, how self-attention routes information between them, how trillions of tokens of pre-training bake knowledge into the weights, and how fine-tuning reshapes that knowledge into helpful behavior. The result is a model that can answer questions, write code, and reason through complex problems.

But none of that matters if it takes thirty seconds to respond.

Training a frontier model costs hundreds of millions of dollars — a one-time expense spread across millions of users. Inference — the act of generating a response for a single request — happens billions of times a day, and every millisecond of latency and every megabyte of memory translates directly into cost. The techniques in this article determine whether your model costs $0.01 or $1.00 per request, whether users wait 200ms or 20 seconds, and whether a single GPU serves ten users or ten thousand.

This is where engineering meets economics. The model is fixed — the question is how fast and how cheaply you can run it.

01

The Inference Bottleneck

Every time you send a message to ChatGPT, Claude, or any LLM API, the model does not generate your entire response at once. It produces one token at a time — each new token conditioned on every token that came before it. This is the autoregressive loop we first encountered in What Is a Large Language Model?, and it is the fundamental reason inference is hard.

Autoregressive generation — one token at a time

"The capital of France is Paris, a city known for its art and architecture." — each word arrives as its own decode step. In the timing breakdown, prefill accounts for the time to first token (TTFT, ~320 ms here); each subsequent token then takes ~22 ms. Prefill is compute-bound; decode is memory-bound.

But not all tokens are created equal. Inference actually has two distinct phases, and they hit entirely different hardware bottlenecks.

Prefill is the first phase. The model processes your entire input prompt — every token in parallel — building up an internal representation called the KV cache (more on this in section 03). Prefill involves massive matrix multiplications across all input tokens simultaneously. It is compute-bound — limited by how many floating-point operations the GPU can perform per second.

Decode is the second phase — and the bottleneck most users feel. The model generates output tokens one at a time. Each decode step requires reading the model’s billions of parameters from memory, but performs relatively little computation per parameter read. It is memory-bandwidth-bound — limited by how fast data can be streamed from GPU memory (HBM) to the compute units.

How severe is this imbalance? A roofline analysis of Llama 3 405B found that even at batch size 32, the arithmetic intensity during decode is only ~42 FLOPs per byte — far below the GPU’s compute ceiling (Yuan et al., 2024). The GPU’s compute cores sit idle the vast majority of the time, waiting for weights to arrive from memory.
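The imbalance can be checked with a back-of-envelope roofline calculation. A sketch, using assumed H100-class specs (~989 TFLOP/s dense BF16, ~3.35 TB/s HBM bandwidth) and counting only weight reads, which is why it lands near but not exactly on the paper's figure:

```python
# Back-of-envelope roofline check for the decode phase.
# Assumed figures: 405B parameters, BF16 weights, H100-class GPU.
PARAMS = 405e9
BYTES_PER_PARAM = 2            # BF16
PEAK_FLOPS = 989e12            # dense BF16 compute (assumed)
PEAK_BANDWIDTH = 3.35e12       # HBM bandwidth, bytes/s (assumed)

def decode_intensity(batch_size: int) -> float:
    """FLOPs per byte moved in one decode step (weights only;
    counting KV-cache reads would push the number even lower)."""
    flops = 2 * PARAMS * batch_size          # one multiply-add per weight per sequence
    bytes_moved = PARAMS * BYTES_PER_PARAM   # every weight streamed once per step
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to saturate compute
print(f"decode intensity @ batch 32: {decode_intensity(32):.0f} FLOPs/byte")
print(f"ridge point: {ridge:.0f} FLOPs/byte")
```

Even at batch 32, decode sits an order of magnitude below the ridge point, which is exactly the compute-idle picture the roofline analysis describes.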

This is why inference engineers talk about two separate latency metrics:

  • Time-to-first-token (TTFT): How long the user waits before the first token appears. Dominated by prompt length and GPU compute capacity.
  • Tokens-per-second (TPS): How fast subsequent tokens stream out. Dominated by memory bandwidth.
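Both metrics can be roughly estimated from hardware specs. A sketch for a 70B model in FP16 on one H100-class GPU (the peak-compute and bandwidth figures are assumptions; real systems add kernel and scheduling overheads these formulas ignore):

```python
PARAMS = 70e9
BYTES = 2          # FP16
FLOPS = 989e12     # peak dense BF16 compute, H100-class (assumed)
BW = 3.35e12       # HBM bandwidth, bytes/s (assumed)

def ttft_seconds(prompt_tokens: int) -> float:
    """Prefill is compute-bound: ~2 FLOPs per parameter per prompt token."""
    return (2 * PARAMS * prompt_tokens) / FLOPS

def decode_seconds_per_token() -> float:
    """Decode is bandwidth-bound: every weight streamed from HBM per step."""
    return (PARAMS * BYTES) / BW

print(f"TTFT for a 2,000-token prompt: {ttft_seconds(2000) * 1000:.0f} ms")
print(f"Decode: {decode_seconds_per_token() * 1000:.0f} ms/token "
      f"(~{1 / decode_seconds_per_token():.0f} tok/s)")
```

The two answers come from different hardware limits — TTFT from FLOPs, per-token latency from bytes — which is why the two metrics are tracked separately.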

Every technique in the rest of this article targets one or both of these bottlenecks.

02

Quantization: Trading Precision for Speed

If decode is memory-bandwidth-bound, the most direct fix is to make the model smaller. Not by removing parameters — by representing each parameter with fewer bits.

During pre-training, model weights are typically stored in FP16 or BF16 — 16 bits per parameter. A 70-billion-parameter model at FP16 requires roughly 140 GB of memory just for the weights. That already exceeds the capacity of a single H100 GPU (80 GB). Quantization compresses those weights to lower precision — 8 bits, 4 bits, or even less — trading a small amount of numerical accuracy for dramatic memory and speed gains.
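The arithmetic behind these sizes is simple: weight memory is parameter count times bytes per parameter. A quick sketch:

```python
def weight_memory_gb(params: float, bits: int) -> float:
    """Memory for the weights alone (excludes KV cache and activations)."""
    return params * (bits / 8) / 1e9

# A 70B-parameter model at each precision:
for name, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>5}: {weight_memory_gb(70e9, bits):.0f} GB")
```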

Quantization — trading precision for speed (70B model)

  FP16         16-bit   Memory: 140 GB   Speed: 1× (baseline)   Quality: 100%
  FP8           8-bit   Memory: 70 GB    Speed: ~1.8×           Quality: ~99%
  INT8          8-bit   Memory: 70 GB    Speed: ~2×             Quality: ~98%
  INT4 (AWQ)    4-bit   Memory: 35 GB    Speed: ~3–4×           Quality: ~95%

Less data to move = faster decode = more requests per GPU = lower cost per token

FP8 is the conservative choice. Supported natively on H100 and Blackwell GPUs, it halves memory vs FP16 with minimal quality impact (NVIDIA, 2025).

4-bit quantization is where the format wars live. AWQ (Activation-aware Weight Quantization) achieves ~95% quality retention and 741 tokens/second with the Marlin kernel — 60% faster than FP16 inference. GPTQ was the first widely adopted 4-bit format with excellent tooling. GGUF (the llama.cpp format) is optimized for CPU and hybrid inference on consumer hardware (Jarvislabs, 2025).

Quantization is the single most impactful optimization for inference cost. It attacks the memory bottleneck directly — less data to move means faster decode, more requests per GPU, lower cost per token.
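A minimal sketch of the core mechanic — symmetric round-to-nearest INT8 quantization with one per-tensor scale. Production schemes like AWQ and GPTQ use per-channel or per-group scales informed by activation statistics; this toy version only shows the precision trade:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with one shared scale (symmetric)."""
    scale = np.abs(w).max() / 127.0          # largest weight maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4096, 4096).astype(np.float32) * 0.02  # toy weight matrix
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"1 byte per weight instead of 2 (FP16); max abs error: {err:.6f}")
```

The rounding error is bounded by half the scale, which is why quality degrades gently at 8 bits and only becomes noticeable at 4 bits and below.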

03

The KV Cache: Why Memory Scales with Context

Quantization shrinks the model weights. But there is another memory consumer that grows with every token you generate — and it can dwarf the weights themselves.

Recall from The Transformer Architecture that self-attention computes queries, keys, and values for each token. During generation, every new token needs to attend to all previous tokens. The KV cache stores key and value tensors for every token already processed — turning an O(n²) recomputation into an O(n) lookup.

The trade-off is memory. For Llama 3.1 70B, each token adds ~0.31 MB to the cache. At 128K tokens: approximately 40 GB — for a single request (Lyceum Technology, 2025).

Notice that Llama 3.1 70B uses 64 query heads but only 8 key-value heads — thanks to Grouped Query Attention (GQA), which reduces the KV cache by up to 8×. Without it, that 128K-context cache would be ~320 GB per request (IBM, 2025).
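These numbers fall straight out of the KV-cache size formula. A sketch using Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dimension 128, 2-byte BF16 entries):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes added per token: 2x for keys and values,
    per layer, per KV head, per head dimension."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 70B with GQA (8 KV heads) vs. without (one KV head per query head).
with_gqa = kv_bytes_per_token(80, kv_heads=8, head_dim=128)
no_gqa = kv_bytes_per_token(80, kv_heads=64, head_dim=128)

print(f"per token with GQA: {with_gqa / 2**20:.2f} MB")
print(f"128K context with GQA: {with_gqa * 131072 / 2**30:.0f} GB")
print(f"128K context without GQA: {no_gqa * 131072 / 2**30:.0f} GB")
```

Run it and the article's figures reappear: ~0.31 MB per token, ~40 GB at 128K context, and ~320 GB without GQA.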

KV cache scaling — Llama 3.1 70B (BF16, 8 KV heads via GQA)

At 1K context (1,024 tokens), the cache holds keys and values for every token processed:

  Per-token cache:          0.31 MB
  Total cache (1K tokens):  317 MB
  Max concurrent (1 GPU):   145 requests

H100 80 GB memory allocation — model weights: 35 GB (INT4); KV cache: 0.3 GB; free: 44.7 GB.

Longer context = more KV cache = fewer concurrent requests = higher cost per token

04

Batching: From Static Waste to Continuous Flow

A single LLM request barely uses a modern GPU. The decode phase is memory-bandwidth-bound, and a single sequence cannot saturate even a fraction of the GPU’s compute capacity. The obvious fix: process multiple requests simultaneously.

Static batching is the naive approach. Collect a batch of requests, pad them all to the length of the longest sequence, and process them together. When the shortest request finishes, its slot sits idle until the longest completes.

Continuous batching — introduced by the Orca paper (Yu et al., OSDI 2022) — eliminates this waste entirely. As soon as one request finishes, a new one slides into its slot. No padding, no waiting, no batch boundaries.
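The scheduling difference can be sketched in a few lines. A toy iteration-level scheduler in the spirit of Orca (the slot count and per-request lengths are made up):

```python
from collections import deque

def continuous_batch(requests, max_slots=4):
    """Toy iteration-level scheduler: every step decodes one token for each
    active request, and a finished request's slot is refilled immediately."""
    queue = deque(requests)        # (request_id, tokens_to_generate)
    active, steps, done = {}, 0, []
    while queue or active:
        while queue and len(active) < max_slots:   # refill freed slots
            rid, length = queue.popleft()
            active[rid] = length
        for rid in list(active):                   # one decode step for all
            active[rid] -= 1
            if active[rid] == 0:
                done.append(rid)
                del active[rid]                    # slot frees mid-"batch"
        steps += 1
    return steps, done

steps, order = continuous_batch([("A", 3), ("B", 8), ("C", 2), ("D", 5), ("E", 4)])
print(steps, order)   # 8 steps; static batching would need 12 for the same work
```

With static batching, the first batch {A, B, C, D} runs for max(3, 8, 2, 5) = 8 steps before E can even start; here E slides into C's freed slot at step 3.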

Static vs. continuous batching

  Static batching — padded, batch boundaries (~2 req/s): requests A–F are padded to the longest sequence in each batch, and finished requests sit idle until the batch completes — wasted compute.
  Continuous batching — no padding, no gaps (~7 req/s): requests A–G flow through continuously, and every slot stays on active compute.

Orca (OSDI 2022) demonstrated a 36.9× throughput improvement over FasterTransformer with iteration-level scheduling.

But continuous batching alone still has a memory management problem. Sequences grow at different rates, and reserving contiguous memory leads to fragmentation.

PagedAttention — the core innovation of vLLM (Kwon et al., SOSP 2023) — solves this with virtual memory for KV cache. Logical cache blocks map to non-contiguous physical memory via a block table, exactly like an OS page table. The result: under 4% fragmentation, 2–4× higher throughput than naive serving, and 85–92% GPU utilization under high concurrency.
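The block-table idea can be sketched directly. A toy allocator (block size 16 tokens matches vLLM's default; everything else is made up for illustration):

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence's logical cache blocks map
    to arbitrary physical blocks, like an OS page table."""
    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))   # free physical block ids
        self.block_table = {}                          # seq_id -> [physical ids]
        self.lengths = {}                              # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        table = self.block_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:       # current block full: map a new one
            table.append(self.free.pop())  # any free block; no contiguity needed
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Sequence finished: its blocks return to the pool, no fragmentation."""
        self.free.extend(self.block_table.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(40):                  # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token("req-1")
print(cache.block_table["req-1"])    # three physical block ids, any order
```

Because blocks are allocated on demand and returned whole, the only waste is the partially filled last block of each sequence — the source of vLLM's under-4% fragmentation figure.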

05

Speculative Decoding and the Serving Frontier

The optimizations so far all accept the fundamental constraint of autoregressive generation: one token at a time. Speculative decoding challenges that constraint.

The idea, proposed independently by two groups in 2023 (Leviathan et al., ICML 2023; Chen et al., 2023), is elegant: use a small, fast draft model to predict several tokens ahead, then have the large target model verify the entire draft in a single forward pass. Verification is cheaper than generation because the target model can check all draft tokens in parallel.
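The accept/reject loop can be sketched with greedy verification. (Real implementations use a rejection-sampling rule that keeps the target's sampling distribution exact; both "models" below are stand-in functions for illustration.)

```python
def speculative_step(draft_model, target_model, prefix, k=5):
    """Draft k tokens cheaply, verify them with one (batched) target pass,
    keep the longest agreeing prefix plus one corrected token."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                  # k cheap draft-model calls
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # One target pass scores every draft position in parallel (simulated
    # here with a list comprehension).
    targets = [target_model(list(prefix) + draft[:i]) for i in range(k)]
    accepted = []
    for d, t in zip(draft, targets):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)          # first mismatch: take the target's token
            break
    return accepted                     # 1..k tokens per large-model pass

# Stand-in "models": the target always continues 1, 2, 3, ...;
# the draft agrees until the context reaches length 4, then guesses wrong.
target = lambda ctx: len(ctx) + 1
drafter = lambda ctx: len(ctx) + 1 if len(ctx) < 4 else 0
print(speculative_step(drafter, target, prefix=[1, 2]))   # -> [3, 4, 5]
```

Note the payoff: even on a rejection, the verification pass yields one correct token, so a large-model pass never produces fewer tokens than standard decoding.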

Speculative decoding — draft and verify

Standard decoding generates "The quick brown fox jumped" in 5 forward passes of the large model — one per token. Speculative decoding drafts all 5 tokens in 1 small-model pass, then verifies them in 1 large-model pass. If 4 are accepted and 1 rejected, generation resumes from the rejection point — net: 4 tokens for 2 passes instead of 4.

Speedup benchmarks:

  Original (Leviathan et al., 2023):  2–3×
  Speculative Streaming (2024):       1.8–3.1×
  Mirror-SD (Apple, 2025):            2.8–5.8×

Zero quality loss — the output distribution is mathematically identical to standard decoding.

The inference frontier extends beyond speculative decoding. Disaggregated serving separates prefill and decode onto different hardware — using compute-dense GPUs for prefill and memory-bandwidth-optimized hardware for decode. Prefix caching (RadixAttention in SGLang) reuses KV cache across requests that share the same system prompt — particularly powerful for RAG workloads where many requests share retrieved context.
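Prefix caching rests on a longest-shared-prefix lookup. A toy version (SGLang's RadixAttention uses a radix tree over token blocks; this dict-based sketch only shows the idea):

```python
def longest_cached_prefix(cache, tokens):
    """Return (reusable, to_prefill): the longest prefix of `tokens` whose
    KV entries are already cached, and the suffix that still needs prefill."""
    for i in range(len(tokens), 0, -1):       # longest match first
        if tuple(tokens[:i]) in cache:
            return tokens[:i], tokens[i:]
    return [], tokens

cache = {}
system_prompt = [101, 102, 103, 104]          # shared across many requests
cache[tuple(system_prompt)] = "kv-for-system-prompt"   # stand-in for KV tensors

request = system_prompt + [7, 8, 9]           # same system prompt, new question
hit, to_prefill = longest_cached_prefix(cache, request)
print(len(hit), len(to_prefill))              # 4 tokens reused, only 3 prefilled
```

For RAG workloads where a long retrieved context is shared across requests, the reused prefix can be thousands of tokens, cutting prefill cost — and TTFT — accordingly.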

06

The Serving Stack: Choosing a Framework

All of these optimizations — quantization, KV cache management, continuous batching, speculative decoding — are implemented by inference serving frameworks. The landscape in 2026 has consolidated around a few major options.

The serving stack — framework comparison

  vLLM — general-purpose production serving
  SGLang — multi-turn chat, RAG, shared-context workloads
  TensorRT-LLM — single-model production at max throughput
  llama.cpp / Ollama — local inference, privacy, developer experimentation

The real insight is that these frameworks are converging. PagedAttention, continuous batching, and speculative decoding are now table stakes. The differentiation is in the details: prefix caching strategies, quantization format support, multi-GPU parallelism, and ease of deployment. For most production workloads, the framework matters less than the optimization stack you enable within it.

Inference optimization handles the “how fast and how cheap” question. But the most impactful lever for LLM quality is something that does not require any infrastructure changes at all: the prompt. How you frame a question, what examples you provide, and what constraints you set can change the output more than any hardware upgrade.

Article 7 covers prompt engineering — the techniques, patterns, and mental models for getting the most out of any LLM.