$ miric.dev

Article

Tokenization and the Input Pipeline

How LLMs turn raw text into numbers — and why the way you split matters

01

Why Tokenization Matters More Than You Think

In What Is a Large Language Model?, we established the core idea: an LLM predicts the next token. But what is a token? Not a word. Not a character. Something in between — and the choice of where to draw the lines has consequences that ripple through every layer of the system.

An LLM does not see text. It sees integers. Before the model can process the sentence “The cat sat on the mat,” something has to convert those characters into a sequence of numbers. That something is the tokenizer — and it is the single most underrated component in the entire stack.

Why underrated? Because tokenization determines three things you care about every day:

Cost. LLM APIs charge per token. If your tokenizer turns a sentence into 20 tokens instead of 15, you pay 33% more. At scale — millions of API calls per day — that difference is the gap between a viable product and a budget overrun. GPT-5 charges $1.25 per million input tokens and $10.00 per million output tokens (OpenAI, 2026). Every unnecessary token is money on fire.
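The cost sensitivity is easy to make concrete. A back-of-the-envelope sketch in Python, using the input price quoted above — the call volume and token counts are hypothetical illustrations, not measurements:

```python
# Price quoted above: USD per million input tokens.
PRICE_IN = 1.25

def daily_input_cost(calls_per_day, tokens_per_call):
    """Rough daily spend on input tokens alone (hypothetical call shape)."""
    return calls_per_day * tokens_per_call * PRICE_IN / 1_000_000

# The same prompt, tokenized to 15 tokens by one model and 20 by another,
# at one million calls per day:
lean = daily_input_cost(1_000_000, 15)      # $18.75/day
bloated = daily_input_cost(1_000_000, 20)   # $25.00/day — 33% more
print(f"${lean:.2f} vs ${bloated:.2f} per day")
```

Output tokens, at 8× the price here, dominate real bills — but the input-side multiplier compounds with every call.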

Context window. Models have a fixed token budget — 128K for GPT-4o, 200K for Claude (up to 1M for the latest Opus model), up to 10 million for Llama 4 Scout. The context window is measured in tokens, not characters or words. An inefficient tokenizer wastes that budget on overhead, leaving less room for the content that actually matters.

What the model can “see.” The tokenizer decides the atomic units of perception. If “unhappiness” is a single token, the model sees it as one concept. If it is split into [“un”, “happiness”], the model can potentially reason about the negation prefix. If it is fragmented into [“un”, “hap”, “pin”, “ess”] — as happens with some tokenizers for some scripts — the model is working with near-meaningless fragments.

Here is the part that surprises most people: the same sentence tokenizes differently depending on which model you use. “Tokenization is surprisingly important” might be 4 tokens in one model and 6 in another. And the differences are not random — they reflect deep design choices about vocabulary size, training data, and which languages the creators prioritized.

Input text: "The quick brown fox jumps over the lazy dog"

GPT-4o (o200k) — vocab ~200K → 9 tokens

Llama 3 (128K) — vocab 128K → 9 tokens

Mistral Tekken (131K) — vocab 131K → 9 tokens

The comparison above makes the problem concrete. The same text, processed by three different tokenizers, yields different token counts — sometimes dramatically so, especially for non-English languages and code. This is not a trivia fact. It is a cost multiplier, a context window constraint, and a quality signal all at once.

02

From Characters to Subwords

The history of tokenization is a story about finding the right granularity. Too fine-grained, and sequences are impossibly long. Too coarse, and the vocabulary explodes. The field converged on subwords — but it took decades to get there.

Character-level tokenization is the simplest approach. Every letter, digit, and punctuation mark is its own token. The vocabulary is tiny — 128 entries for ASCII, 256 for full byte coverage, or a few thousand for broad Unicode coverage. The problem? Sequences become absurdly long. The word “tokenization” is 12 characters, meaning the model needs 12 sequence positions just to represent one word. Worse, individual characters carry almost no semantic information. The letter “t” means nothing by itself. The model has to learn, from scratch, that “t-o-k-e-n” means something — an enormous burden on the attention mechanism.

Word-level tokenization is the opposite extreme. Every word is its own token. “Tokenization” is a single token. Sequences are short and semantically rich. The problem? The vocabulary explodes. The English language has over 170,000 words in current use, plus proper nouns, technical jargon, compound words, misspellings, and every other language on the planet. A word-level vocabulary for a multilingual model would need millions of entries — and the embedding table (which maps each token to a vector) would consume more memory than the rest of the model. Worse, any word not in the vocabulary becomes an “unknown” token — a dead end where all meaning is lost.

Subword tokenization is the sweet spot. Instead of splitting at character boundaries or word boundaries, split at subword boundaries. Common words stay whole: “the”, “and”, “is” are single tokens. Rare words get decomposed: “unhappiest” becomes [“un”, “happi”, “est”] — fragments that are morphologically meaningful. The prefix “un-” appears in thousands of words, so the model sees it constantly and learns its meaning. The suffix “-est” is a superlative marker. Even a word the model has never seen before can be understood through its parts.

Why did subwords win? Three reasons:

  1. Controlled vocabulary size. You pick a target — 32K, 100K, 128K tokens — and the training algorithm stops when it reaches that size. No vocabulary explosion.
  2. No unknown tokens. Any input can be encoded by falling back to smaller subwords or individual characters. Modern byte-level BPE implementations can encode literally anything — any script, any emoji, any binary sequence — because the base vocabulary includes all 256 byte values (Hugging Face, 2024).
  3. Morphological structure. Subwords often align with meaningful word parts: prefixes, suffixes, stems. This gives the model a compositional handle on language that neither character-level nor word-level approaches provide.
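The byte-level fallback in point 2 is easy to see directly: any Unicode string reduces to UTF-8 bytes, each of which is one of the 256 base tokens. A minimal sketch:

```python
# Any text — accents, emoji, CJK — decomposes into bytes in the 0–255 range,
# so a byte-level vocabulary can never hit an "unknown" token.
text = "Zürich 🚀"
byte_ids = list(text.encode("utf-8"))

assert all(0 <= b <= 255 for b in byte_ids)
# The mapping is lossless: decoding the bytes recovers the original string.
assert bytes(byte_ids).decode("utf-8") == text
print(byte_ids)
```

Note the cost: “ü” takes 2 bytes and the emoji takes 4, so byte-level fallback trades coverage for sequence length — which is exactly what the learned merges claw back.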

The four main subword algorithms are BPE (Byte Pair Encoding), WordPiece, Unigram, and SentencePiece — though SentencePiece is really a framework that wraps BPE or Unigram with language-agnostic preprocessing. BPE is by far the dominant approach in modern LLMs, and it is the subject of the next section.

“The unhappiest” tokenized three ways

Character — vocab ~256 (ASCII)

T·h·e· ·u·n·h·a·p·p·i·e·s·t → 14 tokens

+ Tiny vocabulary · + Handles any input
− Absurdly long sequences · − No semantic info per token

Word — vocab 170K+ (English only)

The · unhappiest → 2 tokens

+ Short sequences · + Semantically rich
− Vocabulary explosion · − Unknown words are lost

Subword (BPE) — vocab 32K–200K (configurable)

The · un · happi · est → 4 tokens (un = negation prefix, happi = stem, est = superlative suffix)

+ Controlled vocab · + No unknown tokens · + Morphological structure
− Requires training on corpus

03

BPE Step by Step

Byte Pair Encoding was invented in 1994 by Philip Gage — not for language models, but for data compression (Gage, 1994). The original algorithm was simple: find the most common pair of adjacent bytes in a file, replace every occurrence with a new byte, and repeat. Twenty-two years later, Sennrich, Haddow, and Birch (2016) adapted the same idea for neural machine translation — and it became the foundation of how every modern LLM reads text.

Here is how it works, step by step.

Start with characters. Take a small training corpus — say, four words with frequencies. Each word is split into individual characters, with a special end-of-word marker (_). The initial vocabulary is just the set of unique characters.

Count pairs. Scan the corpus and count every adjacent pair. The pair (“l”, “o”) appears 7 times (from “low” and “lower”). The most frequent pair wins.

Merge. Replace every occurrence of the winning pair with a new token. The vocabulary grows by one. The corpus gets shorter.

Repeat until you reach the target vocabulary size — 32K, 100K, 128K, whatever the model designers chose.
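The loop above is short enough to implement directly. A minimal, unoptimized sketch in plain Python, run on the four-word corpus from this section (the end-of-word marker is written as `_`):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [(list(word) + ["_"], freq) for word, freq in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

merges, corpus = train_bpe({"low": 5, "lower": 2, "newest": 6, "new": 3}, 8)
print(merges)   # first merges: ('n','e'), then ('ne','w'), then ('l','o'), ...
```

On this corpus the first merge is ("n", "e") — it appears 9 times, in “newest” (×6) and “new” (×3) — which then chains into “ne” + “w” → “new” on the very next step.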

Step 0 of 8 — initial state

Corpus: low ×5 · lower ×2 · newest ×6 · new ×3

Vocabulary (9 entries): l, o, w, e, r, n, s, t, _

Vocab size: 9 · Total tokens: 74 · 100% of original

The connection to compression is not a coincidence. BPE literally compresses the corpus: each merge shortens the encoded sequences. A well-trained BPE tokenizer can reduce a corpus to 15–25% of its character-level length. The compression ratio is a direct measure of how efficiently the tokenizer captures the statistical structure of the training data.

What about byte-level BPE? The version above starts from characters. Modern implementations — GPT-2, GPT-3, GPT-4, DeepSeek V3, Qwen3 — start from bytes instead (OpenAI tiktoken). The base vocabulary is all 256 byte values, which means any input can be encoded, no matter what script or symbol it contains. There is never an “unknown” token.

At encoding time, the learned merges are replayed: the tokenizer splits the input into characters (or bytes), then repeatedly applies the highest-priority merge present — in the order the merges were learned — until no more apply. The result is a sequence of token IDs — integers that the model can process.

The beauty of BPE is its simplicity. There is no linguistic knowledge baked in — no dictionary, no grammar rules, no morphological analyzer. It is a purely statistical process that discovers structure from data. And yet the tokens it produces often align with morphological boundaries, because morphemes tend to be frequent substrings. The algorithm does not know that “un-” is a prefix. It just notices that “un” appears often enough to merit its own token.
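Encoding can be sketched the same way: replay the merge list, in learned order, over a character split. The merge list below is hand-picked for illustration, not taken from any real tokenizer:

```python
# A tiny hand-picked merge list. Order matters: it is the training order.
MERGES = [("n", "e"), ("ne", "w"), ("e", "s"), ("es", "t"), ("est", "_")]

def bpe_encode(word, merges):
    """Split into characters + end-of-word marker, then replay each merge."""
    symbols = list(word) + ["_"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_encode("newest", MERGES))  # → ['new', 'est_']  (a familiar word compresses well)
print(bpe_encode("newt", MERGES))    # → ['new', 't', '_']  (an unseen word falls back to pieces)
```

The fallback behavior is the point: “newt” was never in the training data, yet it encodes without any unknown token — just less compactly.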

04

Special Tokens and the Vocabulary

BPE builds the core vocabulary, but a tokenizer needs more than subwords. Every model adds a set of special tokens — control signals that frame and structure the input. These tokens are not learned from data. They are hand-designed, and they are invisible to the end user but critical to the model’s behavior.

The most universal special tokens:

  • BOS (Beginning of Sequence): Signals the start of a new input. Tells the model “this is a fresh context, not a continuation.” In Llama 3, this is <|begin_of_text|>.
  • EOS (End of Sequence): Signals the end of generation. When the model outputs this token, generation stops. In Llama 3, this is <|end_of_text|>. A related token, <|eot_id|>, marks the end of a single turn in a conversation — a subtle but important distinction for fine-tuning and multi-turn inference.
  • PAD (Padding): Used to align sequences to the same length in a batch. The model learns to ignore these.
  • UNK (Unknown): A fallback for inputs not in the vocabulary. Modern byte-level tokenizers have effectively eliminated this — if the vocabulary includes all 256 byte values, nothing is “unknown.”

But the most consequential special tokens are the ones that define chat templates. When you send a message to ChatGPT or Claude, you are not sending raw text. The API wraps your message in a structured format with special tokens marking who is speaking and what role they play.

These structural differences matter. Fine-tuning a model requires matching its expected chat template exactly. Using the wrong template degrades chat quality or confuses the model about who is speaking (Hugging Face, 2024). Multi-model systems that route between different LLMs need to handle template translation. And the special tokens themselves consume context window space — a system prompt with role markers might add 50–100 tokens of overhead before a single word of actual content.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>What is tokenization?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Special tokens: 11
Content tokens: 8

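The Llama 3 template above is mechanical enough to build by hand. A minimal sketch — simplified (the production template also inserts newlines around headers), and `format_chat` is an illustrative helper, not a library function:

```python
def format_chat(messages):
    """Wrap (role, content) pairs in Llama 3-style special tokens."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>{content}<|eot_id|>")
    # End with an open assistant header: the model generates its reply from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>")
    return "".join(parts)

prompt = format_chat([
    ("system", "You are a helpful assistant."),
    ("user", "What is tokenization?"),
])
print(prompt)
```

In practice you should use the tokenizer's own chat template rather than hand-rolling one — as noted above, a mismatched template degrades quality.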

Vocabulary size trade-offs. The number of subword tokens in the vocabulary is one of the most important design decisions, and the industry trend is clear: bigger is better. The progression from 30K to 200K happened in just six years. Why? Because the embedding table — the lookup that maps each token ID to a vector — is tiny relative to the model’s total parameters. A 200K vocabulary with 4,096-dimensional embeddings requires ~800M parameters for the embedding layer. For a model with 400B+ total parameters, that is 0.2% of the budget. The cost is negligible, and the benefit is substantial: fewer tokens means shorter sequences, faster inference, lower cost, and better multilingual performance.
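The arithmetic behind that 0.2% figure is a one-liner:

```python
# Embedding table size: one row per token, one column per hidden dimension.
vocab_size = 200_000
hidden_dim = 4_096
embedding_params = vocab_size * hidden_dim   # 819,200,000 ≈ 0.8B
share = embedding_params / 400e9             # against a 400B-parameter model
print(f"{embedding_params / 1e6:.0f}M params, {share:.2%} of the total")
# → 819M params, 0.20% of the total
```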

The multilingual improvement is where larger vocabularies make the biggest difference. Llama 2’s 32K vocabulary encoded Hindi at roughly 4–5 tokens per word. Llama 3’s 128K vocabulary cut that significantly. GPT-4o’s 200K vocabulary brought Hindi from 4.19 tokens/word (cl100k) down to 1.89 tokens/word — a 55% reduction (Microsoft, 2024). But even after that improvement, Hindi still requires 63% more tokens than English (~1.16 tokens/word on o200k_base). The tokenization gap has narrowed, but it has not closed.

This disparity has a name: the token tax. A 2025 study evaluating 10 LLMs on African languages found that token fertility (tokens per word) reliably predicts accuracy — higher fertility consistently means lower accuracy across all models and subjects. Doubling fertility leads to a 4× increase in training cost and inference latency (arXiv:2509.05486). The tokenizer is not just a preprocessing step. It is a structural advantage for some languages and a structural penalty for others.

05

From Tokens to Embeddings

Tokenization converts text into a sequence of integers. But integers are not useful to a neural network — you cannot meaningfully multiply or compare token ID 4,721 and token ID 51,883. The model needs vectors: dense, continuous lists of numbers where mathematical operations correspond to semantic relationships.

This is the job of the embedding layer — and if you have read What Are Vector Embeddings?, the concept will be familiar. The embedding layer is a lookup table: a giant matrix with one row per token in the vocabulary. Token ID 4,721 maps to row 4,721 — a vector of 4,096 or 12,288 numbers (depending on the model’s hidden dimension). That vector is the model’s learned representation of what that token “means.”

The numbers are not hand-crafted. They are learned during pre-training, adjusted by billions of gradient updates until tokens that appear in similar contexts end up with similar vectors. The word “cat” and the word “kitten” — if they are single tokens — will have embedding vectors that point in roughly the same direction, because they appear in overlapping contexts in the training data. This is the same principle behind Word2Vec and the famous “king − man + woman = queen” analogy (Mikolov et al., 2013) — but now operating at the subword level, inside a model with billions of parameters.

To put concrete numbers on it: GPT-3’s embedding table has shape [50,257 × 12,288] — 50,257 tokens, each mapped to a 12,288-dimensional vector. That is 617 million parameters just for the embedding layer. Llama 3, with its 128K vocabulary and 8,192-dimensional embeddings (for the 70B model), has an embedding table of roughly 1 billion parameters. These are the model’s first learned parameters — and the last, too, since the same matrix (or a separate “unembedding” matrix) is used at the output to project back to vocabulary probabilities.
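Mechanically, the lookup itself is trivial — a sketch with toy sizes, and random numbers standing in for the learned weights:

```python
import random

random.seed(0)
vocab_size, hidden_dim = 1_000, 8   # toy sizes; real models use e.g. 128K × 8,192

# The embedding table: one row per token ID. In a real model these values
# are learned during pre-training; here they are random for illustration.
table = [[random.gauss(0.0, 0.02) for _ in range(hidden_dim)]
         for _ in range(vocab_size)]

token_ids = [976, 415, 521]               # illustrative tokenizer output
vectors = [table[t] for t in token_ids]   # the entire "embedding layer" at work

assert len(vectors) == len(token_ids)
assert all(len(v) == hidden_dim for v in vectors)
```

That is all an embedding layer does at inference time: index into a matrix. The intelligence is in the values of the rows, not the mechanism.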

But embeddings alone are not enough. The sentence “dog bites man” has a very different meaning from “man bites dog.” If you just look up the embeddings for each token, the two sentences produce the same set of vectors — just in a different order. The model needs to know that order matters.

This is the role of positional encoding: a mechanism that injects information about where each token sits in the sequence. The original transformer (Vaswani et al., 2017) used fixed sinusoidal functions — sine and cosine waves at different frequencies — to encode absolute positions. GPT-2 and GPT-3 replaced those with learned positional embeddings: trainable vectors, one per position, added to the token embeddings during training.

Modern LLMs have moved to a more elegant solution: Rotary Position Embeddings (RoPE), introduced by Su et al. (2021). Instead of adding a positional vector, RoPE rotates the query and key vectors in the attention mechanism by position-dependent angles. The key insight: the dot product between two RoPE-rotated vectors depends only on the difference in their positions, not the absolute positions. This makes RoPE inherently relative — the model naturally learns that “the word 3 positions ago” matters, regardless of whether that is position 5 or position 5,000. RoPE is used by Llama 3, Llama 4, DeepSeek V3, Qwen3, Mistral, and virtually every other modern open-weight LLM. It is also the mathematical foundation for context window extension techniques — the reason some models can handle millions of tokens is partly because RoPE’s rotational structure scales gracefully to sequence lengths far beyond the training-time maximum.
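The relative-position property is easy to verify numerically. A minimal RoPE sketch in plain Python — toy 4-dimensional vectors, rotating consecutive pairs of dimensions by position-dependent angles with the standard base of 10,000:

```python
import math

def rotate(vec, pos, theta=10_000.0):
    """RoPE-style rotation: each pair of dims turns by a position-dependent angle."""
    out = []
    for i in range(0, len(vec), 2):
        angle = pos / (theta ** (i / len(vec)))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.2]

# The dot product depends only on the position *difference* (3 in both cases),
# not on the absolute positions:
d1 = dot(rotate(q, 5), rotate(k, 2))
d2 = dot(rotate(q, 105), rotate(k, 102))
assert abs(d1 - d2) < 1e-6
```

Shifting both tokens by 100 positions leaves their attention score untouched — which is exactly the invariance that makes RoPE inherently relative.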

Raw Text

"The cat sat"

Tokenizer

Splits into subword tokens → integer IDs

Embedding Lookup

Each token ID maps to a dense vector (e.g. 4,096 dims)

The → ID 976
cat → ID 8415
sat → ID 3521

Each token ID selects one row of the embedding matrix — a 4,096-dimensional vector per token.

Positional Encoding (RoPE)

Position-dependent rotation — encodes token order

Transformer Blocks

Deep dive in the next article in this series

These vectors are the bridge between discrete text and continuous mathematics. Everything before this point is symbolic: characters, subwords, integer IDs. Everything after this point is geometric: vectors in high-dimensional space, transformed by attention and feed-forward layers. The embedding layer is where the phase transition happens.

And if the connection to distance metrics is not obvious yet, consider this: two tokens with similar embeddings are “close” in the embedding space, measured by exactly the metrics — cosine similarity, dot product, Euclidean distance — that we explored in that earlier article. The same mathematical framework that powers vector search and RAG is operating, at a smaller scale, inside the very first layer of every LLM.

You now have a sequence of vectors — one per token — carrying both the meaning of each subword and its position in the sequence. This is the input the transformer consumes. But what happens inside?

The answer is the self-attention mechanism: a process where every token looks at every other token and decides which ones matter most for the prediction ahead. It is the single most important innovation in modern AI, and it is the subject of The Transformer Architecture.