What Is a Large Language Model?
The surprisingly simple idea behind the most powerful AI systems ever built
01
The One-Sentence Definition
Strip away the hype, the fundraising decks, the apocalyptic op-eds. A large language model does one thing: it takes a sequence of tokens and predicts what comes next.
That is the entire trick. GPT-5, Claude, Llama, Gemini — every frontier model you have heard of is, at its core, a next-token prediction engine. You feed it a sequence of text fragments (called tokens — more on those in the next article in this series), and it outputs a probability distribution over its vocabulary: a ranked list of every possible continuation, each with a confidence score.
So why does something so simple produce systems that can write code, explain quantum mechanics, and draft legal contracts? The answer is composition. One prediction is trivial. Thousands of predictions chained together — each one conditioned on every token that came before it — produce emergent behavior that no one explicitly programmed.
A brief history of “predict the next word.” The idea is older than you might think. N-gram models in the 1990s predicted the next word by counting how often specific word sequences appeared in a corpus. Bengio et al. (2003) replaced the counting with a neural network and learned distributed representations of words — the first neural language model. A decade later, Mikolov et al. (2013) showed that these learned representations captured remarkable semantic structure (the famous “king − man + woman = queen” analogy). Then, in 2017, Vaswani et al. introduced the transformer architecture — and everything changed. The transformer could process all tokens in parallel rather than one at a time, unlocking the ability to train on vastly more data. Every modern LLM descends from that single paper.
[Figure: for the input prompt "The capital of France is", a bar chart of the top-10 predicted next tokens with their probabilities.]
The bar chart above makes the mechanism visceral. The model is not “thinking.” It is not “understanding.” It is assigning probabilities to continuations. But the probabilities are so well-calibrated — trained on trillions of tokens of human text — that the output looks like understanding. Whether it is understanding is a philosophical question. Whether it is useful is an engineering one — and the answer is unambiguously yes.
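As a toy illustration of that ranked list: a sketch of how raw model scores (logits) become a probability distribution via softmax. The five-token vocabulary and the logit values here are invented for the example; a real model scores every entry in a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution that sums to 1."""
    m = max(logits.values())                      # subtract max for numerical stability
    exps = {tok: math.exp(x - m) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Invented logits for continuations of "The capital of France is"
logits = {" Paris": 9.1, " the": 5.2, " located": 4.8, " a": 4.1, " known": 3.5}
probs = softmax(logits)

# Rank every candidate continuation by probability
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for tok, p in ranked:
    print(f"{tok!r}: {p:.3f}")
```

Run it and the distribution concentrates almost all of its mass on " Paris", which is exactly the shape the bar chart shows.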
02
Scale Changes Everything
The first GPT model, released by OpenAI in 2018, had 117 million parameters. GPT-2 followed in 2019 with 1.5 billion. GPT-3, in 2020, jumped to 175 billion — a 100x leap that stunned the field. The scaling did not stop. Meta’s Llama 3.1, released in 2024, pushed to 405 billion parameters in a single dense model (Meta AI, 2024). Then the architecture shifted: DeepSeek V3 packed 671 billion total parameters with only 37 billion active per token, using 256 routed experts (DeepSeek, 2024). Meta’s Llama 4, released in April 2025, embraced the same approach — Maverick uses 400 billion total parameters across 128 experts with just 17 billion active per token, plus a one-million-token context window; Scout fits 109 billion total into 16 experts with an industry-leading ten-million-token context (Meta AI, 2025). By August 2025, OpenAI’s GPT-5 unified multiple specialized sub-models behind a mixture-of-experts router, and in November, Google’s Gemini 3 Pro shipped as a sparse MoE transformer with a million-token context window (OpenAI, 2025; Google, 2025).
The numbers are dizzying. But the real story is not about parameter counts — it is about what happens when you scale up.
Emergent capabilities are abilities that appear at scale but are absent in smaller models. Wei et al. (2022) documented dozens of examples: few-shot learning, chain-of-thought reasoning, code generation, multi-step arithmetic. A 1-billion-parameter model cannot do multi-digit multiplication. A 100-billion-parameter model can — and nobody explicitly taught it how. The capability emerged from the training process at sufficient scale.
Recent research complicates this picture. A 2025 survey by Li et al. notes that emergence aligns more closely with pre-training loss than with raw parameter count — smaller models trained more thoroughly can match larger ones on specific capabilities. And the advent of reasoning-focused models like OpenAI’s o3 and DeepSeek-R1, which use reinforcement learning and search-based inference, has revealed a second scaling axis: test-time compute. Instead of making the model bigger, you let it think longer at inference time. OpenAI’s o3 achieved 87.5% on the ARC-AGI-1 benchmark using high-compute inference, compared to GPT-4o’s sub-10% — not by being bigger, but by spending more computation per answer (ARC Prize, 2024). By late 2025, GPT-5.2 Pro crossed 90% on the same benchmark (OpenAI, 2025). The harder ARC-AGI-2 benchmark, designed to remain challenging, still stumps frontier models at around 3% while humans score 60% — a reminder that scale alone does not produce general intelligence (ARC Prize, 2025).
Scaling laws formalize the relationship between compute, data, parameters, and performance. Kaplan et al. (2020) established that loss follows a power-law curve as you increase any of the three inputs — more compute, more data, and more parameters all yield predictably lower loss. Hoffmann et al. (2022) refined this with the “Chinchilla” finding: the compute-optimal ratio is roughly 20 tokens per parameter. But the industry has since moved well beyond Chinchilla-optimal training. Llama 3.1’s 8B model was trained on 15 trillion tokens — a ratio of 1,875 tokens per parameter, nearly 100x the Chinchilla point (Meta AI, 2024). Qwen3-0.6B pushed the ratio to 60,000:1, training on 36 trillion tokens across 119 languages (Qwen3 Technical Report, 2025). The reasoning: smaller models trained on vastly more data are cheaper to serve at inference time, even if training costs more up front. When you expect billions of inference requests, over-training pays for itself.
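The tokens-per-parameter arithmetic above is easy to check, using only the figures already cited in this section:

```python
# Chinchilla-optimal ratio from Hoffmann et al. (2022): ~20 tokens per parameter
chinchilla_ratio = 20

# Training-data / model-size ratios from the article's examples
llama31_8b = 15e12 / 8e9     # Llama 3.1 8B: 15T tokens / 8B params
qwen3_06b = 36e12 / 0.6e9    # Qwen3-0.6B: 36T tokens / 0.6B params

print(f"Llama 3.1 8B: {llama31_8b:,.0f} tokens/param "
      f"(~{llama31_8b / chinchilla_ratio:.0f}x Chinchilla)")
print(f"Qwen3-0.6B:   {qwen3_06b:,.0f} tokens/param "
      f"(~{qwen3_06b / chinchilla_ratio:,.0f}x Chinchilla)")
```

The Llama ratio works out to 1,875 tokens per parameter, roughly 94x the Chinchilla point, which is the "nearly 100x" figure quoted above.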
Today, scaling means three things simultaneously: more parameters, more training data, and more inference-time computation. The field has moved from a single lever to a three-dimensional optimization problem — and the interplay between these axes is where the most interesting work is happening.
The scaling staircase — 2018 to 2025:

- 2018 · GPT-1: demonstrated pre-training works for NLP
- 2019 · GPT-2: coherent multi-paragraph text generation
- 2020 · GPT-3: few-shot learning without fine-tuning
- 2023–24 · Llama: open-weight models match proprietary
- 2024 · Llama 3.1 405B: largest open-weight dense model
- 2024 · DeepSeek V3: MoE efficiency breakthrough
- 2025 · Llama 4: native multimodality + 1M context
- 2025 · GPT-5: unified multi-model routing, ~45% fewer hallucinations
- 2025 · Gemini 3 Pro: sparse MoE + 1M context + multimodal
- 2025 · Kimi K2: trillion-parameter open-weight MoE
- 2025 · GPT-5.2 Pro: first model to cross 90% on ARC-AGI-1
03
How LLMs Differ from Traditional Software
If you come from a software engineering background, LLMs will feel alien. Traditional software is deterministic: the same input always produces the same output. An LLM is probabilistic — the same prompt can produce a different response every time you run it.
Why? Because the model outputs a probability distribution, not a single answer. It assigns a probability to every token in its vocabulary, then samples from that distribution. The token it picks depends on a parameter called temperature. At temperature 0, the model always picks the highest-probability token — deterministic and repetitive. At temperature 1, it samples proportionally to the probabilities — creative but occasionally erratic. At temperature 2, the distribution is flattened toward uniform — chaotic and usually incoherent. Temperature is the “creativity dial,” and it is one of the first things you tune in any LLM application.
Beyond temperature, two other sampling strategies shape the output. Top-k sampling restricts the model to the k most probable tokens at each step — if k is 50, the other tens of thousands of vocabulary entries are ignored. Top-p (nucleus) sampling is more adaptive: it includes the smallest set of tokens whose cumulative probability exceeds a threshold p (typically 0.9 or 0.95), so the pool size varies dynamically based on how confident the model is.
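All three knobs can be sketched in a few lines of plain Python. The function below is our own illustration, not any particular library's API; production inference engines implement the same logic over GPU tensors.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits with temperature, top-k, and top-p."""
    # Temperature: rescale logits before softmax (lower = sharper, higher = flatter).
    scaled = [x / max(temperature, 1e-8) for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept

    # Renormalize over the surviving tokens and draw one.
    mass = sum(probs[i] for i in order)
    r = rng.random() * mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```

Note how the filters compose: temperature reshapes the distribution, then top-k or top-p trims its tail before sampling. At a temperature near zero the highest-probability token always wins, which recovers deterministic (greedy) decoding.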
This probabilistic nature has consequences that ripple through everything you build.
No explicit rules. A traditional program follows hand-coded logic: if temperature > 100 then alert(). An LLM has no rules. Its behavior emerges from the statistical patterns encoded in billions of learned weights. You cannot open the code and find the line that makes it good at Python. The capability is distributed across the parameters in ways that are not directly interpretable.
Failure modes are different. Traditional software crashes, throws exceptions, returns error codes. LLMs hallucinate — they generate plausible-sounding text that is factually wrong, confidently citing papers that do not exist or inventing statistics wholesale. This is not a bug in the implementation. It is an intrinsic property of next-token prediction: the model is optimizing for plausibility, not truth. GPT-5 reduced hallucination rates significantly — approximately 45% fewer factual errors than GPT-4o — but did not eliminate them (OpenAI, 2025). (This is exactly the problem that RAG was invented to mitigate — by grounding generation in retrieved facts.)
Debugging is different. When a traditional program produces wrong output, you trace through the execution path, inspect variables, set breakpoints. When an LLM produces wrong output, you… change the prompt and try again. There is no stack trace. There are no variables to inspect. The “execution path” is a forward pass through billions of matrix multiplications. Debugging LLMs is more like coaching a brilliant but unreliable colleague than like fixing a deterministic machine.
| Dimension | Traditional Software | LLM | Why it matters |
|---|---|---|---|
| Input / Output | Structured data (JSON, SQL, typed args) | Natural language (any text) | Traditional APIs demand rigid schemas. LLMs accept freeform text and produce freeform text — flexible but unpredictable. |
| Logic | Explicit rules, hand-coded | Learned weights, emergent from data | You cannot open an LLM and find the line of code that makes it good at Python. The capability is distributed across billions of parameters. |
| Determinism | Same input = same output | Probabilistic — same input, different output | LLMs sample from a probability distribution. Temperature controls randomness: 0 is deterministic, 1 is creative, 2 is chaos. |
| Failure Mode | Crash, exception, error code | Hallucination — confident, wrong | When an LLM lacks information, it does not throw an error. It invents plausible-sounding text that may be completely false. |
| Debugging | Stack trace, breakpoints, unit tests | Prompt iteration, eval suites, vibes | There is no stack trace inside a forward pass through billions of matrix multiplications. Debugging is more coaching than engineering. |
04
The Transformer in 60 Seconds
Every modern LLM is built on the transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. — a paper now cited over 170,000 times. You do not need to understand the math to use LLMs effectively, but you do need a mental map of what happens inside. Here is the 60-second version.
Step 1: Tokenize. Raw text is split into tokens — subword pieces that balance vocabulary size against meaning. “Understanding” might become [“Under”, “stand”, “ing”]. Each token maps to an integer ID. (This is the subject of the next article in this series.)
Step 2: Embed. Each token ID is looked up in a learned embedding table, producing a dense vector — a list of numbers that captures the token’s meaning. (If you have read What Are Vector Embeddings?, these are the same kind of vectors.) A positional encoding is added so the model knows that “dog bites man” is different from “man bites dog.”
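A toy version of the lookup-plus-position step, using the sinusoidal encoding from the original transformer paper. The five-token vocabulary and the embedding values are invented here; a real embedding table is learned during training, not hand-written.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:dim]

# Toy setup: 5-token vocabulary, 8-dimensional vectors (values are arbitrary).
VOCAB = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4}
EMBED = [[0.01 * (t + 1) * (d + 1) for d in range(8)] for t in range(5)]

def embed(token_ids):
    """Look up each token's vector and add its positional encoding."""
    out = []
    for pos, tid in enumerate(token_ids):
        e = EMBED[tid]
        p = sinusoidal_position(pos, 8)
        out.append([a + b for a, b in zip(e, p)])
    return out

ids = [VOCAB[w] for w in ["The", "cat", "sat", "on", "the"]]
vectors = embed(ids)
```

Because the positional encoding is added in, the same token at position 0 and position 3 produces different vectors, which is precisely how the model tells "dog bites man" apart from "man bites dog".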
Step 3: Transform. The embedded tokens pass through a stack of transformer blocks — the core innovation. GPT-3 uses 96 of these blocks. Llama 3.1 405B uses 126. Each block does two things:
- Self-attention: Every token looks at every other token and decides which ones matter most for predicting the next word. This is how the model knows that in “The cat sat on the mat because it was tired,” the word “it” refers to “cat” — not “mat.” Attention is the mechanism that lets information flow across the entire context window.
- Feed-forward network: A pair of dense layers that process each token’s representation independently, acting as a kind of learned lookup table. Recent research suggests this is where the model stores much of its factual knowledge.
Between each sub-layer, there is a residual connection (a shortcut that lets gradients flow during training) and a normalization step (which keeps the numbers stable).
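The block structure just described can be sketched in plain Python. This is a deliberately stripped-down, single-head toy: the Q/K/V projections are identities and the feed-forward weights are fixed constants, whereas a real block learns all of those matrices (and runs many attention heads in parallel).

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (keeps numbers stable)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def self_attention(xs):
    """Single-head scaled dot-product attention with identity Q/K/V (toy)."""
    d = len(xs[0])
    out = []
    for t in range(len(xs)):
        q = xs[t]
        keys = xs[:t + 1]  # causal mask: attend only to current and earlier tokens
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        tot = sum(w)
        w = [x / tot for x in w]
        # Weighted average of the attended tokens' vectors
        out.append([sum(w[j] * keys[j][i] for j in range(len(keys))) for i in range(d)])
    return out

def feed_forward(v):
    """Toy position-wise FFN: expand with ReLU, project back (fixed weights 2.0, 0.5)."""
    hidden = [max(0.0, x * 2.0) for x in v]
    return [x * 0.5 for x in hidden]

def transformer_block(xs):
    """Attention and FFN sub-layers, each wrapped in residual add + layer norm."""
    attended = self_attention(xs)
    xs = [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(xs, attended)]
    ffed = [feed_forward(x) for x in xs]
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(xs, ffed)]
```

A frontier model simply stacks on the order of a hundred of these blocks, so the output of one becomes the input of the next.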
What about alternatives? State space models like Mamba offer 5x throughput over transformers with linear (rather than quadratic) scaling in sequence length. But as of early 2026, no frontier LLM has adopted a pure SSM architecture. The trend instead is hybrid approaches and incremental attention improvements — like DeepSeek’s Sparse Attention or GPT-5’s Group Query Attention with sliding windows. The transformer remains king, even if its court is growing.
Step 4: Predict. After all transformer blocks, the final representation is projected back to vocabulary size — one score per token in the vocabulary. A softmax function converts these raw scores into probabilities. The token with the highest probability is the model’s “best guess” for what comes next.
The entire process — tokenize, embed, transform, predict — runs in a single forward pass. To generate a full response, the model runs this loop hundreds or thousands of times, each time appending the predicted token to the input and running the forward pass again. This is called autoregressive generation, and it is why LLMs produce text one token at a time.
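The autoregressive loop itself is simple. Below, a hard-coded bigram table stands in for the billions-of-parameters model (purely illustrative), but the predict-append-repeat loop has exactly the real shape.

```python
# A stand-in "model": any function mapping a token sequence to continuation scores.
# Here it's a tiny hand-written bigram table, purely for illustration.
BIGRAMS = {
    "The": {"cat": 2.0, "dog": 1.0},
    "cat": {"sat": 3.0, "ran": 1.0},
    "sat": {"on": 3.0},
    "on": {"the": 3.0},
    "the": {"mat": 2.5, "rug": 1.0},
}

def next_token(context):
    """Greedy 'model': pick the highest-scoring continuation of the last token."""
    options = BIGRAMS.get(context[-1], {})
    if not options:
        return None
    return max(options, key=options.get)

def generate(prompt, max_tokens=10):
    """Autoregressive loop: predict one token, append it, run the model again."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok is None:  # no known continuation: stop generating
            break
        tokens.append(tok)
    return tokens

print(generate(["The"]))
```

Swap the bigram table for a transformer forward pass and the sampler from the previous section, and this loop is, structurally, how every LLM produces text: one token at a time, each prediction conditioned on everything generated so far.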
[Figure: the forward pass, step by step]

- Raw text: "The cat sat on the"
- Tokenizer: splits text into subword token IDs
- Embedding + position: token IDs become dense vectors with positional information
- Transformer blocks: the core computation, repeated N times (e.g. ×126)
  - Self-attention: every token looks at every other token to decide which ones matter most
  - Add & norm: residual connection + layer normalization for stability
  - Feed-forward network: dense layers that process each token independently; stores factual knowledge
  - Add & norm: second residual connection + normalization before the next block (deep dive in a later article in this series)
- Output probabilities: a probability for every token in the vocabulary
05
The LLM Landscape Today
Which model should you use? The honest answer: it depends on your constraints. The LLM landscape in early 2026 is crowded, fast-moving, and segmented across multiple axes.
Open-weight vs. closed. Closed models — GPT-5, Claude, Gemini — are accessed via API. You cannot see the weights, cannot fine-tune them (except through the provider’s limited fine-tuning endpoints), and cannot run them on your own infrastructure. Open-weight models — Llama, Mistral, DeepSeek, Qwen, Gemma — give you the weights. You can run them locally, fine-tune them for your domain, and deploy them wherever you want. The capability gap between open and closed has narrowed dramatically. On MMLU, the gap shrank from 17.5 percentage points to just 0.3 over the course of a year (WhatLLM, 2025). Open-weight models now trail proprietary state-of-the-art by roughly three months on average. Even OpenAI — long the standard-bearer for closed models — released open-weight models in 2025 with gpt-oss under the Apache 2.0 license (OpenAI, 2025).
Dense vs. mixture-of-experts. A dense model activates every parameter for every token. A mixture-of-experts (MoE) model routes each token to a subset of “expert” sub-networks, activating only a fraction of total parameters. Llama 3.1 405B is dense — all 405 billion parameters fire for every token. DeepSeek V3 is MoE — 671 billion total, but only 37 billion active per token. The trade-off: MoE models are cheaper to run but harder to train and more complex to serve. As of late 2025, MoE has become the dominant architecture for frontier models. GPT-5, Gemini 3 Pro, Llama 4, DeepSeek V3, Mistral Large 3, Qwen3-Max, and Kimi K2 all use MoE — the only major frontier model family that has not confirmed MoE is Claude, whose architecture remains undisclosed.
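The routing step at the heart of MoE can be sketched in a few lines. The gating vectors below are invented for illustration; in a real MoE they are learned, and the router also has to balance load so no expert is starved or overwhelmed.

```python
import math

def route(token_vec, expert_gates, k=2):
    """Score each expert for this token and keep only the top-k (sparse activation)."""
    # Gating score = dot product between the token's vector and each expert's gate.
    scores = [sum(a * b for a, b in zip(token_vec, g)) for g in expert_gates]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over just the selected experts' scores -> mixing weights.
    m = max(scores[i] for i in top)
    w = [math.exp(scores[i] - m) for i in top]
    tot = sum(w)
    return [(i, x / tot) for i, x in zip(top, w)]

# 8 hypothetical experts, each with a made-up gating vector.
gates = [[0.1 * e, 0.2, -0.1 * e, 0.05] for e in range(8)]
chosen = route([1.0, 0.5, -0.5, 2.0], gates, k=2)
print(chosen)  # (expert_index, weight) pairs; only these experts run for this token
```

Only the chosen experts' feed-forward networks execute for this token, which is how a model like DeepSeek V3 can hold 671B parameters while spending the compute of a 37B model per token.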
General-purpose vs. specialized. Frontier models aim to be good at everything. But the market is increasingly segmented: coding models (Qwen3-Coder, DeepSeek-Coder, Codestral), reasoning models (o3, o4-mini, DeepSeek-R1, Gemini 3 Deep Think), small edge-deployable models (Llama 4 Scout at 17B active, Gemma 3 at 27B, Ministral 3 at 3–14B), and multimodal models that process text, images, and audio natively (GPT-5, Gemini 3, Llama 4, Kimi K2.5).
Cost is collapsing. Inference prices are dropping at roughly 10x per year — a trend researchers call “LLMflation.” GPT-4-equivalent performance that cost $20 per million tokens in late 2022 now costs $0.40. DeepSeek’s API pricing runs 94% cheaper than Claude Opus per token (a16z, 2025; Epoch AI, 2025). This changes the economics of every LLM application.
| Model | Developer | Parameters |
|---|---|---|
| GPT-5 / 5.2 | OpenAI | Undisclosed (MoE) |
| Claude Opus 4.6 | Anthropic | Undisclosed |
| Gemini 3 Pro | Google | Undisclosed (MoE) |
| Llama 4 | Meta | 400B / 17B active |
| DeepSeek V3.2 | DeepSeek | 671B / 37B active |
| Mistral Large 3 | Mistral AI | 675B / 41B active |
| Qwen3 | Alibaba | 0.6B to 1T+ |
| Kimi K2 | Moonshot AI | 1T / 32B active |
The table is already out of date. It was out of date the moment I wrote it. That is the pace of this field.
The real question is not “which model is best?” It is: what are your constraints? If you need the absolute highest quality and can afford API costs, a frontier closed model (GPT-5, Claude, Gemini) is the safe bet. If you need data sovereignty, fine-tuning, or cost control at scale, open-weight models (Llama 4, DeepSeek, Mistral, Kimi K2) give you leverage. If you are deploying to edge devices or need sub-50ms latency, small models (Llama 4 Scout, Gemma 3, Ministral 3) are the right tool. And if your task is narrow enough, a fine-tuned small model will often outperform a general-purpose giant — at 1/100th the cost.
You now know what an LLM does at the highest level — predict the next token. You know that scale creates emergent capabilities, that the transformer is the engine under the hood, and that the landscape is a spectrum from tiny edge models to trillion-parameter giants.
But before the model can predict anything, it needs to turn raw text into numbers. That process — tokenization — shapes everything downstream: cost, context window usage, multilingual performance, and even what the model can “see.” It is the subject of the next article in this series.