
The Transformer Architecture

Attention, feed-forward layers, and the block that changed everything

01

Before Transformers: The Bottleneck

In What Is a Large Language Model?, we described the transformer as the engine inside every modern LLM. In Tokenization and the Input Pipeline, we built the input: a sequence of vectors — one per token — carrying meaning and position. Now we open the engine and look inside.

But first, some context on what came before — because the transformer’s design only makes sense when you understand the problem it solved.

Recurrent Neural Networks (RNNs) were the dominant architecture for language tasks from roughly 2013 to 2017. They processed tokens one at a time, left to right, passing a hidden state from each step to the next. That hidden state was the model’s “memory” — a compressed summary of everything it had seen so far.

The problem was compression. The hidden state was a fixed-size vector, regardless of whether the model had seen 5 tokens or 500. Long-range dependencies — the kind where a pronoun in sentence 10 refers back to a noun in sentence 1 — were systematically lost. Information degraded with distance, like a game of telephone played across hundreds of steps.

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber (1997), partially addressed this with gating mechanisms that controlled what to remember and what to forget. LSTMs became the workhorse of NLP for nearly a decade. But they had three fundamental limitations: they could not revise stored information once written, their memory capacity was bounded by a fixed cell state, and — most critically — they could not process tokens in parallel. Each step depended on the previous step’s output. Training was inherently sequential.

This sequential bottleneck was the killer. GPUs are massively parallel processors. An LSTM, no matter how cleverly designed, left most of the GPU idle — waiting for one token to finish before starting the next. Training on the billions of tokens needed for modern language understanding was impractically slow.

Processing “The cat sat on the mat because it”

[Animation: the 8 tokens — The, cat, sat, on, the, mat, because, it — processed two ways]

RNN (Sequential): one token per step — 8 steps in total.

Transformer (Parallel): all 8 tokens at once, with 8×8 = 64 attention connections (every token to every token) — 1 step.

RNN: 8 sequential steps · Transformer: 1 parallel step

Then, in June 2017, eight researchers at Google published a paper titled “Attention Is All You Need” (Vaswani et al., 2017). The title was not hyperbole. They proposed an architecture that dispensed with recurrence entirely — no hidden state passed from step to step, no sequential processing. Instead, every token could attend to every other token simultaneously. All positions processed in parallel.

The paper has been cited over 150,000 times (as of 2025) — placing it among the top ten most-cited papers of the 21st century (Semantic Scholar). Every modern LLM — GPT, Claude, Llama, Gemini, Mistral, DeepSeek, Qwen — descends from it.

What made the difference? Parallelism. An RNN processes a 1,000-token sequence in 1,000 sequential steps. A transformer processes the same sequence in one step — a set of matrix multiplications that modern GPUs devour. This single architectural change unlocked the ability to train on trillions of tokens in reasonable time, which in turn unlocked the emergent capabilities we explored in the first article. Scale was always the goal. The transformer was the vehicle that made scale possible.

The full picture — each section below zooms into one part:

  1. Raw text: "The cat sat on the mat because it was tired." (Art. 2)
  2. Tokenizer — split text into subword tokens, map to integer IDs. (Art. 2)
  3. Embedding + Position — look up each token ID in a learned table, add positional encoding (RoPE). Produces one vector per token: 10 token vectors, each 12,288 dimensions. (sections 02–04)
  4. Transformer Block, ×96 layers — all token vectors pass through the same block together. Each block refines the vectors — early blocks learn syntax, middle blocks learn meaning, late blocks learn reasoning. Stacked 96–126 layers deep. Inside each block: Multi-Head Self-Attention (tokens exchange information) → Add + RMSNorm (preserve original + normalize) → Feed-Forward Network (apply stored knowledge) → Add + RMSNorm. Output: the same 10 vectors, now deeply refined — take the last one.
  5. Output Head — take the last token's vector after all 96 blocks → project to vocabulary size → softmax → one probability per vocab entry. (section 05)
  6. Sample next token — pick from the distribution (temperature controls randomness), append the token to the sequence, and run the whole stack again for the next token.

Self-Attention: The Core Mechanism

What does it mean for every token to “attend to” every other token? This is the question at the heart of the transformer — and the answer is the self-attention mechanism.

Start with an intuition. Consider the sentence: “The cat sat on the mat because it was tired.” What does “it” refer to? You know instantly — the cat. But how would a model figure that out? It needs a mechanism where the representation of “it” can be influenced by the representation of “cat,” even though they are separated by several words. The model needs to ask: “For this token, which other tokens in the sequence carry relevant information?”

Self-attention answers that question with three learned projections: Queries, Keys, and Values.

Think of it like a library. You walk into a library with a question (the Query). Every book on the shelf has a label on its spine (the Key) that describes what it contains. You compare your question to every label, and the labels that match well get your attention. Then you read the actual contents (the Values) of the books that matched — weighting the useful books more heavily than the less relevant ones.

In a transformer, every token in the sequence plays all three roles simultaneously. Each token’s embedding vector is linearly projected into three separate vectors:

  • Query (Q): “What information am I looking for?”
  • Key (K): “What information do I contain?”
  • Value (V): “Here is my actual content.”

These projections are learned weight matrices (WQ, WK, WV) — the model discovers during training how to create useful queries, keys, and values from raw token embeddings.

The attention score. For each pair of tokens (i, j), the model computes a score: how relevant is token j’s key to token i’s query? This is just a dot product: Qi · Kj. A high dot product means the query and key point in similar directions — “this token has what I am looking for.” A low dot product means irrelevance.

The scaling trick. Raw dot products can grow large, especially when the vectors are high-dimensional (128 dimensions per head in most modern models). Large values push the softmax function into near-saturation, producing extremely peaked distributions with vanishing gradients. The fix is simple: divide by the square root of the key dimension. This is why it is called scaled dot-product attention.

The softmax. After scaling, a softmax function converts the raw scores for each query into a probability distribution — a set of weights that sum to 1. This is the “attention pattern”: for token i, how much weight should each other token receive?

The weighted sum. Finally, the output for each token is a weighted sum of all value vectors, using the softmax weights. Tokens with high attention scores contribute more to the output; tokens with low scores contribute almost nothing.
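The four steps above — scores, scaling, softmax, weighted sum — fit in a few lines of NumPy. This is an illustrative sketch with toy shapes, not a production kernel:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: softmax, rows sum to 1
    return weights @ V, weights                     # step 4: weighted sum of values

# Toy example: 10 tokens, 128 dimensions per head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 128)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is one token's attention pattern — a probability distribution over all ten tokens.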

A worked example. Take the sentence “The cat sat on the mat because it was tired.” When computing the attention for the token “it”:

  • The query for “it” encodes something like: “I am a pronoun — what noun do I refer to?”
  • The key for “cat” encodes: “I am an animate noun, the subject of this clause.”
  • The key for “mat” encodes: “I am an inanimate noun, an object of a preposition.”
  • The dot product Q(“it”) · K(“cat”) is high — subject-pronoun alignment.
  • The dot product Q(“it”) · K(“mat”) is lower — less syntactic agreement.
  • After softmax, “cat” gets a high weight (say, 0.62), “mat” gets a low weight (say, 0.05), and the remaining weights are spread across other tokens.
  • The output for “it” is dominated by the value of “cat” — pulling the meaning of “cat” into the representation of “it.”

This is how the model resolves coreference, without any hand-coded rules about pronouns. The attention mechanism discovers, during training, that this is a useful pattern.

The attention matrix — a square grid with one row per query token and one column per key token — is the window into what the model is “paying attention to.” If you have worked with cosine similarity, the intuition is similar: the dot product between the query and key captures directional alignment in a learned embedding space.

Attention matrix (rows = queries, columns = keys):

             The   cat   sat   on    the   mat   because  it    was   tired
  The        0.45  0.10  0.10  0.08  0.08  0.05  0.05     0.03  0.03  0.03
  cat        0.08  0.40  0.12  0.08  0.08  0.06  0.06     0.04  0.04  0.04
  sat        0.06  0.25  0.30  0.10  0.06  0.08  0.05     0.04  0.03  0.03
  on         0.05  0.06  0.15  0.30  0.10  0.18  0.06     0.04  0.03  0.03
  the        0.12  0.06  0.05  0.10  0.30  0.20  0.07     0.04  0.03  0.03
  mat        0.04  0.05  0.08  0.18  0.15  0.35  0.05     0.04  0.03  0.03
  because    0.04  0.08  0.12  0.05  0.04  0.06  0.30     0.18  0.08  0.05
  it         0.04  0.62  0.04  0.02  0.03  0.05  0.06     0.06  0.04  0.04
  was        0.03  0.08  0.05  0.03  0.03  0.03  0.08     0.42  0.18  0.07
  tired      0.03  0.15  0.04  0.02  0.02  0.03  0.05     0.12  0.22  0.32

Row “it” → column “cat” = 0.62 — the model resolves coreference.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

That attention matrix is the soul of the transformer. Everything else — the feed-forward layers, the normalization, the residual connections — exists to support it. But one head of attention is not enough. A single set of Q, K, V projections learns one type of relationship. The model needs many.

03

Multi-Head Attention

Why would a single attention head be insufficient? Because language has multiple simultaneous types of relationships. In “The cat sat on the mat because it was tired,” different things matter depending on what you are trying to predict:

  • Syntactic proximity: “sat” relates to its subject “cat” and its prepositional phrase “on the mat.”
  • Coreference: “it” refers back to “cat.”
  • Semantic role: “tired” is a predicate about “it” (and therefore about “cat”).
  • Positional pattern: Adjacent words like “the” and “cat” have a strong local dependency.

A single attention head cannot capture all of these patterns at once. It learns a single set of Q, K, V projections — optimized for whatever combination of patterns reduces the training loss most. Some relationships inevitably get underweighted.

Multi-head attention solves this by running multiple attention heads in parallel. Each head gets its own learned projections (WQi, WKi, WVi) and operates on a smaller subspace of the full embedding dimension. The outputs from all heads are concatenated and projected back to the original dimension.

The math works out cleanly. GPT-3 has a hidden dimension of 12,288 and 96 attention heads. Each head produces a 128-dimensional output (12,288 / 96 = 128). But this is not slicing — each head multiplies the full 12,288-dimensional vector by its own learned weight matrix (shaped 12,288 × 128) to compress it into 128 dimensions. Every head sees all the information; each just learns a different “lens.” After all heads run in parallel, their 128d outputs are concatenated (96 × 128 = 12,288) and a linear projection (WO) remixes them back into 12,288 dimensions.
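A sketch of that projection math in NumPy, scaled down to toy dimensions (d_model = 1,024 with 8 heads of 128, rather than GPT-3's 12,288 with 96) so the shapes are easy to follow:

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (seq, d_model).  Each head multiplies the FULL input by its own
    (d_model, d_head) slice of the weight matrix -- not a slice of x."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def project(W):  # (seq, d_model) @ (d_model, d_model), split into head subspaces
        return (x @ W).reshape(seq, n_heads, d_head)

    q, k, v = project(W_q), project(W_k), project(W_v)
    heads = []
    for h in range(n_heads):                       # each head attends independently
        scores = q[:, h] @ k[:, h].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ v[:, h])                  # (seq, d_head)
    concat = np.concatenate(heads, axis=-1)        # (seq, d_model): 8 x 128 = 1,024
    return concat @ W_o                            # W_O remixes the heads' perspectives

rng = np.random.default_rng(1)
d_model, n_heads = 1024, 8                         # toy sizes; GPT-3 uses 12,288 and 96
x = rng.normal(size=(6, d_model)) * 0.02
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
```

The output has the same shape as the input — exactly the property the residual connections in section 04 rely on.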

What do the heads actually learn? Interpretability research has shown that different heads specialize. Some track syntactic structure (subject-verb agreement). Some handle positional proximity (attending to nearby tokens). Some capture long-range semantic relationships (coreference). Some focus on rare or unusual tokens. The model does not decide these specializations in advance — they emerge during training.

  Model           Parameters  Attention Heads  KV Heads  Head Dim  Layers
  GPT-3           175B        96               96        128       96
  Llama 3.1 8B    8B          32               8         128       32
  Llama 3.1 70B   70B         64               8         128       80
  Llama 3.1 405B  405B        128              8         128       126

Sources: Brown et al., 2020; Meta AI, 2024

Notice two things. First, the head dimension is consistently 128 across all major model families — the variation is in how many heads you use and how deep the stack is. Second, the “KV Heads” column is different from “Attention Heads” for the Llama models. That is Grouped Query Attention — a modern optimization we will cover in section 06.

How multi-head attention works — the full projection math

Each head multiplies the full 12,288d input vector by its own learned weight matrix. Every head sees all the information and compresses it differently.

Step 1 — each head projects:

  q_head = x × W_Q^head
  [1 × 12,288] × [12,288 × 128] = [1 × 128]

Same for K and V. Each head has its own W_Q, W_K, W_V — that's 3 × 96 = 288 different matrices, each learning a different “lens” on the same input.

Step 2 — each head computes attention independently:

  output_head = softmax(q · kᵀ / √128) × v

Result: [1 × 128] per head — 96 heads run in parallel.

Step 3 — concatenate all heads + final projection:

  output = concat(head₁, head₂, …, head₉₆) × W_O
  [1 × 12,288] × [12,288 × 12,288] = [1 × 12,288]

The concatenation restores the original dimension (96 × 128 = 12,288). W_O then learns how to remix the 96 perspectives into a single enriched representation — the output has the same shape as the input, but now informed by all heads' attention patterns.

Multi-head attention: 4 heads (of 96) shown

Input embeddings (12,288d) fan out to parallel 128d heads:

  • Head 1 — positional proximity: attends to neighbouring tokens
  • Head 2 — syntactic dependency: tracks subject-verb agreement
  • Head 3 — coreference: links pronouns to their referents
  • Head 4 — semantic similarity: groups tokens by meaning

All heads: Concat + W_O → 12,288d

The fan-out pattern is why “multi-head” attention scales so well. Each head independently computes attention using 128-dimensional Q, K, V vectors (rather than 12,288-dimensional ones), so the per-head parameter cost is 96x smaller. The attention matrix itself is always sequence_length × sequence_length — but computing it with 128-dimensional keys is far cheaper than with 12,288-dimensional ones. The computation is trivially parallelizable across heads and across the batch dimension. GPUs eat this for breakfast.

04

The Full Transformer Block

Self-attention and multi-head attention are the headline innovations. But a complete transformer block has four components, and the “plumbing” between them matters more than you might expect.

Here is what one transformer block looks like, in the order the data flows through:

  1. Multi-Head Self-Attention — the mechanism from section 03. Each token’s representation is updated by attending to all other tokens.
  2. Residual Connection + Normalization — the output of attention is added to the input (skip connection), then normalized.
  3. Feed-Forward Network (FFN) — two linear layers with an activation function in between. Processes each token’s representation independently.
  4. Residual Connection + Normalization — same pattern: add the FFN output to its input, then normalize.

This block is repeated dozens or hundreds of times. GPT-3 stacks 96 blocks. Llama 3.1 405B stacks 126. Each block refines the token representations, building richer and more abstract features layer by layer.

GPT-3: 96 transformer blocks (4 shown)

  Token Embeddings + Position
    ↓
  Block 1 (early layers) — capture surface-level features: parts of speech, local word relationships, basic syntax.
  Block 2 (early layers) — refine positional and lexical patterns; build basic phrase structure.
    ··· 45 more blocks ···
  Block 48 (middle layers) — build compositional meaning: clause boundaries, coreference chains, semantic roles.
    ··· 47 more blocks ···
  Block 96 (final layers) — task-specific computation: next-token prediction, instruction following, reasoning.
    ↓
  Output Probabilities

Residual connections: the gradient highway. The “add” in “add and normalize” is a residual connection — one of the most important ideas in deep learning, introduced by He et al. (2016) for image recognition. Without residual connections, training a 96-layer network would be practically impossible. Gradients — the signals that tell each layer how to update its weights — shrink exponentially as they flow backward through many layers. This is the vanishing gradient problem, and it killed deep networks before residual connections solved it.

The fix is elegant. Instead of the block computing a function H(x), it computes a residual F(x) = H(x) − x, so the output is x + F(x). The skip connection creates a direct path for gradients to flow back to earlier layers, bypassing the transformations. Every layer only needs to learn a small refinement on top of the input — not a complete transformation from scratch. He et al. demonstrated networks up to 152 layers using this technique; without it, networks deeper than roughly 20 layers were untrainable.

Normalization: keeping the numbers stable. The normalization step prevents the internal values from growing unboundedly as data flows through 96+ layers. The original transformer used LayerNorm (Ba et al., 2016), which recenters and rescales each vector to have zero mean and unit variance. Modern models have switched to RMSNorm (Zhang & Sennrich, 2019), which only rescales by the root-mean-square value — no recentering. Fewer operations, less memory, comparable performance. Llama, Mistral, DeepSeek, Qwen, and virtually every modern open-weight model use RMSNorm with pre-normalization (normalize before each sublayer, not after).
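The add-and-normalize pattern is compact enough to write out. A sketch of RMSNorm and the pre-norm residual wiring described above, at toy dimensions:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only -- unlike LayerNorm,
    there is no recentering (no mean subtraction)."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gain

def pre_norm_residual(x, sublayer, gain):
    """Pre-normalization: normalize BEFORE the sublayer, then add the
    input back.  The identity path is the gradient highway."""
    return x + sublayer(rms_norm(x, gain))

x = np.random.default_rng(2).normal(size=(5, 16)) * 3.0
gain = np.ones(16)  # learned per-dimension scale, initialized to 1
```

Even if the sublayer contributes nothing at first (outputs near zero), `x` passes through unchanged — which is why gradients still reach the early layers of a 96-block stack.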

The feed-forward network: where the knowledge lives. After attention lets tokens communicate with each other, the FFN processes each token’s representation independently. In GPT-3, the FFN’s intermediate dimension is 49,152 — four times the hidden dimension of 12,288. This expansion creates a high-dimensional space where the model can perform non-linear transformations on each token.

But the FFN is more than a generic function approximator. Geva et al. (2021) showed that feed-forward layers in transformers operate as key-value memories. The first weight matrix (W1) acts as keys that correlate with textual patterns in the training data. The second weight matrix (W2) acts as values that induce output distributions for those patterns. Lower layers capture shallow patterns (“the word after ‘the’ is usually a noun”). Upper layers capture semantic patterns (“this passage is about medicine”). A 2025 study confirmed that FFNs in the middle 70% of transformer layers contribute more to model performance than other architectural components. This is where the model stores what it “knows” — and it is why the parameter count of the FFN layers dominates the total model size. In GPT-3, the FFN blocks account for roughly 116 billion of the model’s 175 billion parameters (GPT-3 Architecture).
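As a sketch, a GPT-3-style FFN is just two matrix multiplies with a GeLU in between — toy dimensions here standing in for 12,288 → 49,152 → 12,288:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GeLU, as used in GPT-style models."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand 4x, apply the nonlinearity, project back.
    In the key-value-memory reading (Geva et al., 2021), each column of W1
    acts as a key matched against x, and the matching row of W2 is its value."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
d, d_ff = 64, 256                     # 4x expansion (GPT-3: 12,288 -> 49,152)
x = rng.normal(size=(10, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.02, np.zeros(d)
out = feed_forward(x, W1, b1, W2, b2)
```

Note that the FFN touches each token's vector independently — no information flows between positions here; that is attention's job.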

Layer-by-layer refinement. Early blocks tend to capture surface-level features: parts of speech, local word relationships, basic syntax. Middle blocks build compositional structure: clause boundaries, coreference chains, semantic roles. Late blocks perform task-specific computation: generating the next token, following instructions, reasoning about relationships. This progression is not hand-designed — it emerges from training. But it is consistent enough that researchers can predict, given a layer number, what kind of information it encodes (Tenney et al., 2019).

Inside one transformer block (GPT-3 dimensions): a self-attention phase followed by a feed-forward phase. Vectors are 12,288d throughout, except inside the FFN, which expands to 49,152d (4× expansion); residual (skip) connections wrap both phases.

05

Decoder-Only: The GPT Architecture

The original 2017 transformer was an encoder-decoder architecture, designed for translation. The encoder processed the input sentence with bidirectional attention (every token sees every other token). The decoder generated the output translation with causal attention (each token sees only preceding tokens), plus cross-attention to the encoder’s output.

Modern LLMs do not use encoder-decoder. They use decoder-only — just the decoder stack, with no encoder at all. GPT, Claude, Llama, Gemini, Mistral, DeepSeek, Qwen — all decoder-only. Why?

Causal masking: no peeking ahead. In a decoder-only model, each token can only attend to itself and the tokens that came before it. Token at position 5 sees tokens [1, 2, 3, 4, 5] but never [6, 7, 8, …]. This is enforced by a causal mask: a triangular matrix applied to the attention scores before softmax. The upper-right triangle (representing future tokens) is set to negative infinity, which softmax converts to zero weight. The result: the model generates text left to right, one token at a time, conditioned only on the past.

This is not a limitation — it is the design. Causal masking is what makes the model autoregressive: it predicts the next token given all previous tokens, exactly the task it was trained on. At inference time, the model generates a token, appends it to the sequence, and runs the forward pass again. Each new token “unlocks” one more column of the attention matrix.
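A sketch of how the causal mask modifies the attention computation — future positions are set to −∞ before the softmax, which turns into exactly zero weight:

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (seq, seq) raw attention scores.  Mask out j > i, then softmax."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # upper triangle = future tokens
    masked = np.where(future, -np.inf, scores)          # "no peeking ahead"
    masked -= masked.max(axis=-1, keepdims=True)        # stable softmax; exp(-inf) -> 0
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(4).normal(size=(8, 8)))
```

Row 0 can only attend to itself (weight 1.0); row 7 spreads its weight over all eight positions.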

Why decoder-only won. The shift from encoder-decoder to decoder-only happened between 2018 and 2020, driven primarily by OpenAI’s GPT series. Three factors made decoder-only dominant:

  1. Simplicity. One attention mechanism (causal self-attention) handles all dependencies. No cross-attention, no separate encoder. Fewer architectural decisions, fewer hyperparameters, simpler training infrastructure.
  2. Task generality. Next-token prediction on a causal sequence is the most general pre-training objective in NLP. It works for text completion, question answering, code generation, translation — any task that can be framed as “generate a continuation.”
  3. Scaling track record. The GPT series (GPT-1 through GPT-4) proved that decoder-only scales predictably. The industry invested billions in this paradigm.

Recent research has revisited this assumption. A 2025 study found that encoder-decoder models can achieve significantly lower first-token latency and higher throughput on certain hardware configurations, particularly edge devices. But no frontier model has switched — the momentum behind decoder-only is enormous, and the scaling laws are well-understood.

Attention mask

[Diagram: 8×8 causal mask for “The cat sat on the mat because it” — each row can attend to itself and earlier columns; future positions are blocked (×).]

Causal mask: each token sees only itself and preceding tokens. The upper triangle is blocked.

The output head. After all transformer blocks, the final token representation needs to become a prediction. This is the job of the output head — a linear projection from the hidden dimension back to vocabulary size. For Llama 3.1 405B, this means a projection from 16,384 dimensions to 128,256 (the vocabulary size). A softmax converts the raw scores (called logits) into a probability distribution over every token in the vocabulary. The token with the highest probability is the model’s “best guess” — though as we covered in What Is a Large Language Model?, the model can also sample from this distribution to produce more varied output (controlled by temperature).
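A sketch of the output head with Llama 3.1 405B's dimensions shrunk to toy sizes; temperature is applied by dividing the logits before the softmax:

```python
import numpy as np

def next_token_probs(h, W_out, temperature=1.0):
    """Project the final hidden state to vocabulary logits, then softmax.
    Lower temperature -> sharper (more peaked) distribution."""
    logits = (h @ W_out) / temperature     # (vocab,)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(5)
d, vocab = 64, 1000                        # toy sizes (405B: 16,384 and 128,256)
h = rng.normal(size=(d,))
W_out = rng.normal(size=(d, vocab))
p = next_token_probs(h, W_out)
greedy = int(np.argmax(p))                 # highest-probability "best guess"
sampled = int(rng.choice(vocab, p=p))      # or sample for varied output
```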

Some models reuse the same weight matrix for both the input embedding table and the output projection — a technique called weight tying. This makes intuitive sense: the same matrix that maps token IDs to vectors also maps vectors back to token probabilities. It saves a significant chunk of memory — the embedding table for a 128K vocabulary at 16,384 dimensions is over 2 billion parameters. However, most frontier models today — including Llama 3, GPT-3, and DeepSeek — do not tie these weights. They keep separate embedding and output head parameters, which allows each to specialize. Weight tying is more common in smaller models where parameter efficiency matters more.

Output head: hidden state → next token

  Final hidden state (16,384d)
    → Linear projection (W_out): 16,384 → 128,256
    → Logits (raw scores): 128,256 values
    → Softmax → probabilities: 128,256 values, sum = 1
    → Sample next token: tired 23% · cold 11% · comfortable 8% · sleeping 6% · raining 4% · + 128,251 other tokens with smaller probabilities

The full architecture, then, looks like this:

  1. Tokenize raw text into token IDs (covered in Tokenization and the Input Pipeline).
  2. Embed each token ID into a dense vector via the embedding table.
  3. Apply positional encoding (RoPE rotations on the query and key vectors).
  4. Pass through N transformer blocks, each applying multi-head causal self-attention, residual connections, normalization, and feed-forward processing.
  5. Project the final representation to vocabulary size and softmax into probabilities.
  6. Sample the next token, append it, and repeat.

That is the complete decoder-only transformer. Every modern LLM — from an 8-billion-parameter Llama 3.1 8B to the frontier models with undisclosed sizes — follows this blueprint.
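Those six steps reduce to a loop. In this sketch, `forward_fn` is a hypothetical stand-in for the entire stack (embed → N blocks → output head) returning a next-token distribution:

```python
import numpy as np

def generate(prompt_ids, forward_fn, n_new, temperature=1.0, seed=0):
    """Autoregressive decoding: run the stack, sample one token,
    append it, and run again -- step 6 feeding back into step 1."""
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        probs = forward_fn(ids)                 # distribution over the vocabulary
        probs = probs ** (1.0 / temperature)    # same effect as dividing logits by T
        probs = probs / probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids

# Hypothetical toy "model": uniform distribution over a 10-token vocabulary
out = generate([1, 2], lambda ids: np.full(10, 0.1), n_new=5)
```

In a real system each iteration reuses cached keys and values from previous tokens (the KV cache discussed in section 06) rather than recomputing them.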

06

Modern Optimizations

The architecture described in sections 02 through 05 is the conceptual transformer — the design from 2017, refined through 2020. Every production model today uses it. But the details have evolved significantly. A modern transformer block in Llama 3.1 or DeepSeek V3 differs from GPT-3 in five key ways. None of these change the fundamental mechanism — they are engineering refinements that improve speed, memory efficiency, or both.

Grouped Query Attention (GQA). In standard multi-head attention, every head has its own query, key, and value projections. That means the KV cache — the stored key and value vectors from previous tokens that avoid redundant computation during generation — grows linearly with the number of heads. GQA, introduced by Ainslie et al. (2023), shares key and value projections across groups of heads. Instead of 128 independent KV sets, Llama 3.1 405B uses just 8 — a 16:1 compression ratio. GQA is now standard in Llama 3/4, Gemma 3, Qwen3, and Mistral.
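A sketch of the sharing scheme: each cached K/V head set is reused by a whole group of query heads (toy shapes here; Llama 3.1 405B uses 128 query heads over 8 KV sets):

```python
import numpy as np

def expand_kv_heads(k, v, n_q_heads):
    """Grouped Query Attention: repeat each K/V head so every query head
    in a group shares the same keys and values.
    k, v: (seq, n_kv_heads, d_head)."""
    n_kv_heads = k.shape[1]
    group = n_q_heads // n_kv_heads             # e.g. 128 // 8 = 16 for Llama 3.1 405B
    return np.repeat(k, group, axis=1), np.repeat(v, group, axis=1)

rng = np.random.default_rng(6)
k = rng.normal(size=(4, 8, 16))                 # only 8 KV sets are ever cached
v = rng.normal(size=(4, 8, 16))
k_full, v_full = expand_kv_heads(k, v, n_q_heads=32)
```

The cache stores only the 8 compact sets; the expansion is cheap and happens at attention time.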

Rotary Position Embeddings (RoPE). We covered this in Tokenization and the Input Pipeline, but it fits into the transformer block at the attention layer. RoPE rotates the query and key vectors by position-dependent angles before computing dot products. The attention score then naturally encodes relative position — the distance between tokens, not their absolute locations (Su et al., 2021).
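A minimal real-valued sketch of the rotation (implementations differ in how they pair dimensions; this one rotates consecutive pairs):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles.
    x: (seq, d) with d even; applied to Q and K before the dot product."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # each pair is one 2D rotation plane
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(7)
q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
# The attention score depends only on RELATIVE position:
# query at 2 vs key at 5 matches query at 10 vs key at 13 (distance 3 in both).
s1 = rope(q, np.array([2.0])) @ rope(k, np.array([5.0])).T
s2 = rope(q, np.array([10.0])) @ rope(k, np.array([13.0])).T
```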

RMSNorm replacing LayerNorm. RMSNorm drops the recentering step, normalizing only by the root-mean-square value. Fewer operations, smaller memory footprint, comparable performance. Every modern model family has switched (Zhang & Sennrich, 2019).

SwiGLU activation in the FFN. GPT-3’s feed-forward network used GeLU activation. Modern models use SwiGLU — a gated activation function introduced by Shazeer (2020). SwiGLU adds a gating mechanism: instead of W2 × GeLU(W1 × x), it computes W2 × (Swish(Wgate × x) ⊙ (Wup × x)), where ⊙ denotes element-wise multiplication. Empirical results show consistent improvements, though the original paper candidly notes: “We offer no explanation as to why these architectures seem to work.”
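The formula translates directly into code. A toy-dimension sketch (Llama-style models shrink the hidden width to roughly 8/3 × d so the three-matrix version matches the two-matrix parameter count):

```python
import numpy as np

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN: W_down applied to Swish(x W_gate) gated element-wise
    by (x W_up) -- three weight matrices instead of two."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(8)
d, d_ff = 64, 176                       # toy sizes; ~8/3 * d keeps params comparable
x = rng.normal(size=(10, d))
W_gate, W_up = (rng.normal(size=(d, d_ff)) * 0.02 for _ in range(2))
W_down = rng.normal(size=(d_ff, d)) * 0.02
y = swiglu_ffn(x, W_gate, W_up, W_down)
```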

Flash Attention. Standard attention computes the full N × N attention matrix in GPU high-bandwidth memory (HBM) — an O(N²) memory cost. FlashAttention (Dao et al., 2022) avoids materializing this matrix entirely, using a tiling strategy that keeps computation in fast SRAM. The result: 10–20x memory reduction and 2–4x speedup. FlashAttention-2 pushed to 73% of theoretical max throughput on A100 GPUs. FlashAttention-3 targets H100s, reaching 1.3 petaFLOPS — 85% of peak hardware utilization (PyTorch Blog). Flash Attention is what makes 128K and million-token context windows practical.

Together, these five optimizations turn the 2017 transformer into a 2026 production system. The concepts are unchanged — attention, residual connections, feed-forward layers, causal masking. But the implementation is faster, leaner, and capable of handling context lengths the original authors never imagined.

2017 transformer → 2026 production

  • Attention: Multi-Head Attention (128 independent KV sets) → Grouped Query Attention (8 shared KV sets) — 16× smaller KV cache
  • Position: Absolute embeddings (learned positions, fixed max length) → RoPE (relative positions via rotation) — extrapolates to longer contexts
  • Normalization: LayerNorm (recenter + rescale, scale + shift params) → RMSNorm (rescale only, fewer ops) — same performance, less memory
  • Activation: GeLU / ReLU (simple activation, 2 weight matrices) → SwiGLU (gated activation, 3 matrices) — consistent quality gains
  • Attention compute: Standard attention (full N×N matrix in HBM, O(N²) memory) → Flash Attention (tiled computation in SRAM) — 10–20× memory reduction

You have seen the architecture — attention to let tokens communicate, feed-forward layers to store and transform knowledge, residual connections to keep gradients flowing, causal masking to enforce left-to-right generation, and a stack of blocks deep enough to capture everything from syntax to semantics.

But an untrained transformer is just random weights producing random token predictions. The architecture is the engine; training is the fuel. How does a model go from random noise to coherent text generation? How much data does it need, and how much does it cost? That is the subject of Pre-Training — How LLMs Learn.