
Pre-Training — How LLMs Learn

Data, objectives, and the trillion-token training runs that create general intelligence

01

The Training Objective: Next Token Prediction

In The Transformer Architecture, we built the engine — self-attention, feed-forward layers, residual connections, the full decoder stack. But an untrained transformer is nothing more than random weights producing random predictions. Every output is noise. The architecture is a vehicle; pre-training is what teaches it to drive.

And the driving lesson is almost absurdly simple: predict the next token.

That is the entire training objective. Given a sequence of tokens — say, “The capital of France is” — the model outputs a probability distribution over its entire vocabulary. The correct answer, “Paris,” should get the highest probability. If it does not, the model is wrong, and its weights are adjusted to be less wrong next time.

This objective is called causal language modeling. “Causal” because the model can only look backward — token i is conditioned on tokens 1 through i−1, never on future tokens. This is the causal masking we explored in The Transformer Architecture: the attention mask prevents peeking ahead, and training exploits it by computing the loss at every position in the sequence simultaneously. A single training sequence of 4,096 tokens produces 4,095 prediction tasks in one forward pass.
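The shift-by-one setup can be sketched in a few lines of plain Python — an illustration of the bookkeeping, not any particular framework's API:

```python
# A training sequence of N tokens yields N-1 (input, target) pairs:
# at position i, the model sees tokens[0..i] and must predict tokens[i+1].
tokens = list(range(4096))  # stand-in for a tokenized 4,096-token sequence

inputs = tokens[:-1]   # what the model conditions on at each position
targets = tokens[1:]   # what it must predict at each position

assert len(inputs) == len(targets) == 4095  # 4,095 prediction tasks

# Position i's task: given inputs[: i + 1], predict targets[i].
print(len(targets))  # number of loss terms computed in one forward pass
```

The causal mask makes all 4,095 predictions computable in a single forward pass, because each position's attention is already restricted to earlier tokens.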

The loss function is cross-entropy — a measure from information theory that quantifies how far the model’s predicted distribution is from the true answer. For a single position, it reduces to:

L = −log(p_y)

where p_y is the probability the model assigns to the correct next token

If the model is confident and correct (p_y = 0.95), the loss is low (~0.05). If it spreads probability across thousands of wrong tokens (p_y = 0.001), the loss is high (~6.9). The training process — backpropagation — computes how each weight in the network contributed to that loss and nudges every weight in the direction that would reduce it. Multiply this by trillions of tokens, and the model gradually develops the ability to predict plausible continuations for nearly any text input.
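Those two numbers are easy to verify directly. A minimal check, using only the standard library:

```python
import math

def token_loss(p_correct: float) -> float:
    """Cross-entropy loss at one position: -log of the probability
    assigned to the true next token."""
    return -math.log(p_correct)

confident = token_loss(0.95)   # model nearly certain, and right
diffuse = token_loss(0.001)    # probability spread across wrong tokens

print(round(confident, 3))  # ~0.051
print(round(diffuse, 3))    # ~6.908
```

The asymmetry is the point: being confidently right costs almost nothing, while assigning the true token a tiny probability is punished logarithmically without bound.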

What makes this remarkable is what falls out of the objective. Nobody teaches the model grammar, facts, reasoning, or code. All of those capabilities emerge from the relentless pressure to predict the next token across a vast, diverse corpus. As we discussed in What Is a Large Language Model?, emergent capabilities arise from scale. Pre-training is the process that turns scale into capability.

[Figure: gradient descent on the loss landscape — cross-entropy loss falls from high to low as tokens are processed, with a sample prediction shown at each stage. L = −log(p_y) · lower loss = better predictions · the model processes 15T tokens to reach convergence.]

02

The Data Pipeline

Where do trillions of tokens come from? The short answer: the internet. The longer answer involves one of the most elaborate data engineering pipelines in computing.

It starts with Common Crawl — a nonprofit that has been archiving the public web since 2008. Their archive exceeds 9.5 petabytes of raw HTML, and each monthly crawl captures roughly 2.5 to 3 billion web pages (Common Crawl, 2025). Every major open LLM draws from this reservoir. But raw web data is overwhelmingly garbage — duplicated pages, spam, boilerplate navigation, SEO filler, toxic content, personal information. The art of pre-training data is not in collecting it. The art is in filtering it.

The Llama 3 data pipeline is the most detailed public account of this process. Their final training set contains 15 trillion tokens — seven times larger than Llama 2’s dataset — with a carefully tuned mix: roughly 50% general knowledge, 25% mathematical and reasoning content, 17% code, and 8% multilingual text covering over 30 languages.

Getting from raw crawl to that clean 15 trillion tokens requires multiple filtering stages:

  1. Heuristic filters strip HTML boilerplate, remove documents that are too short or too repetitive, and discard pages with abnormal character distributions.
  2. Deduplication — both exact and semantic — eliminates near-duplicate documents. Meta used RoBERTa-based clustering for semantic dedup, scoring documents by quality and difficulty, then greedily selecting only those below a cosine similarity threshold.
  3. Quality classification separates high-value text from low-value text. Here is where it gets recursive: Meta used Llama 2 to generate labels for training the text-quality classifiers that filter data for Llama 3. The previous generation of the model helps curate data for the next.
  4. Domain-specific pipelines handle code and math separately. DistilRoBERTa classifiers trained on Llama-2-annotated data identify code-relevant and math-relevant web pages.
  5. Safety filtering removes toxic content, NSFW material, and personally identifiable information.
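Stage 1 can be sketched as a toy filter. The thresholds below are illustrative assumptions for demonstration, not the values any production pipeline actually uses:

```python
def passes_heuristics(doc: str,
                      min_words: int = 50,
                      max_repetition: float = 0.30,
                      min_alpha_frac: float = 0.70) -> bool:
    """Toy heuristic filter in the spirit of stage 1: length, repetition,
    and character-distribution checks. All thresholds are illustrative
    guesses, not Meta's actual values."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Repetition: fraction of duplicated words (boilerplate/spam signal).
    repetition = 1 - len(set(words)) / len(words)
    if repetition > max_repetition:
        return False
    # Character distribution: mostly letters and spaces, not markup/noise.
    alpha = sum(c.isalpha() or c.isspace() for c in doc)
    if alpha / len(doc) < min_alpha_frac:
        return False
    return True

print(passes_heuristics("buy now " * 100))  # False: highly repetitive spam
```

Real pipelines layer dozens of such rules, but the shape is the same: cheap, fast checks that discard the obvious garbage before the expensive classifier stages run.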

Each stage discards a large fraction of the data. You begin with petabytes and end with terabytes. The funnel is steep.

Llama 3 data pipeline — from raw crawl to 15T tokens (share of data surviving each stage):

  • Raw web crawl — 100% (Common Crawl, 9.5+ PB archive)
  • HTML extraction — 82% (strip boilerplate, extract text)
  • Heuristic filtering — 55% (length, repetition, character distribution)
  • Deduplication — 30% (exact + semantic, RoBERTa clustering)
  • Quality classification — 16% (Llama 2–trained classifiers)
  • Safety filtering — 14% (toxic, NSFW, PII removal)
  • Final dataset — 12% (15T tokens)

Final data mix: 50% general knowledge · 25% math/reasoning · 17% code · 8% multilingual.

This pipeline is not unique to Meta. The open-source community has produced its own large-scale training datasets. FineWeb, released by HuggingFace in 2024, processed 96 Common Crawl dumps into 15 trillion clean tokens. Its educational subset, FineWeb-Edu (1.3 trillion tokens), demonstrated that models trained on filtered educational text dramatically outperform those trained on broader data on knowledge-intensive benchmarks. RedPajama-V2 from Together AI offers 30 trillion tokens of processed web data. Dolma from AI2 provides 3 trillion tokens from web, academic, code, and book sources.

The lesson across all of these efforts is the same: data quality dominates data quantity. Microsoft’s Phi model family made this point forcefully. Phi-1 outperformed models 10 times its size on coding benchmarks despite being trained on 100 times less data (Microsoft Research, 2023). Phi-4, at just 14 billion parameters, matches or exceeds much larger models by investing heavily in data curation and synthetic data — not scale (Microsoft Research, 2024).

The data mix also shapes what the model can do. Llama 3’s decision to allocate 17% of its tokens to code was deliberate: code training improves reasoning performance even on non-code tasks, because code demands logical precision and structured thinking. These are not arbitrary percentages. They are the result of extensive ablation studies where Meta trained smaller models on different mixes and measured downstream performance before committing the full 405B-parameter model to a final mix.

03

The Training Loop

With 15 trillion tokens prepared and a 405-billion-parameter model initialised with random weights, the training loop begins. It will run for months, consuming millions of GPU hours, and every detail of the loop’s design affects whether the model converges, diverges, or stalls.

Batching is the first consideration. The model processes tokens in batches — not one sequence at a time, but thousands in parallel. Llama 3 405B started training with a batch size of 4 million tokens (1,024 sequences of 4,096 tokens each) and gradually increased to 8 million, then 16 million tokens per batch (Meta, 2024). Larger batches produce more stable gradient estimates but require more memory. The staged increase is a common strategy: start small for training stability when gradients are noisy, then scale up once the loss landscape smooths out.

Gradient accumulation extends this idea. When even a 4-million-token batch will not fit in GPU memory in a single forward pass, the batch is split into micro-batches. Each micro-batch computes its gradients independently, and those gradients are accumulated (summed) before a single weight update. This simulates a large batch on hardware that cannot hold it all at once.
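The equivalence is exact, not approximate. A toy sketch with a one-parameter linear model shows why: summing micro-batch gradients and normalising once yields precisely the full-batch gradient, so one accumulated update equals one large-batch update.

```python
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)^2 with respect to w."""
    return 2 * (w * x - y) * x

w = 0.0
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x

# Full batch: average gradient over all examples, then one update.
full_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)

# Gradient accumulation: two micro-batches of two examples each.
accumulated = 0.0
for micro in (batch[:2], batch[2:]):
    accumulated += sum(grad(w, x, y) for x, y in micro)  # no update yet
accumulated /= len(batch)  # normalise once over the whole logical batch

assert accumulated == full_grad  # identical to the full-batch gradient
w -= 0.01 * accumulated          # a single weight update per logical batch
```

The only cost is time: micro-batches run sequentially, trading throughput for the memory the full batch would have needed.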

The learning rate schedule is where the craft gets subtle. Llama 3 405B used a cosine decay schedule: a linear warmup over 8,000 steps to a peak learning rate of 8 × 10⁻⁵, followed by a slow cosine-shaped decline to 8 × 10⁻⁷ over 1.2 million training steps. The warmup is critical — starting with a high learning rate would produce catastrophically large gradient updates early in training when the weights are still random and the loss surface is chaotic.
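The published hyperparameters translate into a short schedule function. A sketch using the standard warmup-plus-cosine form (the exact shape is an assumption; Meta's training code is not public):

```python
import math

def lr_at(step: int,
          warmup: int = 8_000,
          total: int = 1_200_000,
          peak: float = 8e-5,
          floor: float = 8e-7) -> float:
    """Llama-3-style schedule: linear warmup to the peak, then cosine
    decay down to the floor. Hyperparameters from the published run;
    the interpolation details are an illustrative assumption."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(0))          # 0.0: gentle start while weights are random
print(lr_at(8_000))      # 8e-05: peak, reached after warmup
print(lr_at(1_200_000))  # 8e-07: floor, at the end of training
```

Plotting this function reproduces the characteristic shape: a short steep ramp, then a long slow glide that lets the model settle into a good minimum.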

Mixed precision training keeps the whole operation from running out of memory. Modern LLM training uses BF16 (bfloat16) — a 16-bit floating-point format with the same exponent range as FP32 but only 7 bits of mantissa precision (NVIDIA, 2024). Forward and backward passes run in BF16, halving memory requirements. But a master copy of the weights is maintained in FP32 — full 32-bit precision — because small weight updates from gradient descent can be lost in BF16’s limited precision. The optimizer state (momentum, variance in Adam) also stays in FP32.
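Why the FP32 master copy matters can be demonstrated by emulating BF16's 7-bit mantissa in plain Python. (This truncates the low bits for simplicity; hardware BF16 rounds to nearest, but the effect on tiny updates is the same.)

```python
import struct

def bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32
    bit pattern (8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

update = 1e-4          # a typical tiny gradient-descent step
w_bf16 = 1.0           # weight stored only in bf16
w_master = 1.0         # FP32 master copy of the same weight

for _ in range(100):
    w_bf16 = bf16(w_bf16 + update)  # rounds back to 1.0 every step
    w_master += update              # FP32 accumulates the updates

print(w_bf16)    # 1.0 — one hundred updates silently lost
print(w_master)  # ~1.01 — preserved by the FP32 master copy
```

BF16's smallest representable step near 1.0 is about 0.008, so a 0.0001 update vanishes entirely. Accumulate the update in FP32 and nothing is lost — which is exactly why the optimizer keeps its master weights in full precision.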

Checkpointing is the safety net. When you are running 16,384 GPUs for months, hardware failures are not a risk — they are a certainty. Llama 3’s training saved a full checkpoint approximately every 4 minutes, with each checkpoint write taking roughly 2.5 seconds. A checkpoint captures the complete training state: model weights, optimizer state, learning rate position, and random number generator seeds. When a failure occurs — and it will — the training job restarts from the last good checkpoint.
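A minimal sketch of what a checkpoint captures, using pickle for illustration (real systems write sharded, framework-specific formats across the cluster):

```python
import os
import pickle
import random
import tempfile

def save_checkpoint(path, step, weights, opt_state):
    """Capture the complete training state so a crashed job can resume
    exactly where it left off. A minimal sketch of the idea."""
    state = {
        "step": step,
        "weights": weights,        # model parameters
        "optimizer": opt_state,    # e.g. Adam momentum/variance
        "rng": random.getstate(),  # so data sampling replays identically
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng"])  # restore the data-sampling stream
    return state

path = os.path.join(tempfile.gettempdir(), "ckpt.pkl")
save_checkpoint(path, step=42, weights=[0.1, 0.2], opt_state={"m": [0.0, 0.0]})
resumed = load_checkpoint(path)
print(resumed["step"])  # 42 — training resumes from here, not from zero
```

Restoring the RNG state is the detail people forget: without it, the resumed job would sample a different data order and the run would no longer be reproducible.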

The training loop — repeated 1.2 million times over ~3 months:

  1. Sample batch — 4M–16M tokens
  2. Forward pass — BF16 precision
  3. Compute loss — cross-entropy
  4. Backward pass — gradient computation
  5. Update weights — Adam, FP32
  6. Adjust learning rate — cosine decay

[Training dashboard: the cross-entropy loss curve falls as tokens are processed, with occasional spikes; the learning rate decays from 8 × 10⁻⁵ to 8 × 10⁻⁷ on the cosine schedule; GPU utilisation holds at ~40% of H100 peak (990 TFLOPS BF16).]

Loss spikes are normal — the training system detects them, rolls back to a checkpoint, and skips the problematic batch.
That six-step cycle is the entire algorithm. But what actually happens inside each step? Let’s trace one complete iteration on a tiny toy model — 8-word vocabulary, 4-dimensional embeddings — with concrete numbers at every stage.

One training step — worked example (vocab = 8, embed_dim = 4)

Step 1 of 7 — Tokenize input: convert words to token IDs the model can process. The input sequence “The cat sat on” is split into tokens and mapped to integer IDs [0, 1, 2, 3]. The target — the next token the model must predict — is “the” (ID 0).
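The whole trace fits in plain Python. This is a deliberately minimal model — context embeddings averaged into one hidden vector (where a real transformer would use attention), a single output matrix, gradients taken for that matrix only — so every number is inspectable. It illustrates the mechanics, not a real architecture:

```python
import math
import random

random.seed(0)
VOCAB, DIM = 8, 4

# Step 1 -- tokenize: "The cat sat on" -> IDs; target "the" -> ID 0.
context, target = [0, 1, 2, 3], 0

# Random initialisation: embedding table (8x4) and output projection (8x4).
embed = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]

def forward(context):
    # Step 2 -- embed and pool: average the context embeddings into one
    # hidden vector (attention's job in a real transformer).
    h = [sum(embed[t][d] for t in context) / len(context) for d in range(DIM)]
    # Step 3 -- logits: one score per vocabulary entry.
    logits = [sum(W_out[v][d] * h[d] for d in range(DIM)) for v in range(VOCAB)]
    # Step 4 -- softmax: logits -> probability distribution over the vocab.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return h, [e / total for e in exps]

h, probs = forward(context)
loss_before = -math.log(probs[target])  # Step 5 -- cross-entropy loss

# Step 6 -- backward (output layer only): dL/dW_out[v][d] = (p_v - 1{v=y}) * h_d
for v in range(VOCAB):
    err = probs[v] - (1.0 if v == target else 0.0)
    for d in range(DIM):
        W_out[v][d] -= 0.5 * err * h[d]  # Step 7 -- SGD update, lr = 0.5

_, probs = forward(context)
loss_after = -math.log(probs[target])
print(loss_before > loss_after)  # True: one step already reduced the loss
```

Run it and the loss on this single example drops after one update — the same mechanism, scaled to 405 billion weights and 15 trillion tokens, is all that pre-training is.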

04

The Compute Reality

The training loop looks clean on paper. In practice, it is one of the most complex engineering operations on the planet.

Training Llama 3 405B required 30.84 million H100 GPU hours (Meta, 2024). To put that number in perspective: on a single H100, the job would take roughly 3,520 years. Meta compressed that into months by running 16,384 H100 GPUs simultaneously across two custom-built 24,000-GPU clusters.

No single GPU can hold a 405-billion-parameter model. The solution is 4D parallelism — four complementary strategies that split the work across thousands of devices:

  • Tensor Parallelism (TP) splits individual weight matrices across GPUs. A single layer’s weight matrix is sliced, with each GPU computing its portion of the matrix multiplication.
  • Pipeline Parallelism (PP) divides the model vertically — the first 24 transformer blocks on one set of GPUs, the next 24 on another. Different stages process different micro-batches concurrently.
  • Fully Sharded Data Parallelism (FSDP) shards model parameters, gradients, and optimizer states across data-parallel workers. Each GPU only stores a fraction of the full model state.
  • Context Parallelism (CP) splits the sequence dimension across GPUs, enabling training on long contexts without running out of memory.

Meta’s most efficient configuration achieved over 400 TFLOPS per GPU — roughly 40% of the H100’s peak theoretical throughput. That 40% might sound low, but it is a remarkable feat of distributed systems engineering. Communication overhead, synchronisation barriers, and memory management eat the rest.
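These figures are internally consistent, which a back-of-envelope check confirms. The sketch below uses the standard ~6·N FLOPs-per-token approximation for training; the inputs are the article's numbers, and the arithmetic ignores failures and restarts:

```python
# Sanity check: does 400 TFLOPS/GPU across 16,384 GPUs explain a
# months-long run over 15T tokens? Uses the common ~6*N FLOPs-per-token
# estimate for a dense model's forward + backward pass.
n_gpus = 16_384
achieved_flops = 400e12   # 400 TFLOPS per GPU (~40% of 990 TFLOPS peak)
params = 405e9
tokens = 15e12

cluster_flops = n_gpus * achieved_flops        # FLOPs/second, whole cluster
tokens_per_sec = cluster_flops / (6 * params)  # ~6*N FLOPs per token
days = tokens / tokens_per_sec / 86_400

print(round(tokens_per_sec))  # ~2.7M tokens/second across the cluster
print(round(days))            # ~64 days of pure compute
```

Roughly 64 days of uninterrupted compute, before adding failure recovery and overhead — consistent with a wall-clock run of a few months.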

Then there are the failures. Over a 54-day snapshot of the Llama 3 405B training run, Meta logged 419 unexpected component failures — an average of one failure every three hours. The breakdown: 58.7% were GPU-related (including hardware failures and NVLink interconnect errors), and 17.2% were HBM3 memory faults. The remaining failures came from network switches, host machines, and software bugs.

One failure every three hours across 16,384 GPUs. This is the reality of frontier model training. To handle it, Meta built an automated error detection and recovery system: diagnostic tools identify the failed component, the training job pauses, the faulty GPU is isolated, and the job resumes from the last checkpoint. Despite the constant failures, effective training time exceeded 90% — meaning less than 10% of wall-clock time was lost to interruptions.

[Figure: GPU cluster scale — from a single node (8 GPUs, one server) up to 16,384 GPUs.]

Training cost comparison:

  • DeepSeek V3 — $5.6M (2,048 H800s × 57 days)
  • Llama 3 405B — ~$60–90M (16,384 H100s × ~3 months)
  • GPT-4 — ~$78M+ (~25,000 A100s × ~90 days)
  • GPT-4.5 / Grok 4 — $500M+ (scale and duration undisclosed)

Cost has been growing at 2.4× per year since 2016; projected ~$1B by 2027 (Cottier et al., 2024).

Cost varies enormously. Llama 3 405B’s 30.84 million GPU hours, at typical H100 rental rates of $2–3/GPU-hour, implies a compute cost of roughly $60–90 million. GPT-4’s estimated compute cost was approximately $78 million based on Stanford’s 2024 AI Index (Stanford HAI, 2024). Models like GPT-4.5 and Grok 4 are estimated at $500 million or more per training run (Epoch AI, 2025).

Then there is the outlier: DeepSeek V3. This 671-billion-parameter mixture-of-experts model (37 billion active parameters per token) was trained on 14.8 trillion tokens using just 2,048 H800 GPUs over 57 days — a total of 2.788 million GPU hours at a cost of $5.576 million. That is roughly 11 times less compute than Llama 3 405B for a model that achieved competitive performance. But an important caveat: DeepSeek’s figure covers only the final training run. It excludes months of prior research, ablation experiments, and architecture exploration.

The broader trend is unmistakable. The amortised hardware and energy cost of training the most compute-intensive models has been growing at roughly 2.4× per year since 2016 (Cottier et al., 2024). Pre-training is becoming a game that only the wealthiest organisations can afford to play — unless architectures like DeepSeek V3’s MoE approach continue to bend the cost curve.

05

Scaling Laws: The Playbook

Pre-training is expensive. Scaling laws tell you how to spend your budget wisely — or at least, they did, until practitioners started deliberately ignoring them for good reasons.

The first scaling laws came from Kaplan et al. at OpenAI in January 2020. Their paper revealed that language model loss follows a power-law relationship with three variables: model size (parameters), dataset size (tokens), and compute (FLOPs). The trends were remarkably smooth, spanning seven orders of magnitude. Double the compute, and loss drops by a predictable amount.

The practical implication was transformative: you could predict a model’s performance before training it. By running small experiments at low compute budgets and fitting the scaling curves, labs could estimate how a 100-billion-parameter model would perform without actually training one.

Then in March 2022, Hoffmann et al. at DeepMind published a correction. They trained over 400 language models ranging from 70 million to 16 billion parameters and found that Kaplan’s recommendations were systematically wrong about the data side. Kaplan had suggested that most new compute should go to bigger models. Hoffmann showed that model size and data should scale equally.

Their prescriptive result became known as the Chinchilla scaling law: the compute-optimal ratio is approximately 20 tokens per parameter. To prove the point, they trained Chinchilla — a 70B model on 1.4T tokens — using the same compute budget that had trained Gopher (280B parameters on fewer tokens). Chinchilla outperformed Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across the board, achieving 67.5% on MMLU — a 7-percentage-point improvement over the four-times-larger Gopher.

The message was clear: most frontier models in 2022 were undertrained. They had too many parameters and not enough data.

Chinchilla scaling law: ~20 tokens per parameter. A 70B model needs ~1.4T tokens; a 7B model needs ~140B tokens. Scale both equally.

A 2024 replication study from Epoch AI refined the picture slightly, finding the optimal ratio closer to 25.6 tokens per parameter — but the core insight held.
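The rule is easy to apply. Combining the standard training-compute approximation C ≈ 6·N·D with the Chinchilla ratio D = r·N gives N = √(C / 6r) — a sketch of the allocation, ignoring the constants that a full scaling-law fit would add:

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget between parameters N and tokens D using
    C ~= 6*N*D and the Chinchilla ratio D = r*N, giving N = sqrt(C/(6r))."""
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla's own budget: ~6 * 70e9 params * 1.4e12 tokens of FLOPs.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7.00e+10 params, ~1.40e+12 tokens
```

Feeding Chinchilla's own compute budget back in recovers its published configuration — 70B parameters, 1.4T tokens — which is exactly what "compute-optimal" means here.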

And then everyone started ignoring it.

Llama 3’s 8B model was trained on 15 trillion tokens — roughly 1,875 tokens per parameter, nearly 100 times the Chinchilla-optimal amount. The 70B model, at approximately 200 tokens per parameter, is 10× beyond optimal. Even Qwen3-0.6B pushed to an extreme 60,000:1 ratio: 36 trillion tokens on a 600-million-parameter model.

Why deliberately overtrain? Because Chinchilla optimises for training compute — it finds the point where your next dollar of training compute produces the most loss reduction. But in production, inference costs dominate. A model serves millions of requests after training. A smaller model that was “over-trained” at training time is cheaper to serve at inference time. Sardana and Frankle (2024) formalised this: when you account for expected inference demand, the optimal strategy shifts toward training smaller models on far more data.

This is the new playbook: Chinchilla tells you where to start; inference economics tells you where to actually train. The frontier has moved from “bigger is better” (Kaplan) to “right-sized is better” (Chinchilla) to “small but over-trained is cheapest end-to-end” (the inference-optimal era).

Scaling laws — three eras of compute allocation:

  • Kaplan era (2020) — bigger models, less data
  • Chinchilla era (2022) — balanced scaling
  • Inference-optimal era (2024+) — smaller models, more data

[Figure: cross-entropy loss vs. training compute (FLOPs). GPT-3 and Gopher sit above the Chinchilla-optimal line (undertrained); Chinchilla sits on it; Llama 2 70B, Llama 3 8B, Llama 3 405B, and DeepSeek V3 sit below it (over-trained, inference-optimal). The frontier has moved from “bigger is better” to “small but over-trained is cheapest end-to-end.”]

Pre-training gives you a base model — one that can complete text fluently but will not follow instructions, refuse harmful requests, or format answers helpfully. Ask it a question, and it might continue your sentence instead of answering it. Ask for a summary, and it might generate a Wikipedia-style paragraph that wanders off topic.

Turning a base model into an assistant — the kind of model you actually want to talk to — requires fine-tuning. Supervised fine-tuning, RLHF, DPO, LoRA — these techniques are a very different process from pre-training, and they are the subject of Article 5: Fine-Tuning — From Base Model to Assistant.