Fine-Tuning — From Base Model to Assistant
SFT, RLHF, DPO, LoRA, and the art of teaching a model to be useful
You have spent four articles building up to this moment. You understand how tokens flow into the model, how self-attention routes information, and how trillions of tokens of pre-training bake knowledge into the weights. The result is a model that can do something extraordinary — predict plausible next tokens for almost any input.
And that is exactly the problem.
Ask a pre-trained base model “What is the capital of France?” and it might continue with “What is the capital of Germany? What is the capital of Spain?” — because completing a quiz is the most likely continuation. It will not answer your question. It will not refuse a harmful request. It will not format a response in JSON when you ask for it. A base model is a brilliant autocomplete engine with no idea that it is supposed to be helpful.
Fine-tuning is how you cross that gap. Not by teaching the model new knowledge — it already has more than enough — but by teaching it new behavior. How to listen. How to respond. How to say “I can’t help with that.” This article covers the entire fine-tuning stack: supervised fine-tuning, reinforcement learning from human feedback, direct preference optimization, parameter-efficient methods like LoRA, and the critical decision of when to fine-tune at all.
01
The Gap Between Completion and Conversation
What does a base model actually do when you prompt it? As we covered in What Is a Large Language Model?, these models are trained on a single objective: predict the next token. They learn grammar, facts, reasoning, and even code — all as a side effect of that pressure. But nothing in the training objective says “be helpful” or “answer questions.”
The difference between a base model and the ChatGPT or Claude you interact with daily is post-training — a set of techniques applied after pre-training that reshape the model’s behavior without fundamentally altering its knowledge.
Base model: continues the pattern, appending more user and assistant turns. Instruction-tuned: answers the question. Same weights, same knowledge — different behavior.
This gap is not a flaw in pre-training — it is a design feature. The base model is a general-purpose language engine. Post-training specializes it. And the specialization stack has evolved rapidly.
The modern post-training pipeline looks like this:
- Supervised Fine-Tuning (SFT) — teach the model to follow instructions by training on (prompt, response) pairs
- Preference optimization (RLHF or DPO) — teach the model which responses humans prefer
- RL with verifiable rewards (GRPO) — teach the model to reason step by step
Each stage addresses a different failure mode. SFT gives the model a conversational format. Preference optimization steers it toward better responses. RL with verifiable rewards builds deeper reasoning. Not every model needs all three stages, but frontier models like Llama 3.1, GPT-4, and DeepSeek-R1 use variations of this full stack.
The post-training pipeline:

1. Pre-trained base — next-token prediction on trillions of tokens
2. SFT — learn to follow instructions from demonstration pairs
3. DPO / RLHF — learn which responses humans prefer
4. GRPO — learn to reason via verifiable rewards

Not every model needs all stages — but frontier models use the full stack.
There is, however, a cost. Recent research shows that the processes of supervised fine-tuning and alignment can degrade certain capabilities — a phenomenon called the alignment tax. A January 2026 study found that instruction-tuned models underperformed their base counterparts by up to 32.7% on domain-specific benchmarks in zero-shot settings (arXiv, 2026). Alignment enhances reliability but constrains flexibility. The art of post-training is maximizing one without destroying the other.
02
Supervised Fine-Tuning (SFT)
SFT is the simplest and most foundational step: take your pre-trained model and train it further on a curated dataset of (instruction, desired response) pairs. The loss function is the same cross-entropy we covered in Pre-Training — but now the training data looks very different.
Instead of raw internet text, SFT data consists of conversations. Each example has a system prompt (setting the model’s persona), a user message (the instruction), and an assistant response (the ideal output). The model learns to predict the tokens of the assistant response — while the loss on prompt tokens is masked so the model is not rewarded for predicting the instruction itself (Meta, 2024).
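A minimal sketch of that prompt masking in PyTorch, for a single unbatched example (the shapes and the `prompt_len` argument are illustrative, not any library's API):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy over assistant tokens only; prompt tokens are masked.

    logits:     (seq_len, vocab_size) model outputs for one example
    input_ids:  (seq_len,) token ids for prompt + response
    prompt_len: number of tokens in the system + user prompt
    """
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:-1]
    labels = input_ids[1:].clone()
    # Mask the prompt: -100 is PyTorch's "ignore this position" label,
    # so the model is never rewarded for predicting the instruction itself.
    labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, labels, ignore_index=-100)
```

Because masked positions are ignored entirely, perturbing the model's predictions on prompt tokens leaves the loss unchanged — only the assistant response contributes gradient.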
An SFT training example:

system: You are a helpful assistant.
user: What causes ocean tides?
assistant: Ocean tides are primarily caused by the gravitational pull of the Moon and, to a lesser extent, the Sun. The Moon's gravity creates a bulge of water on the side of Earth nearest to it, and a corresponding bulge on the opposite side due to inertia. As Earth rotates, coastlines pass through these bulges, producing the regular cycle of high and low tides.
What makes a good SFT example
Data quality dominates data quantity. The InstructGPT paper — the work that proved RLHF could align language models — used just ~13,000 SFT prompts written by roughly 40 human labelers (Ouyang et al., NeurIPS 2022). Not millions. Not even hundreds of thousands. The insight was that a small number of high-quality demonstration examples can shift model behavior more effectively than a large, noisy dataset.
When SFT alone is enough. If your goal is to make a model follow a specific output format (always return JSON), adopt a consistent persona (respond as a customer support agent), or handle a narrow domain (medical intake forms), SFT may be all you need. The model already has the knowledge from pre-training — you are just teaching it how to present that knowledge.
When SFT is not enough. SFT teaches the model a way to respond. It does not teach the model which response is better among multiple valid options. For that, you need preference data — and that is where RLHF and DPO enter.
03
RLHF: Learning from Human Preferences
The limitation of SFT is that it trains on a single “gold” response per prompt. But in practice, there are many ways to answer a question — some more helpful, more accurate, or more concise than others. How do you teach a model that subtle quality gradient?
Reinforcement Learning from Human Feedback (RLHF) answers this with a three-step process that was first demonstrated at scale in the InstructGPT paper (Ouyang et al., 2022).
Step 1: Collect preference data. Human labelers are shown two or more model responses to the same prompt and asked to rank them. “Response A is better than Response B.” This is fundamentally easier than writing ideal responses — comparing is cognitively cheaper than creating.
Step 2: Train a reward model. A separate neural network (often a smaller LLM) learns to predict the human preference scores. Given any (prompt, response) pair, the reward model outputs a scalar score representing “how good a human would rate this.”
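The reward model is typically trained with a pairwise (Bradley–Terry) loss that pushes the chosen response's scalar score above the rejected one's. A sketch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: maximize the probability that the
    human-preferred response outscores the rejected one.

    r_chosen, r_rejected: (batch,) scalar scores from the reward model.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

When the two scores are equal the loss is log 2 (a coin flip); as the margin grows the loss approaches zero.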
Step 3: Optimize with PPO. The LLM generates responses, the reward model scores them, and Proximal Policy Optimization (PPO) adjusts the LLM’s weights to maximize the reward. A KL-divergence penalty prevents the model from drifting too far from the SFT baseline — without this anchor, the model can “hack” the reward model by producing degenerate outputs.
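The quantity PPO actually maximizes can be sketched at the sequence level (a simplification: real implementations apply the KL penalty per token and estimate it from samples):

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Sequence-level RLHF reward: the reward model's score minus a KL
    penalty that tethers the policy to the frozen SFT baseline.

    logp_policy / logp_ref: total log-prob of the sampled response under
    the current policy and under the SFT reference model.
    beta: penalty strength (illustrative value).
    """
    kl_estimate = logp_policy - logp_ref  # grows as the policy drifts
    return rm_score - beta * kl_estimate
```

If the policy assigns the same probability to its output as the SFT model does, the penalty is zero; the more it drifts toward reward-hacking outputs the reward model loves but the SFT model finds implausible, the larger the deduction.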
The RLHF loop — InstructGPT (2022):

- Collect preferences: human labelers compare two model responses and pick the better one
- Train a reward model: a smaller model learns to predict human preference scores from (prompt, response) pairs
- Optimize with PPO: generate → score → update weights → repeat, with a KL penalty tethering the policy to the SFT baseline
The downsides. RLHF is expensive, fragile, and complex. You need a separate reward model, a separate PPO optimizer, careful hyperparameter tuning, and enough GPU memory to hold both the policy model and the reward model simultaneously. Meta noted in the Llama 3.1 technical report that PPO was “less stable and more challenging to scale” compared to simpler alternatives (Meta, 2024). Which brings us to the method that simplified the entire process.
04
DPO: Cutting Out the Middleman
In 2023, Rafailov et al. asked a sharp question: if the reward model is just learning a function of the LLM’s own probabilities, why not skip it entirely?
The answer was Direct Preference Optimization (DPO) — a method that uses the same preference data as RLHF but optimizes the LLM directly, without a separate reward model and without reinforcement learning (Rafailov et al., NeurIPS 2023).
The key insight: there is a closed-form mapping between the optimal reward function and the optimal policy. DPO exploits this by treating the LLM itself as an implicit reward model. The training objective becomes a simple classification loss: given a pair of responses (one preferred, one rejected), increase the probability of the preferred response and decrease the probability of the rejected one.
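Assuming the sequence-level log-probs of both responses have already been computed under the policy and the frozen SFT reference, the published DPO objective reduces to a few lines (`beta` controls how far the policy may deviate from the reference):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO classification loss (Rafailov et al., 2023).

    logp_w / logp_l:         log-probs of the preferred (w) and rejected (l)
                             responses under the current policy
    ref_logp_w / ref_logp_l: the same quantities under the frozen SFT
                             reference model
    """
    # The policy's implicit reward for each response: beta * log(pi / pi_ref)
    chosen_reward = beta * (logp_w - ref_logp_w)
    rejected_reward = beta * (logp_l - ref_logp_l)
    # Logistic loss on the reward margin: raise the probability of the
    # preferred response, lower the probability of the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

No reward model, no sampling loop, no PPO machinery: just two forward passes per model and a classification loss, which is why DPO trains with the stability of supervised learning.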
The impact was immediate. Meta adopted DPO as a core component of the Llama 3 / 3.1 post-training pipeline — six rounds total, with fresh preference annotations each time. During DPO, only preference pairs where the chosen response was rated “significantly better” or “better” were used; “similar” pairs were discarded (Meta, 2024).
RLHF vs. DPO — cutting out the middleman. RLHF needs four stages (SFT, preference collection, reward-model training, PPO); DPO collapses this to three by dropping the reward model entirely.

DPO successors:

- SimPO — no reference model; uses the average log-probability of a response as an implicit reward
- ORPO — merges SFT and preference optimization into one stage
- KTO — works with thumbs-up/down labels instead of pairwise comparisons
The progression from RLHF to DPO to SimPO/ORPO/KTO follows a consistent trajectory: simpler algorithms, cheaper compute, comparable or better results. The field has largely moved beyond RLHF for preference alignment — though reinforcement learning is making a dramatic comeback for a different purpose.
05
GRPO and the Reasoning Revolution
In January 2025, DeepSeek published a result that shifted the trajectory of the field. DeepSeek-R1-Zero — trained with pure reinforcement learning and no supervised fine-tuning — spontaneously developed chain-of-thought reasoning, self-verification, and reflection behaviors (DeepSeek-R1, Jan 2025; Nature, 2025).
The algorithm behind it was GRPO — Group Relative Policy Optimization. GRPO is conceptually simple: for each prompt, generate a group of responses. Score each response with a verifiable reward — did the math answer match the ground truth? Did the code compile and pass tests? Then compute advantages relative to the group rather than using a learned critic model.
This is the critical difference from PPO. In PPO, a separate critic network estimates the value of each state. In GRPO, there is no critic — the group itself serves as the baseline. If your response scored higher than the group average, your advantage is positive; if lower, negative. This cuts memory requirements by roughly 50% compared to PPO (arXiv, Mar 2025).
GRPO — group relative policy optimization

Prompt: Solve: What is 247 × 38? (Correct answer: 9,386)

Six sampled responses, each scored 1 if the final answer is correct and 0 otherwise:

- 247 × 38 = 247 × 40 − 247 × 2 = 9,880 − 494 = 9,386 → reward 1
- 247 × 38 = 200 × 38 + 47 × 38 = 7,600 + 1,786 = 9,386 → reward 1
- 247 × 38 ≈ 250 × 38 = 9,500 → reward 0
- 247 × 38 = 247 × 30 + 247 × 8 = 7,410 + 1,976 = 9,386 → reward 1
- 247 × 38 = 247 × 40 − 247 = 9,880 − 247 = 9,633 → reward 0
- 247 × 38... let me multiply directly... = 9,386 → reward 1

Each response's advantage is computed relative to the group mean reward (0.67). No reward model. No critic. The group is the baseline.
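From a list of group rewards like the one above (four correct answers out of six), the advantages are a few lines. This sketch normalizes by the group's standard deviation, as in DeepSeek's formulation:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled response against the
    group mean, normalized by the group std. No learned critic needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # A zero-variance group (all right or all wrong) carries no signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

With rewards [1, 1, 0, 1, 0, 1], every correct response gets the same positive advantage and every wrong one the same negative advantage: the update pushes probability mass toward whatever the successful responses did, with no model of why they succeeded.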
What emerged was remarkable. Without any examples of chain-of-thought reasoning in its training data, DeepSeek-R1-Zero began producing multi-step reasoning traces, pausing to verify its own work, and exploring alternative solution paths. The model discovered reasoning as an emergent strategy for maximizing reward.
This result is reshaping the post-training landscape. The modern stack is increasingly: SFT for format, DPO for preference alignment, GRPO for reasoning (Raschka, 2025).
06
Parameter-Efficient Fine-Tuning: LoRA and Friends
Every technique discussed so far has a practical problem: full fine-tuning means updating every parameter in the model. For a 70-billion-parameter model, the weights alone occupy 140 GB at 16-bit precision, and optimizer states and gradients multiply that footprint — the domain of multiple A100 or H100 GPUs.
LoRA (Low-Rank Adaptation) changed this equation entirely. The core insight, published by Hu et al. in 2021, is that the weight updates during fine-tuning have a low “intrinsic rank.” You do not need to change every dimension of every weight matrix — a small, low-rank approximation captures the behavioral shift.
In practice: freeze the entire pre-trained model. For each target weight matrix W (of size d × k), inject two small matrices: A (d × r) and B (r × k), where r is the rank — typically 4 to 64, compared to original dimensions of 4,096 to 12,288. The effective weight becomes W + AB. Only A and B are trained.
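To make the shapes concrete, here is a minimal sketch of a LoRA layer in PyTorch, following the article's convention of A (d × r), B (r × k), and effective weight W + AB. The class name and the `alpha` scaling are illustrative; production implementations such as Hugging Face's peft add dropout, target-module selection, and weight merging:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update AB."""

    def __init__(self, W: torch.Tensor, r: int = 16, alpha: float = 32.0):
        super().__init__()
        d, k = W.shape
        self.W = nn.Parameter(W, requires_grad=False)    # frozen base weight
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(r, k))         # zero init: AB = 0,
        self.scale = alpha / r                           # so training starts
                                                         # at the base model
    def forward(self, x):
        # x: (..., d). The effective weight is W + scale * A @ B, but we
        # never materialize it: (x @ A) @ B is far cheaper at small r.
        return x @ self.W + self.scale * (x @ self.A) @ self.B
```

Because B starts at zero, the layer is exactly the frozen base model at step 0, and only the low-rank pair receives gradients — at r = 16 on a 4,096 × 4,096 matrix, that is 131K trainable parameters instead of 16.8M.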
LoRA — low-rank adaptation (example: d = k = 4,096, r = 16):

- Full weight matrix W (4,096 × 4,096): 16.8M params per layer
- Adapters A (4,096 × 16) and B (16 × 4,096): 131K params per layer
- Reduction: 99.2%
QLoRA pushed this further. Dettmers et al. (NeurIPS 2023) combined LoRA with 4-bit quantization of the base model, introducing NF4 (4-bit NormalFloat) — a data type that is information-theoretically optimal for normally distributed weights. The result: fine-tuning a 65-billion-parameter model on a single 48 GB GPU. Their QLoRA-tuned model, Guanaco, reached 99.3% of ChatGPT’s performance after just 24 hours of training (Dettmers et al., NeurIPS 2023).
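To illustrate the blockwise idea, here is a deliberately simplified sketch using uniform int4 levels with per-block absmax scaling. Note the hedge: NF4 itself uses non-uniform, quantile-spaced levels tuned to the normal distribution of trained weights, which this uniform version does not reproduce:

```python
import torch

def quantize_4bit(w, block_size=64):
    """Simplified blockwise 4-bit quantization: uniform int4 levels with
    one absmax scale per block (QLoRA's NF4 keeps the same blockwise
    structure but replaces the uniform grid with quantile-spaced levels)."""
    w = w.reshape(-1, block_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7   # symmetric int4: -7..7
    scale = scale.clamp(min=1e-8)                   # guard all-zero blocks
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover an approximate float weight for the frozen base model."""
    return (q.float() * scale).reshape(shape)
```

The frozen base weights are stored as 4-bit codes plus one scale per block and dequantized on the fly during the forward pass, while the LoRA adapters themselves stay in 16-bit and receive all the gradients.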
Cost democratization:

- Full fine-tuning (7B): ~$50,000 (H100 cluster)
- QLoRA (7B): ~$1,500 (single RTX 4090)
07
When to Fine-Tune (and When Not To)
You now understand the full toolkit. The harder question is: should you fine-tune at all?
The answer depends on what you are trying to fix. There are three levers for customizing LLM behavior, and they form a natural escalation path:
Prompt engineering is your first move. It requires no training data, no GPUs, and no model modifications. If you can get the behavior you need by crafting better prompts — with few-shot examples, system instructions, or structured output formats — stop here.
Retrieval-Augmented Generation (RAG) is your second move. If the model lacks specific, current, or proprietary knowledge, don’t fine-tune it in — retrieve it at inference time. As we covered in RAG From the Ground Up, RAG keeps the knowledge layer separate from the model’s behavior layer.
Fine-tuning is your third move — and only when the first two are insufficient. Fine-tuning excels at changing behavior, not injecting knowledge. Use it when you need:
- A consistent output format across thousands of requests (always return valid JSON with specific fields)
- Domain-specific tone or persona (legal writing, medical communication, customer support)
- Reliable policy adherence (always cite sources, never reveal system prompts, refuse specific categories)
- Latency reduction (encode instructions into weights so you need shorter prompts)
When to fine-tune — decision flowchart. The highest-performing systems combine all three: fine-tuning for behavior, RAG for knowledge, prompt engineering for per-request steering.
A final caution. Fine-tuning is a one-time investment that changes the model permanently (or until you fine-tune again). Prompt engineering is per-request overhead that adds latency and cost. The trade-off is upfront engineering time versus ongoing inference cost. For high-volume production systems, fine-tuning a shorter prompt into the model’s weights can cut per-request token costs by 30–50% — paying for itself within weeks.
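The break-even arithmetic is straightforward. A sketch with purely illustrative numbers (none of these prices or volumes are quoted from any provider):

```python
def breakeven_days(tuning_cost, requests_per_day,
                   tokens_saved_per_request, price_per_1k_tokens):
    """Days until a one-time fine-tuning investment is repaid by the
    shorter prompts it enables. All inputs are illustrative assumptions."""
    daily_savings = (requests_per_day * tokens_saved_per_request / 1000
                     * price_per_1k_tokens)
    return tuning_cost / daily_savings

# Hypothetical: a $1,500 QLoRA run, 200k requests/day, 500 prompt tokens
# saved per request, $0.001 per 1k input tokens -> $100/day in savings,
# so the run pays for itself in about two weeks.
```

The same formula also shows when fine-tuning does not pay: at low request volume or small token savings, the denominator shrinks and the break-even horizon stretches into years.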
You have a trained, fine-tuned model. It follows instructions, prefers helpful responses, and maybe even reasons through multi-step problems. But deploying it to serve real users at scale introduces an entirely new set of challenges: memory management, latency budgets, throughput targets, and cost optimization.
In Article 6 — Inference and Serving, we will cover quantization, KV caching, continuous batching, speculative decoding, and the engineering that makes every API call possible.