Fine-Tuning — From Base Model to Assistant

SFT, RLHF, DPO, LoRA, and the art of teaching a model to be useful

You have spent four articles building up to this moment. You understand how tokens flow into the model, how self-attention routes information, and how trillions of tokens of pre-training bake knowledge into the weights. The result is a model that can do something extraordinary — predict plausible next tokens for almost any input.

And that is exactly the problem.

Ask a pre-trained base model “What is the capital of France?” and it might continue with “What is the capital of Germany? What is the capital of Spain?” — because completing a quiz is the most likely continuation. It will not answer your question. It will not refuse a harmful request. It will not format a response in JSON when you ask for it. A base model is a brilliant autocomplete engine with no idea that it is supposed to be helpful.

Fine-tuning is how you cross that gap. Not by teaching the model new knowledge — it already has more than enough — but by teaching it new behavior. How to listen. How to respond. How to say “I can’t help with that.” This article covers the entire fine-tuning stack: supervised fine-tuning, reinforcement learning from human feedback, direct preference optimization, parameter-efficient methods like LoRA, and the critical decision of when to fine-tune at all.

01

The Gap Between Completion and Conversation

What does a base model actually do when you prompt it? As we covered in What Is a Large Language Model?, these models are trained on a single objective: predict the next token. They learn grammar, facts, reasoning, and even code — all as a side effect of that pressure. But nothing in the training objective says “be helpful” or “answer questions.”

The difference between a base model and the ChatGPT or Claude you interact with daily is post-training — a set of techniques applied after pre-training that reshape the model’s behavior without fundamentally altering its knowledge.

Base Model

Continues the pattern

User

Summarize the key benefits of microservices architecture.

Assistant

Summarize the key benefits of serverless architecture. Summarize the key benefits of event-driven architecture. Summarize the key drawbacks of microservices architecture. Compare microservices and monolithic architecture.

Instruction-Tuned

Answers the question

User

Summarize the key benefits of microservices architecture.

Assistant

Microservices architecture offers several key benefits:

  1. Independent deployment — teams can ship changes to individual services without coordinating full-system releases.
  2. Technology flexibility — each service can use the language, framework, or database best suited to its task.
  3. Fault isolation — a failure in one service does not cascade to the entire system.
  4. Scalability — individual services can be scaled independently based on demand.

Same weights, same knowledge — different behavior

This gap is not a flaw in pre-training — it is a design feature. The base model is a general-purpose language engine. Post-training specializes it. And the specialization stack has evolved rapidly.

The modern post-training pipeline looks like this:

  1. Supervised Fine-Tuning (SFT) — teach the model to follow instructions by training on (prompt, response) pairs
  2. Preference optimization (RLHF or DPO) — teach the model which responses humans prefer
  3. RL with verifiable rewards (GRPO) — teach the model to reason step by step

Each stage addresses a different failure mode. SFT gives the model a conversational format. Preference optimization steers it toward better responses. RL with verifiable rewards builds deeper reasoning. Not every model needs all three stages, but frontier models like Llama 3.1, GPT-4, and DeepSeek-R1 use variations of this full stack.

The post-training pipeline

01

Pre-Trained Base

Next-token prediction on trillions of tokens

02

SFT

Learn to follow instructions from demonstration pairs

03

DPO / RLHF

Learn which responses humans prefer

04

GRPO

Learn to reason via verifiable rewards

Not every model needs all stages — but frontier models use the full stack

There is, however, a cost. Recent research shows that supervised fine-tuning and alignment can degrade certain capabilities — a phenomenon called the alignment tax. A January 2026 study found that instruction-tuned models underperformed their base counterparts by up to 32.7% on domain-specific benchmarks in zero-shot settings (arXiv, 2026). Alignment enhances reliability but constrains flexibility. The art of post-training is maximizing one without destroying the other.

02

Supervised Fine-Tuning (SFT)

SFT is the simplest and most foundational step: take your pre-trained model and train it further on a curated dataset of (instruction, desired response) pairs. The loss function is the same cross-entropy we covered in Pre-Training — but now the training data looks very different.

Instead of raw internet text, SFT data consists of conversations. Each example has a system prompt (setting the model’s persona), a user message (the instruction), and an assistant response (the ideal output). The model learns to predict the tokens of the assistant response — while the loss on prompt tokens is masked so the model is not rewarded for predicting the instruction itself (Meta, 2024).
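
The masking is simple to express in code. Here is a minimal sketch in plain Python — the token ids are made-up illustrative values, and `-100` follows the common convention of cross-entropy implementations that skip positions carrying that ignore index:

```python
def build_sft_labels(prompt_ids, response_ids, ignore_index=-100):
    """Labels for causal-LM cross-entropy in SFT.

    Prompt positions get ignore_index so they contribute no loss;
    the model is only graded on predicting the assistant response.
    """
    return [ignore_index] * len(prompt_ids) + list(response_ids)

# Hypothetical token ids (not from a real tokenizer)
prompt_ids = [101, 2054, 1029]      # system + user tokens
response_ids = [3029, 1012, 102]    # assistant tokens

labels = build_sft_labels(prompt_ids, response_ids)
# → [-100, -100, -100, 3029, 1012, 102]
```

At training time, the full conversation still flows through the model — the mask only removes the prompt positions from the loss, not from the context.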

SFT training examples

system

You are a helpful assistant.

user

What causes ocean tides?

assistant

Ocean tides are primarily caused by the gravitational pull of the Moon and, to a lesser extent, the Sun. The Moon's gravity creates a bulge of water on the side of Earth nearest to it, and a corresponding bulge on the opposite side due to inertia. As Earth rotates, coastlines pass through these bulges, producing the regular cycle of high and low tides.

What makes a good SFT example

  • Specific, unambiguous instruction
  • Complete, well-structured response
  • Correct formatting for the task type
  • Appropriate refusals for harmful requests
  • Consistent persona across examples

Data quality dominates data quantity. The InstructGPT paper — the work that proved RLHF could align language models — used just ~13,000 SFT prompts written by roughly 40 human labelers (Ouyang et al., NeurIPS 2022). Not millions. Not even hundreds of thousands. The insight was that a small number of high-quality demonstration examples can shift model behavior more effectively than a large, noisy dataset.

When SFT alone is enough. If your goal is to make a model follow a specific output format (always return JSON), adopt a consistent persona (respond as a customer support agent), or handle a narrow domain (medical intake forms), SFT may be all you need. The model already has the knowledge from pre-training — you are just teaching it how to present that knowledge.

When SFT is not enough. SFT teaches the model a way to respond. It does not teach the model which response is better among multiple valid options. For that, you need preference data — and that is where RLHF and DPO enter.

03

RLHF: Learning from Human Preferences

The limitation of SFT is that it trains on a single “gold” response per prompt. But in practice, there are many ways to answer a question — some more helpful, more accurate, or more concise than others. How do you teach a model that subtle quality gradient?

Reinforcement Learning from Human Feedback (RLHF) answers this with a three-step process that was first demonstrated at scale in the InstructGPT paper (Ouyang et al., 2022).

Step 1: Collect preference data. Human labelers are shown two or more model responses to the same prompt and asked to rank them. “Response A is better than Response B.” This is fundamentally easier than writing ideal responses — comparing is cognitively cheaper than creating.

Step 2: Train a reward model. A separate neural network (often a smaller LLM) learns to predict the human preference scores. Given any (prompt, response) pair, the reward model outputs a scalar score representing “how good a human would rate this.”

Step 3: Optimize with PPO. The LLM generates responses, the reward model scores them, and Proximal Policy Optimization (PPO) adjusts the LLM’s weights to maximize the reward. A KL-divergence penalty prevents the model from drifting too far from the SFT baseline — without this anchor, the model can “hack” the reward model by producing degenerate outputs.
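
The reward model in Step 2 is typically trained with a Bradley–Terry pairwise objective: the probability that the chosen response beats the rejected one is a sigmoid of their score difference. A minimal sketch in plain Python, with scalar scores standing in for reward-model outputs:

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log sigmoid(score_chosen - score_rejected).

    The loss shrinks as the reward model scores the human-preferred
    response above the rejected one, and grows when it ranks them
    the wrong way around.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A tied pair is maximally uncertain: loss = log 2
print(round(pairwise_reward_loss(0.5, 0.5), 4))   # → 0.6931
# A confidently correct ranking drives the loss toward zero
print(round(pairwise_reward_loss(3.0, -1.0), 4))  # → 0.0181
```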

The RLHF loop — InstructGPT (2022)

Step 1: Collect Preferences

Human labelers compare two model responses and pick the better one

33K comparisons, ~40 labelers

Step 2: Train Reward Model

A smaller model learns to predict human preference scores from (prompt, response) pairs

27% inter-annotator disagreement

Step 3: PPO Optimization

Generate → Score → Update weights → Repeat, with KL penalty tethering to SFT baseline

Reward: 0.73 → 0.81 → 0.89
Key result: The 1.3B-parameter InstructGPT was preferred by human labelers over the 175B GPT-3 — RLHF outperformed a 100× increase in model size.

The downsides. RLHF is expensive, fragile, and complex. You need a separate reward model, a separate PPO optimizer, careful hyperparameter tuning, and enough GPU memory to hold both the policy model and the reward model simultaneously. Meta noted in the Llama 3.1 technical report that PPO was “less stable and more challenging to scale” compared to simpler alternatives (Meta, 2024). Which brings us to the method that simplified the entire process.

04

DPO: Cutting Out the Middleman

In 2023, Rafailov et al. asked a sharp question: if the reward model is just learning a function of the LLM’s own probabilities, why not skip it entirely?

The answer was Direct Preference Optimization (DPO) — a method that uses the same preference data as RLHF but optimizes the LLM directly, without a separate reward model and without reinforcement learning (Rafailov et al., NeurIPS 2023).

The key insight: there is a closed-form mapping between the optimal reward function and the optimal policy. DPO exploits this by treating the LLM itself as an implicit reward model. The training objective becomes a simple classification loss: given a pair of responses (one preferred, one rejected), increase the probability of the preferred response and decrease the probability of the rejected one.
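
The objective fits in a few lines. A minimal sketch with scalar log-probabilities — in practice these are summed token log-probs of each full response under the policy and a frozen reference model, and beta is the KL-strength hyperparameter (commonly around 0.1):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Arguments are log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    The policy acts as its own implicit reward model:
    reward = beta * (log pi - log ref).
    """
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before any training the policy equals the reference: loss = log 2
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # → 0.6931
```

Gradient descent on this loss pushes the policy's probability mass toward chosen responses and away from rejected ones, while the reference terms keep it anchored — the same role the KL penalty plays in PPO.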

The impact was immediate. Meta adopted DPO as a core component of the Llama 3 / 3.1 post-training pipeline — six rounds total, with fresh preference annotations each time. During DPO, only preference pairs where the chosen response was rated “significantly better” or “better” were used; “similar” pairs were discarded (Meta, 2024).

RLHF vs. DPO — cutting out the middleman

RLHF (4 stages)

Preference Data
Train Reward Model
PPO Loop
Fine-tuned LLM

DPO (3 stages)

Preference Data
Direct Optimization
Fine-tuned LLM

Comparison

                RLHF        DPO
Complexity      High        Low
Stability       Fragile     Stable
Compute cost    High        Moderate
Memory          2 models    1 model + ref
Performance     Strong      Comparable

DPO successors

SimPO (NeurIPS 2024)

No reference model — uses average log-prob as reward

ORPO (2024)

Merges SFT + preference optimization into one stage

KTO (2024)

Works with thumbs-up/down instead of pairwise comparisons

The progression from RLHF to DPO to SimPO/ORPO/KTO follows a consistent trajectory: simpler algorithms, cheaper compute, comparable or better results. The field has largely moved beyond RLHF for preference alignment — though reinforcement learning is making a dramatic comeback for a different purpose.

05

GRPO and the Reasoning Revolution

In January 2025, DeepSeek published a result that shifted the trajectory of the field. DeepSeek-R1-Zero — trained with pure reinforcement learning and no supervised fine-tuning — spontaneously developed chain-of-thought reasoning, self-verification, and reflection behaviors (DeepSeek-R1, Jan 2025; Nature, 2025).

The algorithm behind it was GRPO — Group Relative Policy Optimization. GRPO is conceptually simple: for each prompt, generate a group of responses. Score each response with a verifiable reward — did the math answer match the ground truth? Did the code compile and pass tests? Then compute advantages relative to the group rather than using a learned critic model.

This is the critical difference from PPO. In PPO, a separate critic network estimates the value of each state. In GRPO, there is no critic — the group itself serves as the baseline. If your response scored higher than the group average, your advantage is positive; if lower, negative. This cuts memory requirements by roughly 50% compared to PPO (arXiv, Mar 2025).
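
The group-relative advantage computation is a few lines of arithmetic. A minimal sketch, following the common formulation that also normalizes by the group's standard deviation; rewards here are 1 for a correct answer and 0 for a wrong one:

```python
def group_advantages(rewards):
    """Advantage of each sampled response relative to its own group.

    No learned critic: the group mean is the baseline, and the group
    standard deviation rescales the advantages.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std if std > 0 else 1.0) for r in rewards]

# Six sampled responses: four correct (reward 1), two wrong (reward 0)
rewards = [1, 1, 0, 1, 0, 1]
print([round(a, 2) for a in group_advantages(rewards)])
# → [0.71, 0.71, -1.41, 0.71, -1.41, 0.71]
```

The advantages always sum to zero within a group: responses that beat their siblings are reinforced, responses that lose are suppressed, and no separate value network is ever trained.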

GRPO — group relative policy optimization

Prompt

Solve: What is 247 × 38?

Correct answer: 9,386

Response 1 (✓ Correct)

247 × 38 = 247 × 40 − 247 × 2 = 9,880 − 494 = 9,386

Answer: 9,386

Response 2 (✓ Correct)

247 × 38 = 200 × 38 + 47 × 38 = 7,600 + 1,786 = 9,386

Answer: 9,386

Response 3 (✗ Wrong)

247 × 38 ≈ 250 × 38 = 9,500

Answer: 9,500

Response 4 (✓ Correct)

247 × 38 = 247 × 30 + 247 × 8 = 7,410 + 1,976 = 9,386

Answer: 9,386

Response 5 (✗ Wrong)

247 × 38 = 247 × 40 − 247 = 9,880 − 247 = 9,633

Answer: 9,633

Response 6 (✓ Correct)

247 × 38... let me multiply directly... = 9,386

Answer: 9,386

Group advantage (mean reward: 0.67)

R1 +0.33 · R2 +0.33 · R3 −0.67 · R4 +0.33 · R5 −0.67 · R6 +0.33

No reward model. No critic. The group is the baseline.

What emerged was remarkable. Without any examples of chain-of-thought reasoning in its training data, DeepSeek-R1-Zero began producing multi-step reasoning traces, pausing to verify its own work, and exploring alternative solution paths. The model discovered reasoning as an emergent strategy for maximizing reward.

This result is reshaping the post-training landscape. The modern stack is increasingly: SFT for format, DPO for preference alignment, GRPO for reasoning (Raschka, 2025).

06

Parameter-Efficient Fine-Tuning: LoRA and Friends

Every technique discussed so far has a practical problem: full fine-tuning means updating every parameter in the model. For a 70-billion-parameter model, that requires hundreds of gigabytes of VRAM for the weights, optimizer states, and gradients — the domain of multiple A100 or H100 GPUs.

LoRA (Low-Rank Adaptation) changed this equation entirely. The core insight, published by Hu et al. in 2021, is that the weight updates during fine-tuning have a low “intrinsic rank.” You do not need to change every dimension of every weight matrix — a small, low-rank approximation captures the behavioral shift.

In practice: freeze the entire pre-trained model. For each target weight matrix W (of size d × k), inject two small matrices: A (d × r) and B (r × k), where r is the rank — typically 4 to 64, compared to original dimensions of 4,096 to 12,288. The effective weight becomes W + AB. Only A and B are trained.
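
The parameter savings are simple arithmetic. A quick sketch for a 4096 × 4096 weight matrix at rank r = 16:

```python
def lora_param_counts(d, k, r):
    """Trainable-parameter comparison for one weight matrix.

    Full fine-tuning updates all d*k entries of W; LoRA trains only
    A (d x r) and B (r x k), leaving W frozen.
    """
    full = d * k
    lora = d * r + r * k
    return full, lora, 1.0 - lora / full

full, lora, reduction = lora_param_counts(4096, 4096, 16)
print(f"{full:,} vs {lora:,} trainable ({reduction:.1%} reduction)")
# → 16,777,216 vs 131,072 trainable (99.2% reduction)
```

Because the effective weight is W + AB, the adapter can also be merged back into W after training, so inference pays no extra latency.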

LoRA — low-rank adaptation

Frozen: W (4096 × 4096, 16.8M params)  +  trained: A (4096 × 16) × B (16 × 4096)

Full params/layer: 16.8M
LoRA params/layer: 131K
Reduction: 99.2%

PEFT landscape (2025–2026)

Method          Key idea                                         Trade-off
LoRA            Low-rank adapter matrices on attention weights   Best general-purpose method
QLoRA           LoRA on 4-bit quantized model                    Single-GPU fine-tuning for large models
DoRA            Decomposes updates into magnitude + direction    Slightly better quality, marginal overhead
LoRA+           Different learning rates for A and B matrices    Better convergence
Prefix Tuning   Prepend learnable "soft" tokens to input         Effective for generation tasks
Prompt Tuning   Simpler prefix tuning with fewer tokens          Minimal compute, lower quality ceiling

QLoRA pushed this further. Dettmers et al. (NeurIPS 2023) combined LoRA with 4-bit quantization of the base model, introducing NF4 (4-bit NormalFloat) — a data type that is information-theoretically optimal for normally distributed weights. The result: fine-tuning a 65-billion-parameter model on a single 48 GB GPU. Their QLoRA-tuned model, Guanaco, reached 99.3% of ChatGPT’s performance after just 24 hours of training (Dettmers et al., NeurIPS 2023).
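
The memory side of that claim is easy to sanity-check. A rough back-of-envelope sketch — weights only, ignoring LoRA adapters, activations, and quantization constants, so real usage runs somewhat higher:

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory (GiB) to hold model weights at a given precision."""
    return n_params * bits_per_param / 8 / 2**30

# A 65B-parameter model: 16-bit weights vs 4-bit NF4 weights
print(round(weight_memory_gib(65e9, 16), 1))  # → 121.1
print(round(weight_memory_gib(65e9, 4), 1))   # → 30.3
```

At 16 bits the weights alone overflow any single GPU; at 4 bits they leave headroom on a 48 GB card for the LoRA adapters and activations — which is exactly the regime QLoRA targets.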

Cost democratization

Full fine-tuning (7B): ~$50,000 (H100 cluster)

QLoRA (7B): ~$1,500 (single RTX 4090)

07

When to Fine-Tune (and When Not To)

You now understand the full toolkit. The harder question is: should you fine-tune at all?

The answer depends on what you are trying to fix. There are three levers for customizing LLM behavior, and they form a natural escalation path:

Prompt engineering is your first move. It requires no training data, no GPUs, and no model modifications. If you can get the behavior you need by crafting better prompts — with few-shot examples, system instructions, or structured output formats — stop here.

Retrieval-Augmented Generation (RAG) is your second move. If the model lacks specific, current, or proprietary knowledge, don’t fine-tune it in — retrieve it at inference time. As we covered in RAG From the Ground Up, RAG keeps the knowledge layer separate from the model’s behavior layer.

Fine-tuning is your third move — and only when the first two are insufficient. Fine-tuning excels at changing behavior, not injecting knowledge. Use it when you need:

  • A consistent output format across thousands of requests (always return valid JSON with specific fields)
  • Domain-specific tone or persona (legal writing, medical communication, customer support)
  • Reliable policy adherence (always cite sources, never reveal system prompts, refuse specific categories)
  • Latency reduction (encode instructions into weights so you need shorter prompts)

When to fine-tune — decision flowchart

Can you get the behavior with better prompts?
  Yes → Prompt Engineering (low complexity)
  No ↓
Does the model need knowledge it doesn't have?
  Yes → RAG (medium complexity)
  No ↓
Do you need consistent formatting, persona, or policy adherence?
  Yes → Fine-Tune (high complexity)
  No ↓
Do you need step-by-step reasoning over verifiable answers?
  Yes → RL / GRPO (high complexity)
  No → fallback: Fine-Tune (high complexity)

The highest-performing systems combine all three: fine-tuning for behavior, RAG for knowledge, prompt engineering for per-request steering

A final caution. Fine-tuning is a one-time investment that changes the model permanently (or until you fine-tune again). Prompt engineering is per-request overhead that adds latency and cost. The trade-off is upfront engineering time versus ongoing inference cost. For high-volume production systems, fine-tuning a shorter prompt into the model’s weights can cut per-request token costs by 30–50% — paying for itself within weeks.

You have a trained, fine-tuned model. It follows instructions, prefers helpful responses, and maybe even reasons through multi-step problems. But deploying it to serve real users at scale introduces an entirely new set of challenges: memory management, latency budgets, throughput targets, and cost optimization.

In Article 6 — Inference and Serving, we will cover quantization, KV caching, continuous batching, speculative decoding, and the engineering that makes every API call possible.