Model distillation transfers knowledge from a large teacher model to a smaller student model by training the student to match the teacher's output distributions rather than hard labels. The student learns not just correct answers but the teacher's confidence levels and inter-class relationships, capturing richer information than standard training. This technique enables deploying high-quality models on resource-constrained devices, reducing inference costs by 10-100x while retaining 90-99% of teacher performance.
During distillation, the teacher model processes training examples and produces soft probability distributions over possible outputs (a temperature-scaled softmax over its logits). The student model trains on a combined loss: one term matches the teacher's soft outputs, where the temperature scaling amplifies informative signals in low-probability classes, and another term matches the ground-truth hard labels.
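The combined objective can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard soft-plus-hard distillation loss, not any particular library's API; the temperature `T` and mixing weight `alpha` are illustrative values, not recommendations from this article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combined distillation loss: soft-target KL term plus hard-label CE term.

    T and alpha are illustrative hyperparameters chosen for this sketch.
    """
    # Soft targets: temperature-scaled softmax of the teacher's logits.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between student and teacher distributions at temperature T.
    # The T**2 factor keeps the soft-loss gradients on the same scale as the hard loss.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T ** 2)
    # Standard cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```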
The temperature parameter controls how much the soft targets reveal about the teacher's learned structure. Higher temperatures produce softer distributions that expose more nuanced inter-class relationships, while lower temperatures approximate hard labels. Typical temperatures range from 2 to 20 depending on the task.
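A quick sketch makes the effect concrete. The logit values below are made up for illustration: imagine a teacher scoring classes like "dog", "wolf", and "car".

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, 1.0])  # hypothetical teacher logits

for T in (1.0, 4.0, 20.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T:>4}: {probs.round(decimals=3).tolist()}")

# T=1 concentrates nearly all mass on the top class, hiding the dog/wolf
# similarity; higher T spreads probability across classes, exposing the
# teacher's learned inter-class structure to the student.
```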
Advanced distillation techniques include intermediate layer matching (aligning student hidden states to teacher hidden states), attention transfer (matching attention distributions), and progressive distillation (multi-stage compression through increasingly smaller models). Task-specific distillation fine-tunes on domain data, while task-agnostic distillation preserves general capabilities.
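As a rough sketch of intermediate layer matching: when student and teacher hidden sizes differ, a learned projection can bridge the gap before computing an alignment loss. The class name and dimensions here are hypothetical, not from any specific framework.

```python
import torch
import torch.nn as nn

class HiddenStateMatcher(nn.Module):
    """Aligns student hidden states to teacher hidden states.

    A sketch of intermediate-layer matching; dimensions are illustrative
    (e.g., a 768-dim student distilled from a 4096-dim teacher).
    """

    def __init__(self, student_dim=768, teacher_dim=4096):
        super().__init__()
        # Linear projection bridges the dimensionality gap between models.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # MSE between projected student states and frozen teacher states;
        # detach() keeps gradients from flowing into the teacher.
        return nn.functional.mse_loss(
            self.proj(student_hidden), teacher_hidden.detach()
        )
```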
Distillation makes frontier-quality AI accessible at production scale. A 70B parameter model might cost $10 per million tokens to serve, while a distilled 7B student costs $0.50 — a 20x reduction enabling new use cases. For latency-sensitive applications like real-time coding assistants, distilled models respond in milliseconds where teachers require seconds.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.