Model distillation transfers knowledge from a large teacher model to a smaller student model by training the student to match the teacher's output distributions rather than hard labels. The student learns not just correct answers but the teacher's confidence levels and inter-class relationships, capturing richer information than standard training. This technique enables deploying high-quality models on resource-constrained devices, reducing inference costs by 10-100x while retaining 90-99% of teacher performance.
During distillation, the teacher model processes training examples and produces soft probability distributions over possible outputs (a temperature-scaled softmax over its logits). The student model trains on a combined loss: one term matches the teacher's soft outputs, where the temperature scaling amplifies informative signals in low-probability classes, and another term matches the ground-truth hard labels.
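The combined objective can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard soft-plus-hard distillation loss, not any particular library's API; the temperature `T` and mixing weight `alpha` are illustrative values, not recommendations from this article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combined distillation loss: soft-target KL term plus hard-label CE term.

    T and alpha are illustrative hyperparameters chosen for this sketch.
    """
    # Soft targets: temperature-scaled softmax of the teacher's logits.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between student and teacher distributions at temperature T.
    # The T**2 factor keeps the soft-loss gradients on the same scale as the hard loss.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T ** 2)
    # Standard cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```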
The temperature parameter controls how much the soft targets reveal about the teacher's learned structure. Higher temperatures produce softer distributions that expose more nuanced inter-class relationships, while lower temperatures approximate hard labels. Typical temperatures range from 2 to 20 depending on the task.
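A quick sketch makes the effect concrete. The logit values below are made up for illustration: imagine a teacher scoring classes like "dog", "wolf", and "car".

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, 1.0])  # hypothetical teacher logits

for T in (1.0, 4.0, 20.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T:>4}: {probs.round(decimals=3).tolist()}")

# T=1 concentrates nearly all mass on the top class, hiding the dog/wolf
# similarity; higher T spreads probability across classes, exposing the
# teacher's learned inter-class structure to the student.
```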
Advanced distillation techniques include intermediate layer matching (aligning student hidden states to teacher hidden states), attention transfer (matching attention distributions), and progressive distillation (multi-stage compression through increasingly smaller models). Task-specific distillation fine-tunes on domain data, while task-agnostic distillation preserves general capabilities.
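As a rough sketch of intermediate layer matching: when student and teacher hidden sizes differ, a learned projection can bridge the gap before computing an alignment loss. The class name and dimensions here are hypothetical, not from any specific framework.

```python
import torch
import torch.nn as nn

class HiddenStateMatcher(nn.Module):
    """Aligns student hidden states to teacher hidden states.

    A sketch of intermediate-layer matching; dimensions are illustrative
    (e.g., a 768-dim student distilled from a 4096-dim teacher).
    """

    def __init__(self, student_dim=768, teacher_dim=4096):
        super().__init__()
        # Linear projection bridges the dimensionality gap between models.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # MSE between projected student states and frozen teacher states;
        # detach() keeps gradients from flowing into the teacher.
        return nn.functional.mse_loss(
            self.proj(student_hidden), teacher_hidden.detach()
        )
```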
Distillation makes frontier-quality AI accessible at production scale. A 70B parameter model might cost $10 per million tokens to serve, while a distilled 7B student costs $0.50 — a 20x reduction enabling new use cases. For latency-sensitive applications like real-time coding assistants, distilled models respond in milliseconds where teachers require seconds.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.