LLM Distillation Explained
How Knowledge Distillation Transfers Reasoning Skills in Language Models
Language model (LM) distillation is a method for transferring knowledge from one model to another. In LMs, this usually means transferring reasoning skills from large models to smaller ones. The larger models are called teachers, while the smaller models are called students.
The main goal is to lower computational costs while keeping strong performance on complex inference tasks. With knowledge distillation, model developers can cut parameter counts and memory requirements while losing as little logical coherence and factual accuracy as possible.
However, distillation still faces challenges around efficient knowledge transfer, avoiding reasoning shortcuts, and balancing inference latency trade-offs.
This article will dive deep into:
Traditional knowledge distillation basics and teacher-student model paradigm
LLM-specific distillation techniques, including TAID and temperature scaling
Comparison between conventional and LLM distillation approaches
A step-by-step guide using the NVIDIA NeMo framework
Advanced features like chain-of-thought (CoT) reasoning and reinforcement learning
Let’s start.
1. Foundations of LM Distillation
In this section, I will explain traditional knowledge distillation and how it differs from LLM knowledge distillation.
1.1 Knowledge Distillation
Let’s assume a teacher with extensive knowledge and a bright student eager to learn. The teacher knows complex subjects well and wants to share this knowledge without overwhelming the student. The central concept of knowledge distillation in LMs is the same: transfer the abilities of a “teacher” model to a “student” model.

The key components of this process are:
Teacher model: It generates “soft” probability distributions over its output vocabulary. During the process, it uses a temperature-scaled softmax function, allowing the teacher to express confidence in different possible outputs.
Student model: It learns from the teacher’s soft probabilities and the actual “hard” labels. The model focuses on balancing imitation and correctness.
Distillation loss function: Combines cross-entropy loss (encouraging correct predictions) and KL divergence (penalizing deviation from the teacher’s probabilities). The loss is defined as
\(L = \alpha L_{\text{CE}} + (1 - \alpha) L_{\text{KL}}(p_s \parallel p_t) \)
Where `α` controls the balance between the terms L_CE and L_KL.
Remember, L_CE is the cross-entropy loss: it measures the difference between the student model’s predictions and the ground truth. L_KL is the divergence between the student model’s probability distribution p_s and the teacher model’s probability distribution p_t.
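To make this loss concrete, here is a minimal PyTorch sketch of the combined objective, using the temperature-scaled softmax mentioned above. The tensor shapes, the default values of alpha and T, and the helper name distillation_loss are illustrative assumptions, not a specific library API.
import torch
import torch.nn.functional as F
def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label term: standard cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence pushing the student's temperature-scaled
    # distribution toward the teacher's. The T**2 factor is the common convention
    # that keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1 - alpha) * kl
# Toy usage: a batch of 4 examples over a 10-token vocabulary.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))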
Techniques that build on this basic recipe include:
Smoothed knowledge distillation enhances this method by softening the teacher’s probability outputs, which reduces hallucinations and improves factual consistency. This is especially important for question answering and fact-based dialogue.
Task-aware intermediate distillation (TAID) interpolates between teacher and student representations during training. This helps stop mode collapse and encourages effective transfer.
This is how we perform knowledge distillation in traditional models. Now, let’s understand what distillation in LLM is.
1.2 LLM distillation
Knowledge distillation in the context of LLM takes on fascinating new dimensions. Traditional distillation focuses on classification tasks. In contrast, LLM distillation aims to keep reasoning skills across different contexts. This requires sophisticated approaches that go beyond simple teacher-student knowledge transfer.
The TAID framework is at the core of modern LLM distillation. It uses temperature scaling to prevent issues like mode collapse, where student models gravitate toward oversimplified patterns.
TAID interpolates between teacher and student predictions. This preserves the teacher's reasoning. At the same time, it helps the student create efficient representations.
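A rough sketch of the interpolation idea is shown below; it illustrates the general principle rather than the exact TAID algorithm. The distillation target is a mixture of the teacher’s distribution and the student’s own detached distribution, with the mixing weight ramped from 0 toward 1 over training. The linear schedule and the function name interpolated_kd_loss are assumptions.
import torch
import torch.nn.functional as F
def interpolated_kd_loss(student_logits, teacher_logits, step, total_steps, T=1.0):
    # Mixing weight grows from 0 to 1 over training (illustrative linear schedule).
    lam = min(step / total_steps, 1.0)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Detach the student's own distribution so it serves as a fixed anchor.
    p_student_anchor = F.softmax(student_logits.detach() / T, dim=-1)
    # Intermediate target: close to the student early on, close to the teacher at the end.
    p_target = (1 - lam) * p_student_anchor + lam * p_teacher
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_target, reduction="batchmean")
# Toy usage: the target drifts toward the teacher as training progresses.
s, t = torch.randn(2, 8), torch.randn(2, 8)
for step in (0, 500, 1000):
    print(step, interpolated_kd_loss(s, t, step, total_steps=1000).item())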

Temperature scaling plays a crucial role in this process.
The temperature parameter T is embedded in the softmax that produces the teacher’s and student’s probability distributions: \(p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\), where the z_i are the model’s logits.
When T > 1, the softmax distribution of teacher outputs becomes smoother. This reveals subtle relationships between different reasoning paths that might be obscured in sharper distributions, which is particularly important for preserving multi-step reasoning capabilities, where each step builds upon previous insights. Think of it as teaching a student not just the “what” but the “how” of problem-solving.
The benefits of this approach are:
A 37% reduction in hallucination rates through smoothed knowledge transfer
Preserved reasoning capabilities with reduced computational costs
Enhanced generalization across diverse problem domains

For practitioners implementing LLM distillation, temperature tuning becomes a critical skill.
Setting T < 1 creates sharp probability distributions that can make student models overconfident in their predictions. Conversely, T > 1 produces softer distributions that better capture the nuanced relationships between different reasoning paths.
In other words, lowering the temperature narrows the distribution the student learns from, while raising it widens the range of alternatives the student gets to see.
This is especially important when distilling models for tasks requiring multi-step logical inference or complex problem decomposition.
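A quick way to see the effect of the temperature is to apply different values of T to the same logits; the logit values below are made up purely for illustration.
import torch
import torch.nn.functional as F
logits = torch.tensor([4.0, 2.0, 1.0, 0.5])  # made-up teacher logits for one token position
for T in (0.5, 1.0, 2.0, 4.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# Low T concentrates probability mass on the top candidate (sharp, overconfident targets);
# higher T spreads mass across alternatives, exposing secondary reasoning paths.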
The loss function balances these competing objectives through the same α-weighted combination of L_CE and L_KL introduced in Section 1.1.
The optimal value of α typically varies between 0.3 and 0.7, depending on the specific task and model architectures involved.
1.3 Comparison table
Below, I have created a comparison table between traditional knowledge distillation and LLM knowledge distillation.
2. Empowering reasoning
In late 2024 and early 2025, two primary techniques have pushed the development of reasoning LLMs: chain-of-thought prompting and reinforcement learning. In this section, we will discuss these techniques in the context of model distillation.
2.1 Chain-of-Thought Methods
Imagine solving a complex math problem without breaking it into steps. This is the same challenge language models face without Chain-of-Thought (CoT) reasoning. Just as humans benefit from showing their work, LLMs achieve significantly better results when they articulate their reasoning process step by step. The evolution of CoT methods reveals a fascinating progression in how we enable machines to think more systematically.
Zero-shot CoT represents the most basic form, where models are simply prompted to explain their thinking without examples.
Despite its simplicity, this approach yields impressive results, boosting accuracy on the challenging GSM8K mathematics benchmark from 10.4% to 40.7%. This improvement comes from encouraging the model to decompose problems into manageable steps, like a student learning to show their work.

Few-Shot CoT furthers this concept by providing carefully crafted examples demonstrating effective reasoning patterns. When models see how similar problems can be broken down and solved methodically, they learn to apply these patterns to new challenges. The impact is substantial—a 22% improvement on the MATH dataset, which covers a wide range of mathematical problems from basic arithmetic to advanced calculus.
Auto-CoT represents the cutting edge of reasoning enhancement, using sophisticated clustering techniques to select the most relevant examples for any given problem automatically. This dynamic approach improves QA accuracy by 9% while reducing the manual effort needed to create effective prompts. Think of it as an intelligent tutor who knows which examples will best help a student grasp a new concept.
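To make the distinction concrete, here is what zero-shot and few-shot CoT prompts might look like in code; the questions and the worked example are invented for illustration and are not taken from any benchmark.
question = "A bakery sold 14 cakes in the morning and twice as many in the afternoon. How many cakes did it sell in total?"
# Zero-shot CoT: no examples, just an instruction to reason step by step.
zero_shot_prompt = f"Q: {question}\nA: Let's think step by step."
# Few-shot CoT: prepend worked examples that demonstrate the reasoning pattern.
few_shot_prompt = (
    "Q: Tom has 3 boxes with 5 pens each. How many pens does he have?\n"
    "A: Each box has 5 pens and there are 3 boxes, so 3 * 5 = 15. The answer is 15.\n\n"
    f"Q: {question}\nA:"
)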
2.2 Symbolic Chain of Thought Distillation
CoT is a useful tool, but how can we apply it to distill knowledge? The principle remains the same: teach the student model to imitate the reasoning process.
The authors in the paper titled “Symbolic Chain-of-Thought Distillation: small models can also “think” step-by-step” presented a method that enables smaller language models to learn step-by-step reasoning capabilities from larger models.

The authors propose a technique where a smaller student model is trained on rationalization samples from a much larger teacher model. The training allowed the smaller model to develop CoT reasoning abilities. These abilities were previously only seen in models with >50B parameters.
2.2.1 How it works
The process works through several key steps:
Initial setup
Teacher Model: Large language model (e.g., GPT-3 175B)
Student Model: Smaller model (e.g., OPT 125M-1.3B)
Training data: a set of unlabeled input instances \(D_{\text{train}} = \{x_i\}\)
Sampling process
For each input \(x_i\) in \(D_{\text{train}}\):
Sample N chain-of-thought rationales \(\tilde{z}_i\) with predictions \(\tilde{y}_i\) from the teacher:
\((\tilde{y}_i^k, \tilde{z}_i^k) \sim \mathcal{N}_T(y_i, z_i \mid x_i, P)\)
where P is the prompt set of example inputs x, labels y, and CoT rationales z: \(P = \{(x_j, y_j, z_j)\}\)
Typically N = 30 samples are drawn per instance.
Training process
Create the corpus:
\(C = \left\{ \left(x_i, \left\{ (\tilde{y}_i^k, \tilde{z}_i^k) \right\}_{k=1}^{N} \right) \right\}\)
Train the student with a language-modeling objective over this corpus:
\(\mathbb{E}_{(x, \tilde{y}, \tilde{z}) \sim C} \left[ \log S(\tilde{y}, \tilde{z} \mid x) \right]\)
Evaluation options
Greedy decoding:
\(\tilde{z}_{\text{test}}, \tilde{y}_{\text{test}} = \arg\max_{z,y} S(z, y \mid x_{\text{test}})\)
Self-consistency (a majority vote over sampled reasoning chains):
\(\tilde{y}_{\text{test}} = \arg\max_{y} \mathbb{E}_{z \sim S(z \mid x_{\text{test}})} \left[ S(y \mid z, x_{\text{test}}) \right]\)
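Putting these steps together, here is a minimal sketch of the loop using the Hugging Face transformers API with small, publicly available stand-in models. The model names (gpt2-large as teacher, gpt2 as student), the prompt contents, N = 4, and the hyperparameters are assumptions for illustration only; the paper samples from a much larger teacher such as GPT-3.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in teacher and student (the paper uses GPT-3 175B and OPT 125M-1.3B).
teacher_tok = AutoTokenizer.from_pretrained("gpt2-large")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").to(device).eval()
student_tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
# P: a small prompt set of (x, z, y) demonstrations; contents are illustrative.
cot_prompt = (
    "Q: If there are 3 cars and each car has 4 wheels, how many wheels are there?\n"
    "Reasoning: Each car has 4 wheels and there are 3 cars, so 3 * 4 = 12.\n"
    "Answer: 12\n\n"
)
def sample_rationales(x, n=4, max_new_tokens=64):
    # Sample n chain-of-thought continuations (z~, y~) from the teacher for input x.
    inputs = teacher_tok(cot_prompt + f"Q: {x}\nReasoning:", return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = teacher.generate(
            **inputs,
            do_sample=True,
            temperature=1.0,
            num_return_sequences=n,
            max_new_tokens=max_new_tokens,
            pad_token_id=teacher_tok.eos_token_id,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [teacher_tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
# Build the corpus C = {(x_i, {(y~_i^k, z~_i^k)})} and train the student with an LM loss.
d_train = ["If a train travels 60 miles per hour for 2 hours, how far does it go?"]
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for x in d_train:
    for rationale in sample_rationales(x):
        text = f"Q: {x}\nReasoning:{rationale}"
        batch = student_tok(text, return_tensors="pt", truncation=True).to(device)
        loss = student(**batch, labels=batch["input_ids"]).loss  # standard causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
At evaluation time, the distilled student can then answer with greedy decoding or with self-consistency (sampling several chains and taking a majority vote), as in the equations above.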
2.2.2 Performance Metrics
Default performance comparison
Training data impact
Key achievements:
77% latency reduction (23ms vs 100ms baseline)
90% parameter reduction while maintaining reasoning capability
Successful transfer to unseen tasks (79.6% on SST-2)
These results demonstrate that SCoTD successfully enables smaller models to perform complex reasoning tasks previously only possible with much larger models.
2.3 RL-enhanced distillation
RL-enhanced distillation extends traditional knowledge distillation by incorporating RL signals. The signals are used to guide student model training. The teacher model provides not just output probabilities but also rewards that help shape the student's behavior. This approach enables smaller models to develop sophisticated reasoning capabilities previously only seen in much larger architectures.
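DeepSeek’s actual recipe is considerably more involved (large-scale policy optimization plus supervised fine-tuning on curated reasoning traces), so treat the snippet below as a toy illustration of the general idea only: a soft-label distillation term combined with a reward-weighted likelihood term. The function name, the reward definition, and the weighting are all assumptions.
import torch
import torch.nn.functional as F
def rl_enhanced_distill_loss(student_logits, teacher_logits, sampled_ids, reward,
                             alpha=0.5, T=2.0):
    # student_logits, teacher_logits: (seq_len, vocab) logits for one sampled response.
    # sampled_ids: (seq_len,) token ids of that response.
    # reward: scalar score of the response (e.g., 1.0 if the final answer is correct).
    # Distillation term: match the teacher's temperature-scaled token distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # REINFORCE-style term: raise the log-likelihood of high-reward responses.
    log_probs = F.log_softmax(student_logits, dim=-1)
    seq_log_prob = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1).sum()
    rl = -reward * seq_log_prob / sampled_ids.numel()
    return alpha * kd + (1 - alpha) * rl
# Toy usage with random tensors: a 16-token response over a 50-token vocabulary.
s, t = torch.randn(16, 50), torch.randn(16, 50)
ids = torch.randint(0, 50, (16,))
print(rl_enhanced_distill_loss(s, t, ids, reward=1.0))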
DeepSeek's implementation
DeepSeek demonstrated two key approaches:
Direct RL distillation through DeepSeek-R1-Zero, achieving 71.0% on AIME 2024 without supervised fine-tuning
Hybrid approach with DeepSeek-R1, combining cold-start data with iterative RL fine-tuning, reaching 79.8% on AIME 2024
Performance comparison
3. Benefits of knowledge distillation in language models
Let’s discuss the benefits and limitations of knowledge distillation.
3.1 Benefits of knowledge distillation
Here are some benefits of KD:
Computational efficiency
Model compression achieves 90% parameter reduction while preserving core reasoning capabilities
Inference latency drops dramatically (23ms/token vs 100ms baseline)
Significant reduction in storage requirements and energy consumption during deployment
Performance improvements
Smoothed knowledge distillation reduces hallucination rates by 37%
Task-aware intermediate distillation (TAID) prevents mode collapse through adaptive interpolation
Enhanced generalization across diverse problem domains
Practical applications
Real-time processing enables deployment on edge devices and mobile platforms
Broader accessibility through reduced infrastructure requirements
Cost-effective scaling for production environments
3.2 Limitations and Challenges
Now, let’s discuss the limitations and challenges of KD.
Technical constraints:
Performance gap remains in highly complex reasoning tasks compared to larger models
Training process requires significant expertise in temperature tuning and loss function balancing
Optimal distillation parameters vary by task, making standardization difficult
Implementation challenges:
Initial setup costs for teacher model training and data preparation can be substantial
Real-time monitoring and quality assurance require specialized tooling
Model updates need careful validation to maintain performance across all use cases
Business considerations:
Not all applications benefit equally from distillation—some tasks still require full-scale models
Resource requirements for initial training may offset short-term cost benefits
Team expertise needs may increase during the implementation and maintenance phases
4. Implementing knowledge distillation in LM
In this section, we will discuss some of the frameworks for KD as well as walk through Nvidia’s implementation of KD in LLM.
4.1 Frameworks for knowledge distillation in LLMs
Leading frameworks for implementing knowledge distillation in language models offer robust capabilities for model compression and performance optimization:
Hugging Face Transformers: ships reference distillation recipes (such as the Distiller class used to train DistilBERT) for streamlined knowledge transfer between teacher and student models, on top of the library’s general training and optimization tooling (a minimal example follows after this list).
NVIDIA NeMo: an end-to-end framework for building, customizing, and deploying generative AI models. Besides model distillation, it also supports pruning, so both techniques can be combined in a single compression pipeline.
TensorFlow Model Optimization: Offers comprehensive tools for model pruning, quantization, and distillation, ideal for production deployments.
PyTorch: as a general-purpose deep learning framework, it provides the building blocks (loss functions, hooks, and custom training loops) for managing the distillation process and optimizing model efficiency.
DeepSpeed: Microsoft’s optimization library includes advanced features for model distillation, particularly suited for large-scale deployments.
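As a minimal example of how such frameworks are typically wired up, the sketch below subclasses the Hugging Face Trainer to add a soft-label term to the usual loss. The alpha and temperature values are illustrative assumptions, and this is a common user-written pattern, not a built-in distillation API.
import torch
import torch.nn.functional as F
from transformers import Trainer
class DistillationTrainer(Trainer):
    def __init__(self, teacher_model=None, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(**kwargs)
        # The teacher is assumed to already sit on the same device as the inputs.
        self.teacher = teacher_model.eval()
        self.alpha = alpha
        self.temperature = temperature
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Student forward pass; assumes the batch contains `labels`,
        # so outputs.loss is the hard-label cross-entropy term.
        outputs = model(**inputs)
        with torch.no_grad():
            teacher_logits = self.teacher(**inputs).logits
        T = self.temperature
        kd = F.kl_div(
            F.log_softmax(outputs.logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
        loss = self.alpha * outputs.loss + (1 - self.alpha) * kd
        return (loss, outputs) if return_outputs else loss
# Usage (sketch): DistillationTrainer(model=student, teacher_model=teacher,
#     args=TrainingArguments(...), train_dataset=..., data_collator=...)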
4.2 How to implement KD for LLM
In this section, I will show you how to implement KD using the NVIDIA NeMo framework. The team from NVIDIA has already implemented the tutorial; I am just using the repo to guide you and show you how simple it is to implement KD.
You can find the full tutorial here.
NeMo installation:
Follow the installation guide in the repo here, then use the following commands to create the environment and install NeMo:
conda create --name nemo python==3.10.12
conda activate nemo
Data Preparation:
Curate a representative dataset that covers target tasks like the WikiText-103-v1 dataset.
Implement data augmentation for improved generalization
Ensure proper validation split for monitoring distillation quality
How to prepare the dataset | Source: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework
import json
import os
from datasets import load_dataset
# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")
# Define the destination folder
data_folder = 'wikitext-data'
os.makedirs(data_folder, exist_ok=True)
# Define file paths and destination paths
file_paths = {
'train': os.path.join(data_folder, 'wikitext-train.jsonl'),
'validation': os.path.join(data_folder, 'wikitext-val.jsonl'),
'test': os.path.join(data_folder, 'wikitext-test.jsonl')
}
# Function to save dataset split to a JSONL file
def save_to_jsonl(file_path, data):
    with open(file_path, 'w') as file:
        for item in data:
            file.write(json.dumps(item) + '\n')
# Define splits
splits = ["train", "validation", "test"]
# Save splits to JSONL files and calculate their sizes
for split in splits:
    if split in dataset:
        save_to_jsonl(file_paths[split], dataset[split])
    else:
        print(f"Split {split} not found in the dataset.")
Teacher Model Selection and fine-tuning:
Choose a well-performing pre-trained model like the Meta-Llama-3.1-8B.
Fine-tune the model on the prepared dataset
%%bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Set path(s) if different:
MODEL="/workspace/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo"
# Can change these to accommodate resources:
TENSOR_PARALLEL_SIZE=8
NODES=1
MICRO_BATCH_SIZE=4
# Don't change the following:
EXPERIMENT_DIR="distill_trainings"
EXPERIMENT_NAME="megatron_llama_ft"
DATA_TRAIN='wikitext_tokenized_train_text_document'
DATA_VAL='wikitext_tokenized_test_text_document'
DATA_TEST='wikitext_tokenized_val_text_document'
STEPS=30
GLOBAL_BATCH_SIZE=128
LOG_INTERVAL=1
VAL_INTERVAL=10
NUM_VAL_BATCHES=5
LR=1e-4
MIN_LR=1e-5
WARMUP_STEPS=2
cmd="torchrun --nproc-per-node=${TENSOR_PARALLEL_SIZE}"
${cmd} /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path /opt/NeMo/examples/nlp/language_modeling/conf/ \
--config-name megatron_llama_distill.yaml \
\
name=${EXPERIMENT_NAME} \
\
exp_manager.exp_dir=${EXPERIMENT_DIR} \
exp_manager.checkpoint_callback_params.save_top_k=1 \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
\
trainer.max_steps=${STEPS} \
trainer.log_every_n_steps=${LOG_INTERVAL} \
Bash command to fine-tune the teacher model (truncated here; see the full tutorial for the remaining flags) | Source: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework
Model distillation:
Initialize student model architecture
Configure hyperparameters (learning rate, batch size)
%%bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Can change these to accommodate resources:
TENSOR_PARALLEL_SIZE=8
NODES=1
MICRO_BATCH_SIZE=4
# Don't change the following:
EXPERIMENT_DIR="distill_trainings"
EXPERIMENT_NAME="megatron_llama_distill_depth_pruned_student"
TEACHER="${EXPERIMENT_DIR}/megatron_llama_ft/checkpoints/megatron_llama_ft.nemo"
STUDENT="/workspace/4b_depth_pruned_model.nemo"
FINAL_MODEL_PATH="${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/checkpoints/depth_pruned_distilled_4b_model.nemo"
DATA_TRAIN='wikitext_tokenized_train_text_document'
DATA_VAL='wikitext_tokenized_test_text_document'
DATA_TEST='wikitext_tokenized_val_text_document'
STEPS=30
GLOBAL_BATCH_SIZE=128
LOG_INTERVAL=1
VAL_INTERVAL=10
NUM_VAL_BATCHES=5
LR=1e-4
MIN_LR=1e-5
WARMUP_STEPS=2
cmd="torchrun --nproc-per-node=${TENSOR_PARALLEL_SIZE}"
${cmd} /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_distillation.py \
name=${EXPERIMENT_NAME} \
\
exp_manager.exp_dir=${EXPERIMENT_DIR} \
exp_manager.checkpoint_callback_params.save_top_k=1 \
\
trainer.max_steps=${STEPS} \
trainer.log_every_n_steps=${LOG_INTERVAL} \
trainer.val_check_interval=${VAL_INTERVAL} \
trainer.limit_val_batches=${NUM_VAL_BATCHES} \
+trainer.num_sanity_val_steps=0 \
\
trainer.precision=bf16 \
trainer.devices=${TENSOR_PARALLEL_SIZE} \
trainer.num_nodes=${NODES} \
\
"model.data.data_prefix={train:[1.0,$DATA_TRAIN],validation:[$DATA_VAL],test:[$DATA_TEST]}" \
\
model.restore_from_path=${STUDENT} \
model.kd_teacher_restore_from_path=${TEACHER} \
model.nemo_path=${FINAL_MODEL_PATH} \
\
model.tensor_model_parallel_size=${TENSOR_PARALLEL_SIZE} \
model.sequence_parallel=True \
model.micro_batch_size=${MICRO_BATCH_SIZE} \
model.global_batch_size=${GLOBAL_BATCH_SIZE} \
\
model.optim.name=distributed_fused_adam \
model.optim.lr=${LR} \
model.optim.sched.min_lr=${MIN_LR} \
model.optim.sched.warmup_steps=${WARMUP_STEPS}
Bash command to train the student model | Source: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework
Evaluation and Optimization:
Monitor accuracy metrics
Measure inference speed improvements (a simple timing sketch follows below)
%load_ext tensorboard
%tensorboard --logdir "distill_trainings/megatron_llama_distill/" --port=6007
TensorBoard commands to visualize the model’s performance | Source: LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework
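To complement the accuracy curves, you can sanity-check latency gains on your own hardware with a quick milliseconds-per-token comparison. The models below (gpt2 vs. distilgpt2) are just convenient public stand-ins for a teacher/student pair, and the prompt and token count are arbitrary.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def ms_per_token(model_name, prompt="Knowledge distillation is", new_tokens=64):
    # Greedy-generate a fixed number of tokens and report average milliseconds per token.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    inputs = tok(prompt, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False,
                       pad_token_id=tok.eos_token_id)
    return (time.perf_counter() - start) / new_tokens * 1000
for name in ("gpt2", "distilgpt2"):  # stand-ins for a teacher/student pair
    print(f"{name}: {ms_per_token(name):.1f} ms/token")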
Real-world applications
Here are some business applications for PMs, AI engineers, and startup folks.
For Product Managers:
Chatbots and virtual assistants that deliver enterprise-grade performance at consumer-scale costs
Real-time NLP tools for customer service with 77% lower latency
Mobile-first AI applications previously constrained by model size.
For AI Engineers:
Efficient deployment of reasoning capabilities across edge devices and cloud infrastructure
Streamlined model updates and maintenance through reduced computational requirements
Integration flexibility with existing tech stacks due to smaller model footprints
For Startup leadership:
Faster go-to-market with reduced infrastructure investment
Competitive advantage through advanced AI capabilities at lower operational costs
Scalable solution that grows efficiently with user demand
Conclusion
Knowledge distillation represents a transformative approach to making large language models more accessible and deployable across diverse environments. This comprehensive exploration demonstrates how organizations can achieve up to 90% parameter reduction while maintaining core model capabilities, revolutionizing the practical implementation of AI systems.
Key section learnings
Foundations: Knowledge distillation leverages temperature-scaled softmax and specialized loss functions to transfer knowledge effectively between teacher and student models
Implementation: Modern frameworks like Hugging Face and NVIDIA NeMo provide robust tooling for distillation, with clear pathways for deployment
Performance: Reported results show dramatic improvements (77% latency reduction, 37% fewer hallucinations), and success stories like DeepSeek’s distilled reasoning models demonstrate that capabilities can be maintained
Applications: Real-world implementations demonstrate effectiveness across chatbots, edge computing, and enterprise systems
Stakeholder opportunities
Product Managers can leverage distilled models for cost-effective, real-time applications
Engineers benefit from simplified deployment and maintenance processes
Leadership teams can accelerate AI adoption while managing resource constraints
Future considerations
As we advance in AI deployment, a crucial question emerges: How will knowledge distillation evolve to balance the increasing capabilities of foundation models with the practical constraints of real-world applications? This balance between power and practicality will likely shape the next generation of AI implementations.
References
Compressing deep graph convolution network with multi-staged knowledge distillation
Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step
LLM Distillation Explained: Applications, Implementation & More
LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework
Step-By-Step Guide to Effective LLM Distillation for Scalable AI
Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation