Open-Source Toolkits
Read the DistillKit v0.1 Technical Paper by Arcee AI: our new open-source tool that's set to change how we create and distribute Small Language Models (SLMs).
Arcee AI's mission is to enable users and businesses to train cost-effective, secure, performant, and domain-specific Small Language Models (SLMs). In pursuit of this mission, we’re excited to announce the release of DistillKit.
Introducing DistillKit
DistillKit is an open-source research effort in model distillation by Arcee AI. Our goal is to provide the community with easy-to-use tools for researching, exploring, and enhancing the adoption of open-source Large Language Model (LLM) distillation methods. This release focuses on practical, effective techniques for improving model performance and efficiency.
Teacher-Student Training: DistillKit's Model Distillation Methods
DistillKit supports two primary distillation methods:
Logit-based Distillation: the student is trained to match the teacher's output distribution, typically via a temperature-scaled KL-divergence term combined with the standard cross-entropy loss, so it learns not only the teacher's predictions but also its confidence across the vocabulary.
Hidden States-based Distillation: the student is trained to align its intermediate-layer representations with those of the teacher, encouraging richer feature transfer and allowing distillation across different architectures.
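For intuition, here is a minimal sketch of the two loss formulations in plain PyTorch. This is not DistillKit's actual implementation; the temperature, the blending weight, and the optional projection layer are illustrative assumptions.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)

def hidden_state_distillation_loss(student_hidden, teacher_hidden, projection=None):
    """Mean-squared error between student hidden states and (optionally projected) teacher hidden states."""
    if projection is not None:
        # A learned linear layer can map the teacher's hidden width to the student's width.
        teacher_hidden = projection(teacher_hidden)
    return F.mse_loss(student_hidden, teacher_hidden)

def distillation_objective(ce_loss, distill_loss, alpha=0.5):
    """Blend the usual next-token cross-entropy with the distillation term."""
    return alpha * ce_loss + (1.0 - alpha) * distill_loss
```

In practice the distillation term is added on top of the regular SFT objective, so the student still learns from the ground-truth labels while being nudged toward the teacher's behavior.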
DistillKit Initial Release and Features: Supervised Fine-Tuning and Distillation Experiments
The initial release incorporates Supervised Fine-Tuning (SFT) into the distillation routines. All vanilla models were trained using the standard SFTTrainer from Hugging Face's TRL library. The models used for comparison were trained with the same hyperparameters as the vanilla runs, but were distilled from Arcee-Spark.
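For reference, a minimal sketch of such a vanilla SFT baseline with TRL's SFTTrainer is shown below. It assumes a recent TRL version; the dataset handle is hypothetical, and the hyperparameters are illustrative rather than the exact settings used in our runs.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumes a chat-style dataset with a "messages" column that TRL can apply a chat template to.
dataset = load_dataset("my-org/my-sft-dataset", split="train")  # hypothetical dataset handle

trainer = SFTTrainer(
    model="Qwen/Qwen2-1.5B",            # illustrative student checkpoint
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen2-1.5b-sft",    # illustrative hyperparameters
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
)
trainer.train()
```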
We also have internal versions that use Continued Pre-Training (CPT) and Direct Preference Optimization (DPO) in the distillation routines; these will be released after thorough evaluation.
Distillation Experiments and Results
We are also releasing case studies on DistillKit alongside these release notes. These case studies focus on the following key areas:
Comparison of logit-based and hidden states-based distillation against vanilla SFT.
Effectiveness of logit-based distillation in a general domain.
Distillation into instruct-tuned versus base student models.
Effectiveness of domain-specific distillation for function calling.
Further details on the experimental setups, training pipeline, and evaluation benchmarks are provided in the results section.
Performance and Memory Requirements
While the implementation of DistillKit is relatively straightforward, the memory requirements for distillation are higher than for standard SFT, since the teacher model must also be held in memory to produce the logits or hidden states the student learns from. We are actively working on scaling DistillKit to support models larger than 70B parameters, which will involve advanced techniques and efficiency improvements beyond the scope of this release.
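As a rough illustration of where the extra memory goes, and one common mitigation: the teacher can be loaded in half precision, frozen, and run under torch.no_grad(), so that only the student carries gradients and optimizer state. The model ids below are illustrative, and this is a sketch rather than DistillKit's own loading code.

```python
import torch
from transformers import AutoModelForCausalLM

# The student keeps full training state (gradients + optimizer); the teacher is inference-only.
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16)
teacher = AutoModelForCausalLM.from_pretrained("arcee-ai/Arcee-Spark", torch_dtype=torch.bfloat16)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

def teacher_forward(batch):
    # no_grad avoids storing activations for backprop, substantially reducing the teacher's footprint.
    with torch.no_grad():
        return teacher(**batch).logits
```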
Model Evaluations
For evaluation purposes, all models were based on Qwen2 and assessed using multiple benchmarks from Hugging Face's Open LLM Leaderboard. However, our results (obtained via lm-evaluation-harness) consistently differed from the leaderboard's, often showing much higher scores, so our evaluations should be viewed with that context in mind. The results should not be compared against the leaderboard, and are not intended to be; real-world differences in model performance are likely to be smaller than those seen here. The primary goal was to compare the distillation techniques in DistillKit against one another through ablation studies. For domain-specific evaluation, we used overall accuracy across the Berkeley Function Calling benchmarks to assess a model's ability to call functions (i.e., use tools).
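As an example of how such numbers can be reproduced locally, here is a minimal sketch using lm-evaluation-harness's Python API. The checkpoint path is hypothetical, and the task names are illustrative; both should be checked against the installed harness version.

```python
import lm_eval

# Evaluate a local checkpoint on a few Open LLM Leaderboard-style tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./qwen2-1.5b-distilled,dtype=bfloat16",  # hypothetical path
    tasks=["mmlu_pro", "bbh", "gpqa", "musr"],                      # illustrative task names
    batch_size=8,
)
print(results["results"])
```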
Experiment Results
In this section, we present the evaluation results of our case studies. These evaluations demonstrate the potential efficiency and accuracy gains of each distillation method, offering valuable insights into their impact on model performance and suggesting promising avenues for future optimizations.
Experiment 1 - Comparison of Distillation Techniques with Vanilla SFT
In this experiment, we look at the performance of different models, highlighting the power of distillation. Using Arcee-Spark as the teacher model, we distilled it into Qwen2-1.5B-Base, training on a carefully curated 200k-example subset of Teknium's OpenHermes-2.5 dataset.
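For context, a subset of this size can be drawn with the Hugging Face datasets library roughly as follows; the seed and the random-sampling strategy are illustrative rather than the exact curation we applied.

```python
from datasets import load_dataset

# Shuffle and keep 200k examples; the actual curation involved additional filtering.
dataset = load_dataset("teknium/OpenHermes-2.5", split="train")
subset = dataset.shuffle(seed=42).select(range(200_000))
```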
Our analysis focused on three key models:
A vanilla SFT model: Qwen2-1.5B-Base fine-tuned on the OpenHermes subset with standard SFT only.
Hermes-Distilled: the same setup, distilled from Arcee-Spark using logit-based distillation.
Hermes-Hidden-States: the same setup, distilled from Arcee-Spark using hidden states-based distillation.
Both Hermes-Distilled and Hermes-Hidden-States outperformed the SFT-only variant across major benchmarks such as BBH, MUSR, and MMLU-PRO. Performance on GPQA was also impressive, with comparable results across all three models. Notably, the logit-based approach delivered superior results overall, beating the hidden-state-based method on most benchmarks.
Experiment 2 - Effectiveness of Logit-based Distillation in a General Domain
Here we evaluate a 1.5B Distilled model (trained on a 200k subset of WebInstruct-Sub), the teacher model Arcee-Spark, and the baseline Qwen2-1.5B-Instruct model. The distilled model showed a performance improvement over vanilla Qwen2-1.5B-Instruct across all metrics. Additionally, the distilled model's performance was comparable to the teacher model, particularly on MUSR and GPQA benchmarks.
Experiment 3 - Distillation on Instruct versus Base Student Models
The 1.5B-Instruct-Distilled (logit-based) model showed marked performance improvements over its vanilla counterpart on MMLU. These models were trained on WebInstruct-Sub, a dataset well suited to knowledge retrieval.
Experiment 4 - Effectiveness of Domain-specific Distillation
In this experiment, we distilled Arcee-Agent into Qwen2-1.5B-Instruct using the same dataset that was originally used to train the teacher model. Arcee-Agent, a 7B-parameter model engineered for function calling and tool use, showed substantial performance gains through distillation. These results highlight the potential of using the same training data for both the teacher and the student, unlocking even greater performance improvements. They also demonstrate that distillation is effective not only for general-purpose capabilities but also for domain-specific tasks.
Main Take-aways
Both logit-based and hidden states-based distillation methods show improvements over standard SFT across most benchmarks. The performance uplift is consistent across a variety of datasets and training conditions.
Arcee-Labs
This release marks the debut of Arcee-Labs, a division of Arcee AI dedicated to accelerating open-source research. Our mission is to rapidly deploy resources, models, and research findings to empower both Arcee AI and the wider community.
In an era of increasingly frequent breakthroughs in LLM research, models, and techniques, we recognize the need for agility and adaptability. Through our efforts, we strive to significantly contribute to the advancement of open-source AI technology and support the community in keeping pace with these rapid developments.
Future Directions
We are excited to see how the community will use and improve DistillKit. It is quite simple in its current form, and can be easily adapted for many use cases and training scenarios. Future releases will include CPT and DPO distillation methods, and we welcome community contributions in the form of new distillation methods, training routine improvements, and memory optimizations.
For more information and to get started, visit the Arcee AI repository on GitHub. We encourage you to share your findings, improvements, and suggestions with the community.