torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free Deep Learning Studies: A Case Study on NLP

Reproducibility in scientific work has become increasingly important in research communities such as machine learning, natural language processing, and computer vision due to the rapid development of these domains, driven by recent advances in deep learning. In this work, we present a significantly upgraded version of torchdistill, a modular, configuration-driven, coding-free deep learning framework whose initial release supported only image classification and object detection tasks for reproducible knowledge distillation experiments. To demonstrate that the upgraded framework can support more tasks with third-party libraries, we reproduce the GLUE benchmark results of BERT models using a script based on the upgraded torchdistill, harmonizing with various Hugging Face libraries. All 27 fine-tuned BERT models and the configurations to reproduce the results are published at Hugging Face, and the model weights have already been widely used in research communities. We also reimplement popular small-sized models and new knowledge distillation methods and perform additional experiments for computer vision tasks.

To address this serious problem, research communities introduced reproducibility checklists. At the time of writing, some venues require authors to complete checklists when submitting their work, e.g., the Responsible NLP Research Checklist (Rogers et al., 2021) at NLP venues (ACL, NAACL, ARR) and the Paper Checklist at NeurIPS. Matsubara (2021) developed torchdistill, a modular, configuration-driven knowledge distillation framework built on PyTorch (Paszke et al., 2019) for reproducible deep learning research. Knowledge distillation (Hinton et al., 2014) is a well-known model compression method that usually trains a small model (called the student) by leveraging outputs from a more complex model (called the teacher) as part of the loss functions to be minimized. Recent knowledge distillation approaches are more complex, e.g., using intermediate layers' outputs (embeddings or feature maps) besides the final output (logits) of teacher models, with auxiliary module branches attached to teacher and/or student models during training (Kim et al., 2018; Zhang et al., 2020; Chen et al., 2021), using multiple teachers (Mirzadeh et al., 2020; Matsubara et al., 2022b), and training multilingual or non-English models solely with an English teacher model (Reimers and Gurevych, 2020; Li et al., 2022b; Gupta et al., 2023).
To implement such approaches, researchers unpacked existing model implementations and modified their input-output interfaces to extract intermediate outputs and/or hard-code new auxiliary modules (trainable modules used only during training) (Zagoruyko and Komodakis, 2016; Passalis and Tefas, 2018; Heo et al., 2019; Park et al., 2019; Tian et al., 2019; Xu et al., 2020; Chen et al., 2021). torchdistill (Matsubara, 2021) was initially designed as a unified knowledge distillation framework that enables users to design experiments with declarative PyYAML configuration files without such hard-coding effort and helps researchers complete the ML Code Completeness Checklist for high-quality, reproducible knowledge distillation studies. One of its key concepts is that a declarative PyYAML configuration file designs an experiment and explains the key hyperparameters and components used in the experiment. While the initial framework is well generalized and supports 18 different knowledge distillation methods implemented in a unified way, its implementation is highly dependent on torchvision, a package of popular datasets, model architectures, and common image transformations for computer vision tasks.
In this work, we significantly upgrade torchdistill from the initial framework (Matsubara, 2021) to enable further generalized implementations, support more flexible module abstractions, and enhance the advantage of declarative PyYAML configuration files for designing experiments with third-party packages of the user's choice, as promised in Matsubara (2021). Using GLUE tasks (Wang et al., 2019) as an example, we demonstrate that the upgraded torchdistill and a new script harmonize with Hugging Face Transformers (Wolf et al., 2020), Datasets (Lhoest et al., 2021), Accelerate (Gugger et al., 2022), and Evaluate (Von Werra et al., 2022) to reproduce the GLUE test results reported in Devlin et al. (2019) by fine-tuning pretrained BERT-Base and BERT-Large models with the upgraded torchdistill. We also conduct knowledge distillation experiments using the fine-tuned BERT-Large models as teachers to train BERT-Base models. All these experiments are performed on Google Colaboratory. We also publish all the code and configuration files at GitHub and the trained model weights and training logs at Hugging Face for reproducibility and to help researchers build on this work. Our BERT models fine-tuned for the GLUE tasks have already been downloaded 138,000 times in total and are widely used in research communities, not only in research papers but also in tutorials of deep learning frameworks and at ACL 2022. Besides the NLP tasks, we reimplement popular small-sized computer vision models and a few more recent knowledge distillation methods as part of torchdistill, and perform additional experiments to demonstrate that the upgraded torchdistill still supports computer vision tasks.

Related Work
In this section, we briefly summarize related work on open source software that supports end-to-end research frameworks. Yang et al. (2018) propose Anserini, an information retrieval toolkit built on Lucene for reproducible information retrieval research. Pyserini (Lin et al., 2021) is a Python toolkit built on PyTorch (Paszke et al., 2019) and Faiss (Johnson et al., 2019) for reproducible information retrieval research with sparse and dense representations, and its sparse representation-based retrieval support comes from Lucene via Anserini.
AllenNLP (Gardner et al., 2018) is a toolkit built on PyTorch for research on deep learning methods in NLP, designed to lower barriers to high-quality NLP research, e.g., through useful NLP module abstractions and the ability to define experiments using declarative configuration files. Highly inspired by AllenNLP, Matsubara (2021) designs torchdistill, a modular, configuration-driven framework built on PyTorch for reproducible knowledge distillation studies. Similar to AllenNLP, torchdistill enables users to design experiments with declarative PyYAML configuration files and supports high-level module abstractions. For image classification and object detection tasks, its generalized starter scripts and configurations help users implement knowledge distillation methods without much coding cost. Matsubara (2021) also reimplements 18 knowledge distillation methods with torchdistill and points out that the standard knowledge distillation (Hinton et al., 2014) can outperform many recent state-of-the-art knowledge distillation methods for a popular teacher-student pair with the ILSVRC 2012 dataset (Russakovsky et al., 2015). In Section 3, we describe the major upgrades in torchdistill from the initial release (Matsubara, 2021).

Major Upgrades from the Initial Release
In this section, we summarize the major upgrades from the initial release of torchdistill (Matsubara, 2021). Figure 1 highlights high-level differences between the initial design (Matsubara, 2021) of torchdistill and the largely upgraded version in this work. The initial torchdistill depends on PyTorch and torchvision and contains key modules and functionalities specifically designed to support image classification and object detection tasks. For example, the dataset modules that the initial version officially supports are only those in torchvision, and some dataset-relevant functionalities, such as building a sequence of data transforms and a dataset loader, are based on datasets in torchvision.
In this work, we make torchdistill less dependent on torchvision and support more tasks with third-party packages of users' choice by generalizing some of the key components in the framework and exporting task-specific implementations to the corresponding executable scripts and local packages. We also reimplement popular small-sized models whose official PyTorch implementations are either not available or no longer maintained.

PyYAML-based Instantiation
A declarative PyYAML configuration file plays an important role in torchdistill. Users can design experiments with a declarative PyYAML configuration file, which defines various types of abstracted modules with hyperparameters, such as dataset, model, optimizer, scheduler, and loss modules. To allow more flexibility in PyYAML configurations, we add more useful constructors, such as one for importing arbitrary local packages to register modules without editing the executable script, and one for instantiating an arbitrary class with a log message. These operations are performed at the very beginning of an experiment, when the PyYAML configuration file is loaded, and they make the configuration files more self-explanatory, since the configuration format used for the initial version does not explicitly tell users whether an experiment needs specific local packages. These features also help us generalize the ways to define key modules such as datasets and their components (e.g., pre-processing transforms, samplers).
Figure 2 shows an example that builds a sequence of image/tensor transforms with the initial version and with torchdistill in this work. While the former requires both a Python function specifically designed for torchvision modules (build_transform) and a list of dict objects defined in a PyYAML configuration to be given to the function (as transform_params_config), the latter can build exactly the same transform when loading the PyYAML configuration and store the instantiated object as part of a dict object with a transform key.
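The core mechanism behind such a constructor is importing a class by its fully qualified name and instantiating it with keyword arguments taken from the configuration. The following is a minimal sketch of that idea in plain Python, not torchdistill's actual implementation; the function name and the example class are illustrative.

```python
import importlib


def import_and_instantiate(type_path, **init_kwargs):
    """Import a class by its fully qualified name and instantiate it.

    This mirrors the spirit of a PyYAML constructor like !import_call:
    the configuration names a class and its keyword arguments, and the
    loader builds the object while parsing the file.
    """
    module_path, _, class_name = type_path.rpartition('.')
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**init_kwargs)


# e.g., what a loader could do for a (hypothetical) config entry like
#   transform: !import_call
#     type: 'datetime.date'
#     init: {year: 2023, month: 7, day: 1}
obj = import_and_instantiate('datetime.date', year=2023, month=7, day=1)
print(obj.isoformat())  # 2023-07-01
```

Because only a dotted path and keyword arguments are needed, the same mechanism works for classes in any installed third-party package, without the framework hard-coding a builder function per package.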

Generalized Modules for Supporting More Tasks

The PyYAML-based instantiation feature described in Section 3.1 enables us to remove the torchvision-specific modules mentioned in Section 3 (e.g., build_transform in Fig. 2) so that we can reduce torchdistill's dependency on torchvision and generalize its modules to support more tasks.
The initial version of torchdistill is designed to support image classification and object detection tasks based on torchvision, and torchvision models for these tasks, such as ResNet (He et al., 2016) and Faster R-CNN (Ren et al., 2015), require an image (tensor) and an annotation as part of the model inputs during training. However, this interface does not generalize well to other tasks. Taking a text classification task as an example, Transformer (Vaswani et al., 2017) models in Hugging Face Transformers (Wolf et al., 2020) have many more input data fields, such as (but not limited to) token IDs, attention mask, token type IDs, position IDs, and labels for BERT (Devlin et al., 2019), and different models have different input data fields, e.g., BART (Lewis et al., 2020) has additional input data fields such as token IDs for its decoder.
To support diverse models and tasks, we generalize the interfaces of model input/output and the subsequent processes in torchdistill, such as computing training losses. To demonstrate that the upgraded torchdistill can support more tasks, we provide starter scripts based on the upgraded framework for GLUE (Wang et al., 2019) and semantic segmentation tasks. For the GLUE tasks, the script harmonizes popular Python libraries with torchdistill, namely Hugging Face Transformers (Wolf et al., 2020), Datasets (Lhoest et al., 2021), and Evaluate (Von Werra et al., 2022) for model, dataset, and evaluation modules. We also leverage Accelerate (Gugger et al., 2022) for efficient training and inference. In Section 4.1, we demonstrate GLUE experiments with torchdistill and the third-party libraries.
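One way to picture this kind of interface generalization: if a batch is represented as a dict of named fields and unpacked with `**`, a single training loop can drive models whose forward signatures differ. The toy functions below are stand-ins for models with BERT-like and BART-like input fields, not actual Transformers classes; they only illustrate the dispatch pattern.

```python
# Toy "forward passes" with different input fields; each returns a dummy loss.
def bert_like_forward(input_ids, attention_mask, token_type_ids=None, labels=None):
    return float(len(input_ids))


def bart_like_forward(input_ids, attention_mask, decoder_input_ids=None, labels=None):
    return float(len(input_ids) + len(decoder_input_ids or []))


def training_step(model_forward, batch):
    # The loop never hard-codes field names: the dataset decides which
    # fields exist, and the model consumes the ones it understands.
    return model_forward(**batch)


bert_batch = {'input_ids': [101, 2009, 102], 'attention_mask': [1, 1, 1]}
bart_batch = {'input_ids': [0, 31414, 2], 'attention_mask': [1, 1, 1],
              'decoder_input_ids': [2, 0]}
print(training_step(bert_like_forward, bert_batch))  # 3.0
print(training_step(bart_like_forward, bart_batch))  # 5.0
```

The same loop handles both batches; adding a new model with extra fields only requires the dataset to emit those fields, not a change to the loop.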

Reimplemented Models and Methods
We find in recent knowledge distillation studies (Tian et al., 2019; Xu et al., 2020; Chen et al., 2021) that there is still a demand for small models for relatively simple datasets, such as ResNet (He et al., 2016), WRN (Zagoruyko and Komodakis, 2016), and DenseNet (Huang et al., 2017) for image classification tasks with the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), since the official repositories are no longer maintained and/or not implemented with PyTorch.
To help the community conduct better benchmarking, we reimplement these models for the CIFAR-10 and CIFAR-100 datasets as part of torchdistill and attempt to reproduce the reported results following the original training recipes (see Section 4). With the upgraded torchdistill, we also reimplement and test a few more knowledge distillation methods (He et al., 2019; Chen et al., 2021).

Google Colab Demos
In this section, we demonstrate that the upgraded torchdistill can collaborate with third-party libraries to support more tasks. We also attempt to reproduce the CIFAR-10 and CIFAR-100 results reported in the original papers. To lower the barrier to reusing and building on the scripts with torchdistill, we conduct all the experiments on Google Colaboratory, which gives users access to GPUs free of charge. We publish the Jupyter Notebook files to run the experiments as part of the torchdistill repository so that researchers can easily use them.
We attempt to reproduce the GLUE test results reported in a popular study, BERT (Devlin et al., 2019), using the upgraded torchdistill harmonizing with Hugging Face libraries (transformers, datasets, evaluate, and accelerate) (Wolf et al., 2020; Lhoest et al., 2021; Von Werra et al., 2022; Gugger et al., 2022). Following these experiments, we also conduct knowledge distillation experiments that fine-tune pretrained BERT-Base models for the GLUE tasks, using the fine-tuned BERT-Large models as teachers for the knowledge distillation method of Hinton et al. (2014), minimizing

L = α L_CE(ŷ, y) + (1 − α) τ² L_KL(p, q),   (1)

where L_CE is a standard cross entropy, ŷ indicates the student model's estimated class probabilities, and y is the annotated category. L_KL is the Kullback-Leibler divergence, and α and τ are a balancing factor and a temperature, respectively. p and q represent the softened output distributions from the teacher and student models, respectively, and p is used as the target distribution for L_KL. Specifically, p = [p_1, p_2, ..., p_|C|], where C is the set of categories in the target task and p_i indicates the teacher model's softened output value (scalar) for the i-th category:

p_i = exp(v_i / τ) / Σ_{j=1}^{|C|} exp(v_j / τ),   (2)

where τ is one of the hyperparameters defined in Eq. (1) and v_i denotes the teacher's logit value for the i-th category. The same rule applies to q for the student model.
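For concreteness, here is a pure-Python sketch of this loss for a single example, assuming the common formulation in which the KL term is weighted by (1 − α) and scaled by τ²; the actual experiments use torchdistill's implementation, and the hyperparameter values below are illustrative.

```python
import math


def softmax(logits, tau=1.0):
    # Temperature-scaled softmax: p_i = exp(v_i / tau) / sum_j exp(v_j / tau).
    # Subtracting the max keeps exp() numerically stable.
    m = max(v / tau for v in logits)
    exps = [math.exp(v / tau - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, tau=4.0):
    """Single-example sketch of L = alpha * CE + (1 - alpha) * tau^2 * KL(p || q),
    with p/q the teacher/student distributions softened by temperature tau."""
    y_hat = softmax(student_logits)        # hard cross entropy uses tau = 1
    ce = -math.log(y_hat[label])
    p = softmax(teacher_logits, tau)       # teacher target distribution
    q = softmax(student_logits, tau)       # student's softened distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return alpha * ce + (1.0 - alpha) * tau ** 2 * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted cross entropy on the hard label remains, which matches the intuition that the teacher only adds a training signal where the two models disagree.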
Table 1 shows the GLUE test results reported by Devlin et al. (2019) and those obtained from the GLUE Benchmark for our three configurations: fine-tuning pretrained BERT-Base (FT, Ours) and pretrained BERT-Large (FT, Ours) models, and knowledge distillation to fine-tune a pretrained BERT-Base (KD, Ours) as a student, using the fine-tuned BERT-Large as the teacher. Note that Devlin et al. (2019) do not report results for the WNLI test dataset.
Overall, our fine-tuned BERT-Base and BERT-Large models (initialized from https://huggingface.co/bert-base-uncased and https://huggingface.co/bert-large-uncased) achieved GLUE test results (https://gluebenchmark.com/) comparable to the official test results reported by Devlin et al. (2019). The knowledge distillation method (Hinton et al., 2014) helped BERT-Base models improve their performance on most of the tasks, compared to those fine-tuned without the teacher models. All the trained model weights and training logs are published at Hugging Face, and the training configurations are published as part of the torchdistill GitHub repository. The fine-tuned BERT models we published are widely used in the research communities and have already been downloaded about 138,000 times in total at the time of writing. For instance, some of the models are used for benchmarks, ensembling, model quantization, and token pruning (Matena and Raffel, 2022; Church et al., 2022; Guo et al., 2022; Lee et al., 2022).

CIFAR-10 and CIFAR-100
We also attempt to reproduce the CIFAR-10 and CIFAR-100 results reported in He et al. (2016), Zagoruyko and Komodakis (2016), and Huang et al. (2017), using the upgraded torchdistill with the reimplemented ResNet, WRN, and DenseNet models. We follow the original papers and reuse their hyperparameter choices and training recipes, such as data augmentations. Note that we do not consider models that cannot fit into the GPU memory that Google Colab offers, e.g., ResNet-1202 (He et al., 2016) for CIFAR-10 and DenseNet-BC (k = 24 and k = 40) (Huang et al., 2017) for CIFAR-10 and CIFAR-100.
Tables 2 and 3 compare the results reported in the original papers with those we reproduced for the CIFAR-10 and CIFAR-100 test datasets, respectively. We confirm that for most of the reimplemented models, our results are comparable to those reported in the original papers. The model weights and training configuration files are publicly available, and users can automatically download the weights via the upgraded torchdistill PyPI package.

ILSVRC 2012
As highlighted in Section 3, torchdistill initially focused on supporting implementations of diverse knowledge distillation methods in a unified way and depended on torchvision to specifically support image classification and object detection tasks with its relevant modules (see Fig. 1). To demonstrate that the upgraded torchdistill still preserves this feature, we reimplement a few more knowledge distillation methods with the upgraded torchdistill: the knowledge review (KR) framework (Chen et al., 2021) and knowledge translation and adaptation with affinity distillation (KTAAD) (He et al., 2019). Note that Matsubara (2021) presents the results of various knowledge distillation methods reimplemented with the initial version of torchdistill for the ILSVRC 2012 and COCO 2017 (Lin et al., 2014) datasets.
Those results are not included in this work, and we refer interested readers to Matsubara (2021). Chen et al. (2021) demonstrate that the KR method can outperform other knowledge distillation methods using ResNet-34 and ResNet-18 (He et al., 2016), a popular pair of teacher and student models for the ImageNet (ILSVRC 2012) dataset (Russakovsky et al., 2015). Using the reimplemented KR method based on the upgraded torchdistill with the hyperparameters in Chen et al. (2021), we successfully reproduce their reported result of ResNet-18 for the ImageNet dataset, as shown in Table 4 (CE: torchvision models pretrained with cross-entropy). The trained model weights and configuration are published as part of the torchdistill repository.

PASCAL VOC 2012 & COCO 2017
The initial torchdistill (Matsubara, 2021) supports image classification and object detection tasks. As mentioned in Section 3.2, we also provide a starter script for semantic segmentation tasks. Using two popular datasets, PASCAL VOC 2012 (Everingham et al., 2012) and COCO 2017 (Lin et al., 2014), we demonstrate that the upgraded torchdistill supports semantic segmentation tasks as well.
In the experiments with the PASCAL VOC 2012 dataset, we use DeepLabv3 (Chen et al., 2017) with ResNet-50 and ResNet-101 backbones (He et al., 2016), using torchvision's model weights pretrained on the COCO 2017 dataset. We choose hyperparameters such as the learning rate policy and crop size based on the original study of DeepLabv3 (Chen et al., 2017). We also examine our reimplemented KTAAD method (He et al., 2019) for the Lite R-ASPP model (LRASPP in torchvision) (Howard et al., 2019) as a student model, using the COCO 2017 dataset and the pretrained DeepLabv3 with ResNet-50 in torchvision as a teacher model, whose mIoU and global pixelwise accuracy are 66.4 and 92.4, respectively. Since the KTAAD method is not tested on the COCO 2017 dataset for LRASPP with a MobileNetV3-Large backbone in the original paper of KTAAD (He et al., 2019), our hyperparameter choice is based on torchvision's reference script. Table 6 presents the semantic segmentation results of LRASPP with a MobileNetV3-Large backbone trained without the teacher model and by the KTAAD method we reimplemented. We confirm that the student model trained by KTAAD outperforms the same model trained on COCO 2017 available in torchvision in terms of mean IoU and global pixelwise accuracy.
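For readers unfamiliar with the two metrics above, the following is a small sketch of how mean IoU and global pixelwise accuracy can be computed from a per-class confusion matrix over flattened pixel predictions; it follows the standard definitions and is not the authors' evaluation code.

```python
def confusion_matrix(preds, targets, num_classes):
    # mat[t][p] counts pixels whose ground-truth class is t and prediction is p.
    mat = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, targets):
        mat[t][p] += 1
    return mat


def mean_iou(mat):
    # Per-class IoU = TP / (TP + FP + FN), averaged over classes seen in the data.
    ious = []
    n = len(mat)
    for c in range(n):
        tp = mat[c][c]
        fn = sum(mat[c]) - tp
        fp = sum(mat[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def global_pixel_acc(mat):
    # Fraction of all pixels assigned their ground-truth class.
    total = sum(sum(row) for row in mat)
    correct = sum(mat[c][c] for c in range(len(mat)))
    return correct / total
```

In practice these are computed incrementally over batches (as in torchvision's reference scripts), but the definitions are the same.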
As with other experiments, the trained model weights and configuration used in this section are published as part of the torchdistill repository. 1

Conclusion
In this work, we significantly upgraded torchdistill (Matsubara, 2021), a modular, configuration-driven framework built on PyTorch (Paszke et al., 2019) for reproducible deep learning and knowledge distillation studies. We enhanced the PyYAML-based instantiation, generalized internal modules to support more tasks, and reimplemented popular models and methods.
To demonstrate that the upgraded framework can support more tasks as we claim, we provided starter scripts for new tasks based on the upgraded framework. One of the new starter scripts supports GLUE tasks (Wang et al., 2019) and harmonizes with Hugging Face Transformers (Wolf et al., 2020), Datasets (Lhoest et al., 2021), Accelerate (Gugger et al., 2022), and Evaluate (Von Werra et al., 2022). Using the script on Google Colaboratory, we reproduced the GLUE test results of fine-tuned BERT models (Devlin et al., 2019) and performed knowledge distillation experiments with our fine-tuned BERT-Large models as teacher models. Similarly, we reproduced the CIFAR-10 and CIFAR-100 results of the popular small-sized models we reimplemented, using Google Colaboratory. Furthermore, we reproduced the result of ResNet-18 trained with the reimplemented KR method (Chen et al., 2021) for the ImageNet dataset. We also demonstrated a new starter script for semantic segmentation tasks using the PASCAL VOC 2012 and COCO 2017 datasets, and showed that the reimplemented KTAAD method (He et al., 2019) improves a pretrained semantic segmentation model in torchvision.
In this study, we also published 27 trained models for NLP tasks and 14 trained models for computer vision tasks. According to the Hugging Face Model repositories, the BERT models fine-tuned for the GLUE tasks have already been downloaded about 138,000 times in total at the time of writing. Research communities leverage torchdistill not only for knowledge distillation studies (Liu et al., 2021; Li et al., 2022a; Lin et al., 2022; Dong et al., 2022; Miles and Mikolajczyk, 2023), but also for the Machine Learning Reproducibility Challenge (MLRC) (Lee and Lee, 2023) and reproducible deep learning studies (Matsubara et al., 2022a,c; Furutanpey et al., 2023b,a; Matsubara et al., 2023). torchdistill is publicly available as a pip-installable PyPI package and will be maintained and upgraded to encourage coding-free, reproducible deep learning and knowledge distillation studies.

Figure 2: Example of two different ways to build a sequence of torchvision transforms (transform) for the CIFAR-10 dataset. The initial version (top, left) defines a function for torchvision, build_transform, in torchdistill and gives the function a list of dict objects in the left PyYAML as transform_params_config. torchdistill in this work (right) can build exactly the same transform by instantiating each of the transform classes step by step with !import_call, one of our pre-defined PyYAML constructors in the upgraded torchdistill.

Table 1: GLUE test results. Our results are hyperlinked to our Hugging Face Model repositories. FT: Fine-Tuning, KD: Knowledge Distillation using BERT-Large (FT, Ours) as the teacher.