2022
CILDA: Contrastive Data Augmentation Using Intermediate Layer Knowledge Distillation
Md Akmal Haidar | Mehdi Rezagholizadeh | Abbas Ghaddar | Khalil Bibi | Philippe Langlais | Pascal Poupart
Proceedings of the 29th International Conference on Computational Linguistics
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging contrastive learning, intermediate layer distillation, data augmentation, and adversarial training. In this work, we propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used to improve the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching using a contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.
RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
Md Akmal Haidar | Nithin Anchuri | Mehdi Rezagholizadeh | Abbas Ghaddar | Philippe Langlais | Pascal Poupart
Findings of the Association for Computational Linguistics: NAACL 2022
Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models), especially for large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and the engineering effort required to set up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection ensures that all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. We also show that it acts as a regularizer that improves the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach considerably outperforms other state-of-the-art intermediate layer KD methods in both performance and training time.
2021
Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation
Yimeng Wu | Mehdi Rezagholizadeh | Abbas Ghaddar | Md Akmal Haidar | Ali Ghodsi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Intermediate layer matching has been shown to be an effective approach for improving knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two different networks (i.e., student and teacher), which lacks clear interpretability. Moreover, intermediate layer KD cannot easily deal with other problems such as layer mapping search and architecture mismatch (i.e., it requires the teacher and student to be of the same model type). To tackle these problems together, we propose Universal-KD, which matches intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via attention-based layer projection. Our unified approach has three merits: (i) it can be flexibly combined with current intermediate layer distillation techniques to improve their results; (ii) the pseudo classifiers of the teacher can be deployed instead of extra, expensive teacher assistant networks to address the capacity gap problem in KD, a common issue when the gap between the sizes of the teacher and student networks becomes too large; (iii) it can be used in cross-architecture intermediate layer KD. We conducted comprehensive experiments on distilling BERT-base into BERT-4, RoBERTa-large into DistilRoBERTa, and BERT-base into CNN and LSTM-based models. Results on the GLUE tasks show that our approach is able to outperform other KD techniques.
2020
Improving Word Embedding Factorization for Compression Using Distilled Nonlinear Neural Decomposition
Vasileios Lioutas | Ahmad Rashid | Krtin Kumar | Md. Akmal Haidar | Mehdi Rezagholizadeh
Findings of the Association for Computational Linguistics: EMNLP 2020
Word embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory, which poses a challenge for edge deployment. Embedding matrices typically contain most of the parameters for language models and about a third for machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word embedding, and then we fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation and language modeling, using different datasets with a shared word-embedding matrix for both embedding and vocabulary projection matrices. We show that the proposed technique is simple to replicate, with one fixed parameter controlling compression size, and achieves a higher BLEU score on translation and lower perplexity on language modeling than complex, difficult-to-tune state-of-the-art methods.
2019
Latent Code and Text-based Generative Adversarial Networks for Soft-text Generation
Md. Akmal Haidar | Mehdi Rezagholizadeh | Alan Do-Omri | Ahmad Rashid
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Text generation with generative adversarial networks (GANs) can be divided into text-based and code-based categories according to the type of signals used for discrimination. In this work, we introduce a novel text-based approach called Soft-GAN to effectively exploit the GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used to provide a continuous representation of sentences, which we refer to as soft-text. This soft representation is used in GAN discrimination to synthesize similar soft-texts. We also propose hybrid latent code and text-based GAN (LATEXT-GAN) approaches with one or more discriminators, in which a combination of the latent code and the soft-text is used for GAN discrimination. We perform a number of subjective and objective experiments on two well-known datasets (SNLI and Image COCO) to validate our techniques. We discuss the results using several evaluation metrics and show that the proposed techniques outperform traditional GAN-based text-generation methods.
Bilingual-GAN: A Step Towards Parallel Text Generation
Ahmad Rashid | Alan Do-Omri | Md. Akmal Haidar | Qun Liu | Mehdi Rezagholizadeh
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation
Latent-space-based GAN methods and attention-based sequence-to-sequence models have achieved impressive results in text generation and unsupervised machine translation, respectively. Leveraging the two domains, we propose an adversarial latent-space-based model capable of generating parallel sentences in two languages concurrently and translating bidirectionally. The bilingual generation goal is achieved by sampling from the latent space that is shared between both languages. First, two denoising autoencoders are trained, with shared encoders and back-translation to enforce a shared latent state between the two languages. The decoder is shared for the two translation directions. Next, a GAN is trained to generate synthetic ‘code’ mimicking the languages’ shared latent space. This code is then fed into the decoder to generate text in either language. We perform our experiments on the Europarl and Multi30k datasets, on the English-French language pair, and document our performance using both supervised and unsupervised machine translation.
2014
Interpolated Dirichlet Class Language Model for Speech Recognition Incorporating Long-distance N-grams
Md. Akmal Haidar | Douglas O’Shaughnessy
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers