Abbas Ghaddar


2021

pdf bib
Knowledge Distillation with Noisy Labels for Natural Language Understanding
Shivendra Bhardwaj | Abbas Ghaddar | Ahmad Rashid | Khalil Bibi | Chengyang Li | Ali Ghodsi | Phillippe Langlais | Mehdi Rezagholizadeh
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise under the KD.

pdf bib
Towards Zero-Shot Knowledge Distillation for Natural Language Processing
Ahmad Rashid | Vasileios Lioutas | Abbas Ghaddar | Mehdi Rezagholizadeh
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Knowledge distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep learning based natural language processing (NLP) solutions. In its regular manifestations, KD requires access to the teacher’s training data for knowledge transfer to the student network. However, privacy concerns, data regulations and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher’s output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher’s classification score (accuracy or F1) while compressing the model 30 times.

pdf bib
Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation
Yimeng Wu | Mehdi Rezagholizadeh | Abbas Ghaddar | Md Akmal Haidar | Ali Ghodsi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Intermediate layer matching is shown as an effective approach for improving knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two different networks (i.e. student and teacher), which lacks clear interpretability. Moreover, intermediate layer KD cannot easily deal with other problems such as layer mapping search and architecture mismatch (i.e. it requires the teacher and student to be of the same model type). To tackle the aforementioned problems all together, we propose Universal-KD to match intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via the attention-based layer projection. By doing this, our unified approach has three merits: (i) it can be flexibly combined with current intermediate layer distillation techniques to improve their results (ii) the pseudo classifiers of the teacher can be deployed instead of extra expensive teacher assistant networks to address the capacity gap problem in KD which is a common issue when the gap between the size of the teacher and student networks becomes too large; (iii) it can be used in cross-architecture intermediate layer KD. We did comprehensive experiments in distilling BERT-base into BERT-4, RoBERTa-large into DistilRoBERTa and BERT-base into CNN and LSTM-based models. Results on the GLUE tasks show that our approach is able to outperform other KD techniques.

pdf bib
Context-aware Adversarial Training for Name Regularity Bias in Named Entity Recognition
Abbas Ghaddar | Philippe Langlais | Ahmad Rashid | Mehdi Rezagholizadeh
Transactions of the Association for Computational Linguistics, Volume 9

Abstract In this work, we examine the ability of NER models to use contextual information when predicting the type of an ambiguous entity. We introduce NRB, a new testbed carefully designed to diagnose Name Regularity Bias of NER models. Our results indicate that all state-of-the-art models we tested show such a bias; BERT fine-tuned models significantly outperforming feature-based (LSTM-CRF) ones on NRB, despite having comparable (sometimes lower) performance on standard benchmarks. To mitigate this bias, we propose a novel model-agnostic training method that adds learnable adversarial noise to some entity mentions, thus enforcing models to focus more strongly on the contextual signal, leading to significant gains on NRB. Combining it with two other training strategies, data augmentation and parameter freezing, leads to further gains.

pdf bib
End-to-End Self-Debiasing Framework for Robust NLU Training
Abbas Ghaddar | Phillippe Langlais | Mehdi Rezagholizadeh | Ahmad Rashid
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation
Peng Lu | Abbas Ghaddar | Ahmad Rashid | Mehdi Rezagholizadeh | Ali Ghodsi | Philippe Langlais
Findings of the Association for Computational Linguistics: EMNLP 2021

Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses apriori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.

2020

pdf bib
SEDAR: a Large Scale French-English Financial Domain Parallel Corpus
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable to study domain adaptation for neural machine translation. The first release of the corpus comprises 8.6 million high quality sentence pairs that are publicly available for research at https://github.com/autorite/sedar-bitext.

2019

pdf bib
Contextualized Word Representations from Distant Supervision with and for NER
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We describe a special type of deep contextualized word representation that is learned from distant supervision annotations and dedicated to named entity recognition. Our extensive experiments on 7 datasets show systematic gains across all domains over strong baselines, and demonstrate that our representation is complementary to previously proposed embeddings. We report new state-of-the-art results on CONLL and ONTONOTES datasets.

2018

pdf bib
Transforming Wikipedia into a Large-Scale Fine-Grained Entity Type Corpus
Abbas Ghaddar | Philippe Langlais
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Robust Lexical Features for Improved Neural Network Named-Entity Recognition
Abbas Ghaddar | Phillippe Langlais
Proceedings of the 27th International Conference on Computational Linguistics

Neural network approaches to Named-Entity Recognition reduce the need for carefully hand-crafted features. While some features do remain in state-of-the-art systems, lexical features have been mostly discarded, with the exception of gazetteers. In this work, we show that this is unfair: lexical features are actually quite useful. We propose to embed words and entity types into a low-dimensional vector space we train from annotated data produced by distant supervision thanks to Wikipedia. From this, we compute — offline — a feature vector representing each word. When used with a vanilla recurrent neural network model, this representation yields substantial improvements. We establish a new state-of-the-art F1 score of 87.95 on ONTONOTES 5.0, while matching state-of-the-art performance with a F1 score of 91.73 on the over-studied CONLL-2003 dataset.

2017

pdf bib
WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition
Abbas Ghaddar | Phillippe Langlais
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We revisit the idea of mining Wikipedia in order to generate named-entity annotations. We propose a new methodology that we applied to English Wikipedia to build WiNER, a large, high quality, annotated corpus. We evaluate its usefulness on 6 NER tasks, comparing 4 popular state-of-the art approaches. We show that LSTM-CRF is the approach that benefits the most from our corpus. We report impressive gains with this model when using a small portion of WiNER on top of the CONLL training material. Last, we propose a simple but efficient method for exploiting the full range of WiNER, leading to further improvements.

2016

pdf bib
WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
Abbas Ghaddar | Phillippe Langlais
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents WikiCoref, an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Our annotation scheme follows the one of OntoNotes with a few disparities. We annotated each markable with coreference type, mention type and the equivalent Freebase topic. Since most similar annotation efforts concentrate on very specific types of written text, mainly newswire, there is a lack of resources for otherwise over-used Wikipedia texts. The corpus described in this paper addresses this issue. We present a freely available resource we initially devised for improving coreference resolution algorithms dedicated to Wikipedia texts. Our corpus has no restriction on the topics of the documents being annotated, and documents of various sizes have been considered for annotation.

pdf bib
Coreference in Wikipedia: Main Concept Resolution
Abbas Ghaddar | Phillippe Langlais
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning