Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, Ivan Vulić


Abstract
Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are widely used for cross-lingual transfer learning. While these are pretrained to represent hundreds of languages, end users of NLP systems are often interested only in individual languages. For such purposes, the MMTs’ language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost. We thus propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer. This is achieved by distilling the MMT *bilingually*, i.e., using data from only the source and target language of interest. Specifically, we use a two-phase distillation approach, termed BiStil: (i) the first phase distils a general bilingual model from the MMT, while (ii) the second, task-specific phase sparsely fine-tunes the bilingual “student” model using a task-tuned variant of the original MMT as its “teacher”. We evaluate this distillation technique in zero-shot cross-lingual transfer across a number of standard cross-lingual benchmarks. The key results indicate that the distilled models exhibit minimal degradation in target language performance relative to the base MMT despite being significantly smaller and faster. Furthermore, we find that they outperform multilingually distilled models such as DistilmBERT and MiniLMv2 while having a very modest training budget in comparison, even on a per-language basis. We also show that bilingual models distilled from MMTs greatly outperform bilingual models trained from scratch.
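The abstract above outlines the two-phase BiStil recipe. As a rough illustration only, the sketch below shows how such a pipeline could be wired up in plain PyTorch: Phase 1 distils a general bilingual student from the frozen MMT teacher on unlabeled source- and target-language text, and Phase 2 distils from a task-tuned copy of the MMT while updating only a sparse subset of the student's parameters. The function names, temperature, mixing weight alpha, and gradient-mask treatment of sparse fine-tuning are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of the two-phase bilingual distillation described in the
# abstract, using placeholder PyTorch models. All names and hyperparameters
# here are illustrative assumptions, not the paper's official code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KL loss between teacher and student output distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def general_distil_step(student, teacher, batch, optimizer):
    """Phase 1: distil a general bilingual student from the frozen MMT
    teacher on unlabeled source- and target-language text."""
    with torch.no_grad():
        teacher_logits = teacher(batch)          # frozen multilingual teacher
    student_logits = student(batch)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def task_distil_step(student, task_teacher, batch, labels, optimizer,
                     masks, alpha=0.5):
    """Phase 2: task-specific distillation from a task-tuned copy of the MMT,
    updating only a sparse subset of the student's parameters."""
    with torch.no_grad():
        teacher_logits = task_teacher(batch)     # MMT fine-tuned on the source-language task
    student_logits = student(batch)
    loss = (alpha * distillation_loss(student_logits, teacher_logits)
            + (1 - alpha) * F.cross_entropy(student_logits, labels))
    optimizer.zero_grad()
    loss.backward()
    # Sparse fine-tuning (assumed here as gradient masking): zero the
    # gradients of every parameter outside the chosen sparse mask.
    for p, m in zip(student.parameters(), masks):
        if p.grad is not None:
            p.grad.mul_(m)
    optimizer.step()
    return loss.item()
```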
Anthology ID: 2023.findings-acl.517
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8147–8165
URL: https://aclanthology.org/2023.findings-acl.517
DOI: 10.18653/v1/2023.findings-acl.517
Cite (ACL): Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, and Ivan Vulić. 2023. Distilling Efficient Language-Specific Models for Cross-Lingual Transfer. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8147–8165, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Distilling Efficient Language-Specific Models for Cross-Lingual Transfer (Ansell et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.517.pdf