FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models

Konstantin Dobler, Gerard de Melo


Abstract
Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model’s embedding matrix. In this paper, we propose FOCUS (Fast Overlapping Token Combinations Using Sparsemax), a novel embedding initialization method that effectively initializes the embedding matrix for a new tokenizer based on information in the source model’s embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work on language modeling and on a range of downstream tasks (NLI, QA, and NER). We publish our model checkpoints and code on GitHub.
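The recipe described in the abstract can be summarized in a short sketch. The following is a minimal illustration of the idea, not the authors' released implementation; the names `aux_emb`, `src_emb`, `overlap`, and `src_vocab` are hypothetical, and the auxiliary embeddings are assumed to be static token embeddings (e.g., fastText trained on target-language text tokenized with the target tokenizer) in which both new and overlapping tokens live in one space.

```python
# Minimal sketch of FOCUS-style embedding initialization (assumptions noted above).
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Sparsemax (Martins & Astudillo, 2016): like softmax, but returns a
    sparse probability distribution with many exact zeros."""
    z_sorted = np.sort(z)[::-1]                 # sort scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.shape[0] + 1)
    support = 1 + k * z_sorted > cumsum         # contiguous prefix of supported entries
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_max   # threshold
    return np.maximum(z - tau, 0.0)

def focus_init(new_token, overlap, aux_emb, src_emb, src_vocab):
    """Initialize a new token's embedding as a sparse convex combination of the
    source embeddings of overlapping tokens, weighted by similarity in the
    auxiliary static embedding space."""
    a = aux_emb[new_token]
    overlap_aux = np.stack([aux_emb[t] for t in overlap])
    # Cosine similarity between the new token and every overlapping token.
    sims = overlap_aux @ a / (
        np.linalg.norm(overlap_aux, axis=1) * np.linalg.norm(a) + 1e-12
    )
    weights = sparsemax(sims)                   # most weights become exactly zero
    overlap_src = np.stack([src_emb[src_vocab[t]] for t in overlap])
    return weights @ overlap_src                # weighted combination of source embeddings
```

In this sketch, tokens present in both vocabularies would simply copy their source embeddings; only genuinely new tokens are initialized via the weighted combination. Because sparsemax zeroes out all but the most similar overlapping tokens, each new embedding is a sparse convex combination rather than a blend over the whole overlap.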
Anthology ID:
2023.emnlp-main.829
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13440–13454
URL:
https://aclanthology.org/2023.emnlp-main.829
DOI:
10.18653/v1/2023.emnlp-main.829
Cite (ACL):
Konstantin Dobler and Gerard de Melo. 2023. FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13440–13454, Singapore. Association for Computational Linguistics.
Cite (Informal):
FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models (Dobler & de Melo, EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.829.pdf
Video:
https://aclanthology.org/2023.emnlp-main.829.mp4