On the Importance of Tokenization in Arabic Embedding Models

Mohamed Alkaoud, Mairaj Syed


Abstract
Arabic, like other highly inflected languages, encodes a large amount of information in its morphology and word structure. In this work, we propose two embedding strategies that modify the tokenization phase of traditional word embedding models (Word2Vec) and contextual word embedding models (BERT) to account for Arabic's relatively complex morphology. In Word2Vec, we segment words into subwords at training time and then compose word-level representations from the subwords at test time. We train our embeddings on Arabic Wikipedia and show that they perform better than a Word2Vec model on multiple Arabic natural language processing datasets while being approximately 60% smaller. Moreover, we showcase our embeddings' ability to produce accurate representations of some out-of-vocabulary words. In BERT, we modify the tokenization layer of Google's pretrained multilingual BERT model by incorporating morphological information. By doing so, we achieve state-of-the-art performance on two Arabic NLP datasets without pretraining.
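
As a rough illustration of the Word2Vec strategy described in the abstract, the sketch below composes a word-level vector by averaging the vectors of its morphological segments, so a full word unseen during training can still receive a representation as long as its subwords were seen. The segment names, the toy vectors, and the choice of averaging as the composition function are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Toy table of subword vectors; in practice these would come from a
# Word2Vec model trained on morphologically segmented Arabic text
# (e.g. Arabic Wikipedia, as in the paper).
DIM = 4
subword_vectors = {
    "wa+": np.array([0.1, 0.2, 0.0, 0.3]),    # proclitic "and"
    "kitab": np.array([0.5, 0.1, 0.4, 0.2]),  # stem "book"
    "+hum": np.array([0.0, 0.3, 0.1, 0.1]),   # enclitic "their"
}

def compose(subwords, table, dim=DIM):
    """Compose a word-level vector by averaging its subword vectors.
    Averaging is an illustrative choice; the paper's composition
    function may differ."""
    vecs = [table[s] for s in subwords if s in table]
    if not vecs:
        return np.zeros(dim)  # fully unknown word
    return np.mean(vecs, axis=0)

# "wakitabuhum" ("and their book") may be out-of-vocabulary as a full
# word, but its segments are covered, so a vector can still be composed.
word_vec = compose(["wa+", "kitab", "+hum"], subword_vectors)
print(word_vec)
```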
Anthology ID: 2020.wanlp-1.11
Volume: Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue: WANLP
Publisher: Association for Computational Linguistics
Pages: 119–129
URL: https://aclanthology.org/2020.wanlp-1.11
Cite (ACL): Mohamed Alkaoud and Mairaj Syed. 2020. On the Importance of Tokenization in Arabic Embedding Models. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 119–129, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal): On the Importance of Tokenization in Arabic Embedding Models (Alkaoud & Syed, WANLP 2020)
PDF: https://aclanthology.org/2020.wanlp-1.11.pdf
Data: LABR