Mohamed Alkaoud
2020
On the Importance of Tokenization in Arabic Embedding Models
Mohamed Alkaoud
|
Mairaj Syed
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Arabic, like other highly inflected languages, encodes a large amount of information in its morphology and word structure. In this work, we propose two embedding strategies that modify the tokenization phase of traditional word embedding models (Word2Vec) and contextual word embedding models (BERT) to take into account Arabic’s relatively complex morphology. In Word2Vec, we segment words into subwords during training time and then compose word-level representations from the subwords during test time. We train our embeddings on Arabic Wikipedia and show that they perform better than a Word2Vec model on multiple Arabic natural language processing datasets while being approximately 60% smaller in size. Moreover, we showcase our embeddings’ ability to produce accurate representations of some out-of-vocabulary words that were not encountered before. In BERT, we modify the tokenization layer of Google’s pretrained multilingual BERT model by incorporating information on morphology. By doing so, we achieve state of the art performance on two Arabic NLP datasets without pretraining.
Search