Hints on the data for language modeling of synthetic languages with transformers

Rodolfo Zevallos, Nuria Bel


Abstract
Language models (LMs) are increasingly useful for providing the representations on which Natural Language Processing applications are trained. However, there is now clear evidence that attention-based transformers require a critical amount of language data to produce good enough LMs. This paper addresses to what extent that critical amount of data varies across languages of different morphological typology, in particular those with rich inflectional morphology, and whether the tokenization method used to preprocess the data makes a difference. These details can be important for low-resource languages for which the production of datasets must be planned. We evaluated five languages, intrinsically and extrinsically, with different pretraining dataset sizes and three tokenization methods for each. The results confirm that vocabulary size, as determined by morphological characteristics, is directly correlated with both LM perplexity and performance on two typical downstream tasks, named entity recognition (NER) and part-of-speech (POS) tagging. The experiments also provide new evidence that, for a polysynthetic language like Quechua, a canonical tokenizer can reduce perplexity by more than half and raise F1 from 0.8 to more than 0.9 in both downstream tasks with an LM trained on only 6M tokens.
Anthology ID:
2023.acl-long.699
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
12508–12522
URL:
https://aclanthology.org/2023.acl-long.699
DOI:
10.18653/v1/2023.acl-long.699
Cite (ACL):
Rodolfo Zevallos and Nuria Bel. 2023. Hints on the data for language modeling of synthetic languages with transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12508–12522, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Hints on the data for language modeling of synthetic languages with transformers (Zevallos & Bel, ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.699.pdf
Video:
https://aclanthology.org/2023.acl-long.699.mp4