Improving Large-scale Language Models and Resources for Filipino

Jan Christian Blaise Cruz, Charibeth Cheng


Abstract
In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across three classification tasks with varying difficulty.
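For readers who want to try the released models, the sketch below shows how one might load a pretrained Filipino RoBERTa checkpoint and attach a classification head with the Hugging Face transformers library. This is a minimal, hedged example: the Hub identifier "jcblaise/roberta-tagalog-base", the two sample sentences, and the binary label count are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: load a pretrained Filipino RoBERTa checkpoint and attach a
# sequence-classification head, as one would before fine-tuning on a Filipino
# text-classification benchmark. The model identifier below is an assumed
# Hugging Face Hub location; substitute the identifier published by the authors.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "jcblaise/roberta-tagalog-base"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Tokenize a small batch of (hypothetical) Filipino sentences.
texts = [
    "Maganda ang panahon ngayon.",
    "Hindi ko gusto ang serbisyo nila.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Single forward pass; in actual fine-tuning this would sit inside a training
# loop with labels, a loss, and an optimizer.
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # (batch_size, num_labels)
```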
Anthology ID:
2022.lrec-1.703
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6548–6555
URL:
https://aclanthology.org/2022.lrec-1.703
Cite (ACL):
Jan Christian Blaise Cruz and Charibeth Cheng. 2022. Improving Large-scale Language Models and Resources for Filipino. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6548–6555, Marseille, France. European Language Resources Association.
Cite (Informal):
Improving Large-scale Language Models and Resources for Filipino (Cruz & Cheng, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.703.pdf
Data
CCAligned, NewsPH-NLI, WikiText-TL-39