Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset

Gabriel de Jesus, Sérgio Nunes


Abstract
This paper introduces Labadain-30k+, a monolingual dataset comprising 33.6k documents in Tetun, a low-resource language spoken in Timor-Leste. The dataset was acquired through web crawling and augmented with Wikipedia documents released by Wikimedia. Both sets of documents underwent thorough manual audits at the document level by native Tetun speakers, resulting in the construction of a Tetun text dataset well-suited for a variety of natural language processing and information retrieval tasks. This dataset was employed to conduct a comprehensive content analysis aimed at providing a nuanced understanding of document composition and the evolution of Tetun documents on the web. The analysis revealed that news articles constitute the predominant documents within the dataset, accounting for 89.87% of the total, followed by Wikipedia documents at 4.34%, and legal and governmental documents at 3.65%, among others. Notably, there was a substantial increase in the number of documents in 2020, indicating 11.75 percentage points rise in document quantity, compared to an average of 4.76 percentage points per year from 2001 to 2023. Moreover, the year 2017, marked by the increased popularity of online news in Tetun, served as a threshold for analyzing the evolution of document writing on the web pre- and post-2017, specifically regarding vocabulary usage. Surprisingly, this analysis showed a significant increase of 6.12 percentage points in the Tetun written adhering to the Tetun official standard. Additionally, the persistence of Portuguese loanwords in that trajectory remained evident, reflecting an increase of 5.09 percentage points.
Anthology ID:
2024.sigul-1.22
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
177–188
Language:
URL:
https://aclanthology.org/2024.sigul-1.22
DOI:
Bibkey:
Cite (ACL):
Gabriel de Jesus and Sérgio Nunes. 2024. Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 177–188, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset (de Jesus & Nunes, SIGUL-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.sigul-1.22.pdf