Developing Bilingual English-Setswana Datasets for Space Domain

Tebatso G. Moape, Sunday Olusegun Ojo, Oludayo O. Olugbara


Abstract
In the current digital age, languages that lack a digital presence face an imminent risk of extinction. The absence of digital resources also poses a significant obstacle to the development of Natural Language Processing (NLP) applications for such languages. Developing digital language resources therefore contributes to the preservation of these languages and enables application development. This paper contributes to the ongoing efforts to develop language resources for South African languages, with a specific focus on Setswana, and presents a new English-Setswana bilingual dataset for the space domain. The dataset was constructed using the expansion method: a subset of space-domain English synsets from Princeton WordNet was professionally translated into Setswana. The initial submission of translations had an accuracy rate of 99% before validation. After validation, repeated revisions and discussions between translators and validators led to unanimous agreement, ultimately achieving a 100% accuracy rate. The final version of the resource was converted to XML, a machine-readable format that provides a structured hierarchy for organizing linguistic data.
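As a rough illustration of the final step described in the abstract, the sketch below shows one way a bilingual synset list could be serialized to a hierarchical XML file and queried afterwards. It is not the authors' schema: the element names (`dataset`, `synset`, `lemma`), the attribute names, the placeholder ids, and the two sample entries are all assumptions made for this example, and `tn` is used as the ISO 639-1 code for Setswana.

```python
import xml.etree.ElementTree as ET

# Illustrative entries only; they are not taken from the published dataset.
# The ids are hypothetical placeholders, not Princeton WordNet identifiers.
ENTRIES = [
    {"id": "space-0001", "pos": "n", "en": "moon", "tn": "ngwedi"},
    {"id": "space-0002", "pos": "n", "en": "star", "tn": "naledi"},
]


def build_dataset(entries):
    """Build an XML tree with one <synset> element per bilingual entry."""
    root = ET.Element("dataset", {"domain": "space", "source": "en", "target": "tn"})
    for entry in entries:
        synset = ET.SubElement(root, "synset", {"id": entry["id"], "pos": entry["pos"]})
        # One <lemma> child per language, tagged with its language code.
        ET.SubElement(synset, "lemma", {"lang": "en"}).text = entry["en"]
        ET.SubElement(synset, "lemma", {"lang": "tn"}).text = entry["tn"]
    return root


def to_xml(root):
    """Serialize the tree to a Unicode XML string."""
    return ET.tostring(root, encoding="unicode")


def lookup(root, english):
    """Return the Setswana lemma paired with an English lemma, or None."""
    for synset in root.iter("synset"):
        lemmas = {lemma.get("lang"): lemma.text for lemma in synset.findall("lemma")}
        if lemmas.get("en") == english:
            return lemmas.get("tn")
    return None


root = build_dataset(ENTRIES)
xml_text = to_xml(root)
```

Calling `lookup(root, "moon")` on this toy tree returns the paired Setswana lemma. Keeping both lemmas under a single `synset` element mirrors the expansion method, in which the English synset structure is kept and only the lexical content is translated.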
Anthology ID:
2024.rail-1.4
Volume:
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rooweither Mabuya, Muzi Matfunjwa, Mmasibidi Setaka, Menno van Zaanen
Venues:
RAIL | WS
Publisher:
ELRA and ICCL
Pages:
32–36
URL:
https://aclanthology.org/2024.rail-1.4
Cite (ACL):
Tebatso G. Moape, Sunday Olusegun Ojo, and Oludayo O. Olugbara. 2024. Developing Bilingual English-Setswana Datasets for Space Domain. In Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024, pages 32–36, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Developing Bilingual English-Setswana Datasets for Space Domain (Moape et al., RAIL-WS 2024)
PDF:
https://aclanthology.org/2024.rail-1.4.pdf