A Qualitative Inquiry into the South African Language Identifier’s Performance on YouTube Comments.

Nkazimlo N. Ngcungca, Johannes Sibeko, Sharon Rudman


Abstract
The South African Language Identifier (SA-LID) has proven to be a valuable tool for data analysis in the multilingual context of South Africa, particularly in governmental texts. However, its suitability for broader projects has yet to be determined. This paper aims to assess the performance of the SA-LID in identifying isiXhosa in YouTube comments as part of the methodology for research on the expression of cultural identity through linguistic strategies. We curated a selection of 10 videos which focused on the isiXhosa culture in terms of theatre, poetry, language learning, culture, or music. The videos were predominantly in English as were most of the comments, but the latter were interspersed with elements of isiXhosa, identifying the commentators as speakers of isiXhosa. The SA-LID was used to identify all instances of the use of isiXhosa to facilitate the analysis of the relevant items. Following the application of the SA-LID to this data, a manual evaluation was conducted to gauge the effectiveness of this tool in selecting all isiXhosa items. Our findings reveal significant limitations in the use of the SA-LID, encompassing the oversight of unconventional spellings in indigenous languages and misclassification of closely related languages within the Nguni group. Although proficient in identifying the use of Nguni languages, differentiating within this language group proved challenging for the SA-LID. These results underscore the necessity for manual checks to complement the use of the SA-LID when other Nguni languages may be present in the comment texts.
Anthology ID:
2024.rail-1.6
Volume:
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rooweither Mabuya, Muzi Matfunjwa, Mmasibidi Setaka, Menno van Zaanen
Venues:
RAIL | WS
Publisher:
ELRA and ICCL
Pages:
45–54
URL:
https://aclanthology.org/2024.rail-1.6
Cite (ACL):
Nkazimlo N. Ngcungca, Johannes Sibeko, and Sharon Rudman. 2024. A Qualitative Inquiry into the South African Language Identifier’s Performance on YouTube Comments. In Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024, pages 45–54, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Qualitative Inquiry into the South African Language Identifier’s Performance on YouTube Comments. (Ngcungca et al., RAIL-WS 2024)
PDF:
https://aclanthology.org/2024.rail-1.6.pdf