Alexander Van Der Leek
2025
IsiZulu noun classification based on replicating the ensemble approach for Runyankore
Zola Mahlaza | C. Maria Keet | Imaan Sayed | Alexander Van Der Leek
Proceedings of the First Workshop on Language Models for Low-Resource Languages
A noun’s class is a crucial component in NLP, because it governs agreement across the sentence in Niger-Congo B (NCB) languages, among others. Noun classes are poorly documented in most NCB languages, or documented only in a non-reusable format, such as a printed dictionary subject to copyright restrictions. Byamugisha (2022) proposed a promising data-driven approach for Runyankore that combines syntax and semantics. The code and data are inaccessible, however, and it remains to be seen whether the approach is suitable for other NCB languages. We aimed to reproduce Byamugisha’s experiment for isiZulu. We conducted this as two independent experiments so that we could also subject the reproduction to a meta-analysis. Results showed that the experiment was reproducible only in part, mainly due to imprecision in the original description and because it is currently impossible to generate the same kind of source data set from an existing grammar. The different choices made in attempting to reproduce the pipeline, as well as differences in the choice of training and test data, had a large effect on the eventual accuracy of noun class disambiguation, but could produce accuracies in the same range as for Runyankore: 80-85%.