Zola Mahlaza

2025

IsiZulu noun classification based on replicating the ensemble approach for Runyankore
Zola Mahlaza | C. Maria Keet | Imaan Sayed | Alexander Van Der Leek
Proceedings of the First Workshop on Language Models for Low-Resource Languages

A noun’s class is a crucial component in NLP, because it governs agreement across the sentence in Niger-Congo B (NCB) languages, among others. The phenomenon is ill-documented in most NCB languages, or documented in a non-reusable format, such as a printed dictionary subject to copyright restrictions. Byamugisha (2022) proposed a promising data-driven approach for Runyankore that combined syntax and semantics. The code and data are inaccessible, however, and it remains to be seen whether the approach is suitable for other NCB languages. We aimed to reproduce Byamugisha’s experiment, but for isiZulu. We conducted this as two independent experiments, so that we could also subject it to a meta-analysis. Results showed that it was reproducible only in part, mainly due to imprecision in the original description and the current impossibility of generating the same kind of source data set from an existing grammar. The different choices made in attempting to reproduce the pipeline, as well as differences in the choice of training and test data, had a large effect on the eventual accuracy of noun class disambiguation, but could produce accuracies in the same range as for Runyankore: 80-85%.

On the Usage of Semantics, Syntax, and Morphology for Noun Classification in IsiZulu
Imaan Sayed | Zola Mahlaza | Alexander van der Leek | Jonathan Mopp | C. Maria Keet
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

There is limited work aimed at solving the core task of noun classification for Nguni languages. The task focuses on identifying the semantic categorisation of each noun and plays a crucial role in the ability to form semantically and morphologically valid sentences. The work by Byamugisha (2022) was the first to tackle the problem for a related, but non-Nguni, language. While there have been efforts to replicate it for a Nguni language, there has been no effort focused on comparing the technique used in the original work vs. contemporary neural methods or a number of traditional machine learning classification techniques that do not rely on human-guided knowledge to the same extent. We reproduce Byamugisha (2022)’s work with different configurations to account for differences in access to datasets and resources, and compare the approach with a pre-trained transformer-based model and traditional machine learning models that rely on less human-guided knowledge. The newly created data-driven models outperform the knowledge-infused models, with the best performing models achieving an F1 score of 0.97.

2024

ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines
Zola Mahlaza | Toky Hajatiana Raboanary | Kyle Seakgwa | C. Maria Keet
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

The reproduction of Natural Language Processing (NLP) studies is important in establishing their reliability. Nonetheless, many papers in NLP have never been reproduced. This paper presents a reproduction of Gabriel et al. (2022)’s work to establish the extent to which their findings, pertaining to the utility of large language models (T5 and GPT2) to automatically generate writer’s intents when given headlines to curb misinformation, can be confirmed. Our results show no evidence to support two of their four findings and they partially support the rest of the original findings. Specifically, while we confirmed that all the models are judged to be capable of influencing readers’ trust or distrust, there was a difference in T5’s capability to reduce trust. Our results show that its generations are more likely to have greater influence in reducing trust while Gabriel et al. (2022) found more cases where they had no impact at all. In addition, most of the model generations are considered socially acceptable only if we relax the criteria for determining a majority to mean more than chance rather than the apparent > 70% of the original study. Overall, while they found that “machine-generated MRF implications alongside news headlines to readers can increase their trust in real news while decreasing their trust in misinformation”, we found that they are more likely to decrease trust in both cases vs. having no impact at all.

Automatically Generating IsiZulu Words From Indo-Arabic Numerals
Zola Mahlaza | Tadiwa Magwenzi | C. Maria Keet | Langa Khumalo
Proceedings of the 17th International Natural Language Generation Conference

Artificial conversational agents are deployed to assist humans in a variety of tasks. Some of these tasks require the capability to communicate numbers as part of their internal and abstract representations of meaning, such as for banking and scheduling appointments. They currently cannot do so for isiZulu because no algorithms exist for it: speech and text data are lacking, and the transformation is complex and may depend on the type of noun that is counted. We solved this by extracting and iteratively improving on the rules for speaking and writing numerals as words and creating two algorithms to automate the transformation. Evaluation of the algorithms by two isiZulu grammarians showed that six out of seven number categories were 90-100% correct. The same software was used with an additional set of rules to create a large monolingual text corpus, made up of 771 643 sentences, to enable future data-driven approaches.

2020

OWLSIZ: An isiZulu CNL for structured knowledge validation
Zola Mahlaza | C. Maria Keet
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

In iterative knowledge elicitation, engineers are expected to be directly involved in validating the already captured knowledge and obtaining new knowledge increments, thus making the process time-consuming. Languages such as English have controlled natural languages that can be repurposed to generate natural language questions from an ontology in order to allow a domain expert to independently validate the contents of an ontology without understanding an ontology authoring language such as OWL. IsiZulu, South Africa’s main L1 language by number of speakers, does not have such a resource; hence, it is not possible to build a verbaliser to generate such questions. Therefore, we propose an isiZulu controlled natural language, called OWL Simplified isiZulu (OWLSIZ), for producing grammatical and fluent questions from an ontology. Human evaluation of the generated questions showed that participants’ judgements agree that most (83%) questions are positive for grammaticality or understandability.
