Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages



Introduction
Low-resourced languages receive less attention in natural language processing (NLP) research because they lack the high-quality datasets necessary for training, evaluation, and model implementation. However, the increasing abundance of social media platforms on the internet is changing the game for these languages, giving them more visibility and accessibility. This was not always the case, as a significant amount of NLP research focused on sentiment analysis and other tasks geared toward specific languages with a high online presence, resulting in the creation of techniques and models tailored to their needs (Yimam et al., 2020; Muhammad et al., 2022).
The rise of region-specific competitions such as the AfriSenti-SemEval competition has led to efforts to curate quality datasets sourced from the internet, create new techniques to maximize the use of these datasets for NLP tasks, and investigate the adequacy of existing NLP techniques in catering to the linguistic needs of low-resourced languages. The AfriSenti-SemEval Shared Task 12 provides a corpus of Twitter datasets written in 14 African languages for sentiment analysis tasks (Fröbe et al., 2023). This shared task presents a unique opportunity to advance sentiment analysis development in local languages and help bridge the digital divide in this area. The hope is that this contribution will assist future research and developments in this field.

† These authors contributed equally to this work.
This paper details our submission for the Afrisenti SemEval-2023 Task 12 (Muhammad et al., 2023b), which investigates the effectiveness of pre-trained Afro-centric language models and adapters for sentiment classification tasks. Our work explores the potential of these models and adapters to enhance sentiment analysis performance in the context of African languages and cultures. Our code and all project materials are publicly available 1 .

Background
Sentiment analysis, a crucial task in natural language processing, employs machine learning techniques to identify emotions in text and thus has numerous practical applications in areas such as public health, business, governance, and psychology (Muhammad et al., 2022). The benefits of sentiment analysis are diverse, with its effects felt in almost every aspect of human endeavor (Shode et al., 2022).
While low-resource languages have been neglected in sentiment analysis research, previous works by Yimam et al. (2020), Shode et al. (2022), and Muhammad et al. (2022) have created different sentiment corpora for Nigerian and Ethiopian languages. Although text classification is used for benchmarking most pre-trained language models, such as those of Ogueji et al. (2021) and Alabi et al. (2022), their effectiveness on the downstream task of sentiment classification is yet to be properly explored.
The dataset introduced by Muhammad et al. (2023a) in SemEval 2023 Task 12, the first Afro-centric SemEval shared task, includes 14 African languages divided into three sub-tasks: language-specific, multilingual, and zero-shot competitions. As part of our contribution, our team participated in all sub-tasks of the shared task, explored the effectiveness of several pre-trained networks trained on African languages, and examined the use of parameter-efficient approaches like adapters (Pfeiffer et al., 2020b) for a zero-shot classification task.

Development Phase
In the development phase of the competition, we tested several algorithms. First, we worked with unigram word counts and Tf-idf-normalized word count features. These features use very simple word-counting strategies. Using them, we tested Multinomial Naive Bayes, Multi-Layer Perceptron, and XGBoost classifiers.
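These classical baselines can be sketched with scikit-learn; the snippet below is a toy illustration with made-up tweets and labels, not our exact configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy tweets and sentiment labels, for illustration only.
texts = ["i love this", "great phone", "terrible service", "i hate this"]
labels = ["positive", "positive", "negative", "negative"]

# Unigram counts -> Multinomial Naive Bayes. Swapping CountVectorizer for
# TfidfVectorizer gives the tf-idf variant; MLPClassifier or XGBClassifier
# can replace MultinomialNB for the other baselines.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["i love this phone"]))  # -> ['positive']
```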
The next phase of our experiment focused on pre-trained language models. We worked with Afro-XLMR (small, base) by Alabi et al. (2022), and other language- and task-specific models, such as those of Barbieri et al. (2021), DarijaBERT 2, and twitter-xlm-roberta-base-sentiment 3. We further experimented with the text-to-text work of Jude Ogundepo et al. (2022) and with adapters, as proposed by Pfeiffer et al. (2020a). We found some unusual predictions when experimenting with mT5-based (Xue et al., 2021) Afro-centric models, which was also observed in previous work (Adewumi et al., 2022a,b).
During the development phase of both the language-specific Task A and the multilingual Task B, we trained the previously mentioned models. We also performed data cleaning by removing '@user' tags and eliminating punctuation and emojis. However, we observed no significant improvement after this cleaning process, so we conducted further experiments without cleaning and used the dataset in its original state.
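The cleaning step can be sketched as follows; this is an illustrative reconstruction, not our exact preprocessing code (recall that the final submissions used the uncleaned text):

```python
import re
import string

# Rough emoji ranges (symbols, pictographs, flags); illustrative, not exhaustive.
EMOJI = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)

def clean_tweet(text: str) -> str:
    """Remove @user handles, punctuation, and emojis from a tweet."""
    text = re.sub(r"@\w+", "", text)                                  # drop @user tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = EMOJI.sub("", text)                                        # drop emojis
    return re.sub(r"\s+", " ", text).strip()                          # tidy whitespace

print(clean_tweet("@user I love this! 😍"))  # -> I love this
```

Stripping via Unicode emoji ranges (rather than an ASCII filter) matters here, since the tweets themselves contain non-ASCII African-language text that must be preserved.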
For Task C, we investigated the linguistic similarities between Tigrinya, Oromo, and other languages. To aid in this exploration, we drew from the previous work of Woldeyohannis and Meshesha (2018), who examined the similarities between Amharic and Tigrinya. Leveraging the Amharic dataset we had, we also translated it to Tigrinya using Meta's No Language Left Behind (NLLB) MT model (Costa-jussà et al., 2022), which promises to deliver high-quality translations across 200 languages.
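Amharic-to-Tigrinya translation with NLLB follows the standard Hugging Face seq2seq recipe. The sketch below is schematic: the distilled-600M checkpoint is an assumption (we do not specify a model size here), and running it requires downloading the model.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "facebook/nllb-200-distilled-600M"  # assumed checkpoint size
tok = AutoTokenizer.from_pretrained(CKPT, src_lang="amh_Ethi")  # Amharic source
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def amharic_to_tigrinya(text: str) -> str:
    inputs = tok(text, return_tensors="pt")
    # Force the decoder to start with the Tigrinya language code.
    out = model.generate(
        **inputs,
        forced_bos_token_id=tok.convert_tokens_to_ids("tir_Ethi"),
        max_length=128,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```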
When working with the Oromo language, we were initially uncertain about its linguistic similarities with other languages. To address this, we experimented with languages spoken in the same region and languages within the same family, such as Hausa. After conducting our experiments, we found that training models on Hausa data and then using these models to predict outcomes in Oromo produced better results than training on other languages.
For the zero-shot cross-lingual tasks, we used adapters (Pfeiffer et al., 2020a) while following the two-step procedure proposed by Pfeiffer et al. (2020b): (i) train language-specific adapters using monolingual data, and (ii) train task adapters using the task-specific dataset. For the first step, we trained language-specific adapters for Tigrinya and Oromo using monolingual datasets and AfroXLMR-base (Alabi et al., 2022) as our base model. Since the two languages are similar, we used the Amharic training dataset for the task adapter in the Tigrinya zero-shot task, as described in the previous paragraph. Hausa and Swahili datasets were used to train the task adapters for the Oromo zero-shot task, as there was no similar language to Oromo in the given training dataset.
We used language adapters for Amharic, Hausa, and Swahili from Adelani (2022). We trained the Amharic task adapter using the Amharic language adapter and an Amharic sentiment dataset. After training the Amharic task adapter, we replaced the Amharic language adapter with the Tigrinya language adapter and evaluated the zero-shot performance for Tigrinya. Similarly, for Oromo, we trained Swahili and Hausa task adapters using the Swahili and Hausa language adapters, respectively. After training the task adapters, we replaced both language adapters with the Oromo language adapter and tested their performance on the Oromo zero-shot task.
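This adapter-swap recipe can be written schematically against the AdapterHub `adapters` API (a pseudocode-style sketch: adapter paths and names are placeholders, and the fine-tuning loop is omitted):

```python
from transformers import AutoModelForSequenceClassification
import adapters
from adapters.composition import Stack

model = AutoModelForSequenceClassification.from_pretrained(
    "Davlan/afro-xlmr-base", num_labels=3)
adapters.init(model)                              # enable adapter support

lang = model.load_adapter("adapters/amharic")     # placeholder path
model.add_adapter("sentiment")                    # task adapter
model.train_adapter("sentiment")                  # freeze base model + lang adapter
model.active_adapters = Stack(lang, "sentiment")
# ... fine-tune the task adapter on the Amharic sentiment data ...

tir = model.load_adapter("adapters/tigrinya")     # placeholder path
model.active_adapters = Stack(tir, "sentiment")   # zero-shot Tigrinya inference
```

Only the language adapter in the stack changes between training and zero-shot evaluation; the task adapter is reused as-is.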

Test Phase
In the competition's Task A testing phase, we selected the top three models from the ones we trained during the development phase. After identifying the better-performing models for each language, we submitted our results based on those models. For our fourth submission, we used a voting-based ensemble approach with the previous three models. Finally, for our last submission, we used the best-performing multilingual model trained on all available training and validation data to predict a specific language.
In multilingual Task B, we identified the four best-performing models and submitted four entries using them. Based on the models' observed performance, we employed a voting ensemble approach with the top three models for our final submission. For Task C, we submitted five predictions based on the model's performance during the development phase.
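The voting ensemble used in both tasks amounts to hard majority voting over per-model label predictions. A minimal sketch with toy labels (the tie-breaking rule here, first label seen, is an illustrative choice):

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: for each example, take the most common label
    across models; ties go to the label encountered first."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Toy per-model predictions over three examples.
model_a = ["positive", "negative", "neutral"]
model_b = ["positive", "positive", "neutral"]
model_c = ["negative", "negative", "neutral"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['positive', 'negative', 'neutral']
```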

Dataset and Evaluation Metrics
During the competition's development phase, we were given only training data with labels and development data without labels. The training dataset came from the AfriSenti dataset, a corpus of 14 African languages scraped from Twitter for sentiment analysis tasks. To perform experiments repeatedly on the dataset, we created our own training and evaluation splits for development, treating the provided development data as a test set. This enabled us to test the list of models we had planned to experiment with. Due to the computationally intensive nature of training, we filtered the best models based on these created splits, and the promising models were used in the test phase. We utilized the weighted F1 score as the evaluation metric for our models.
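The weighted F1 score averages per-class F1 weighted by class support, so larger classes contribute proportionally more. A small illustration with made-up labels using scikit-learn:

```python
from sklearn.metrics import f1_score

# Toy gold and predicted labels over three sentiment classes.
y_true = ["pos", "neg", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "neu", "neg", "neg"]

# Per-class F1: neg = 2/3 (support 3), neu = 1.0 (support 1), pos = 0.5 (support 2);
# weighted average = (3 * 2/3 + 1 * 1.0 + 2 * 0.5) / 6 = 2/3.
score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 4))  # -> 0.6667
```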
As shown in Figure 1, the dataset is imbalanced across all languages. This problem is more pronounced when the language datasets are joined to form the multilingual data. To resolve this, during our development phase we attempted label-based and language-based balancing, which involved repetitively sampling data with low frequency. However, these experiments did not yield any significant improvement, so we did not include them in our final training for submission.
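The repetitive-sampling idea can be sketched as random oversampling of minority labels until every label reaches the majority count; this is a simplified stand-in for our balancing experiments, and all names and data are illustrative:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Repeat randomly chosen minority-label examples until every label
    matches the majority label's count (label-based balancing)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    out_texts, out_labels = list(texts), list(labels)
    for label, examples in by_label.items():
        for _ in range(target - counts[label]):   # duplicates needed for this label
            out_texts.append(rng.choice(examples))
            out_labels.append(label)
    return out_texts, out_labels

texts = ["t1", "t2", "t3", "t4"]
labels = ["pos", "pos", "pos", "neg"]
_, balanced = oversample(texts, labels)
print(Counter(balanced))  # pos and neg both have count 3
```

Language-based balancing follows the same pattern with language tags in place of sentiment labels.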

Training
To conduct experiments using pre-trained BERT models, we utilized the Huggingface 4 and PyTorch library 5 . Our implementation of the sentiment analysis models was based on code from the Afrisenti-SemEval GitHub repository 6 . We made modifications to the code to enable repeated training for all models and performed data processing specific to the provided data. Additionally, we explored the use of text-to-text models from the Afriteva GitHub repository 7 and employed adapter-hub implementations 8 for our adapter-related experiments.
We used existing hyper-parameters for most of our experiments, except that we trained the pre-trained language models for 10 epochs during the testing phase. We selected the weights from the epoch with the highest evaluation F1 score for submission.
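The checkpoint-selection rule is a simple arg-max over the per-epoch evaluation history (the field names below are illustrative, not our exact logging format):

```python
def best_checkpoint(history):
    """Return the epoch record with the highest evaluation F1 score."""
    return max(history, key=lambda epoch: epoch["eval_f1"])

# Illustrative evaluation history over 10-epoch fine-tuning (truncated).
history = [
    {"epoch": 1, "eval_f1": 0.61},
    {"epoch": 2, "eval_f1": 0.67},
    {"epoch": 3, "eval_f1": 0.64},
]
print(best_checkpoint(history)["epoch"])  # -> 2
```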
For our final submissions, we combined as much data as possible for training. In task A, we used language-specific training and gold label data, while in task B, we utilized all available training and development datasets. Unfortunately, our experiment aimed at incorporating language-specific tags was unsuccessful, as the dataset did not provide the necessary language tags.

Task A
Our final submission for each of the monolingual datasets was the result of the LaBSE multilingual model on each language, as described in Section 3.2. Although this model did not give us the best F1 score for all languages, it worked well for Amharic and Xitsonga, resulting in our team ranking 6th out of 29 and 8th out of 31, respectively, on the leaderboard. DziriBERT produced the best F1 score for Moroccan Arabic/Darija and Algerian Arabic, ranking our team 14th out of 32 and 19th out of 30 participants, respectively.
ranking 31st out of 35 and 16th out of 30 participants. AfriBerta-Large produced the best F1 score for Igbo, Yoruba, and Twi languages resulting in our team ranking 26th out of 32, 23rd out of 33, and 21st out of 31 for each language, respectively. The ensemble approach on LaBSE, Afro-XLMR-base, and Bernice model produced the best prediction for the Nigerian Pidgin language, resulting in our team ranking 16th out of 32. Twitter-XLM-Roberta and LaBSE models predicted the best F1 scores for Mozambican Portuguese and Kinyarwanda, resulting in our team ranking 13th out of 30, and 20th out of 34, respectively. The overview of the F1 scores for each of the models we considered can be found in Table 1.

Task B
For the multilingual sentiment classification task, our best model was an ensemble of the AfroXLMR-base, LaBSE multilingual, and twitter-xlm-roberta-base-sentiment models. This ensemble placed our team 9th out of 33. The models considered and their F1 scores for this task are presented in Table 2.

Task C
For the zero-shot classification on Tigrinya (Table 3), our last submitted model was an ensemble of AfroXLMR trained on Amharic, multilingual AfroXLMR, and multilingual AfriBerta. This placed our team 18th out of 28. However, our best model was the multilingual AfroXLMR, which produced an F1 score of 61.48%, as opposed to our submitted ensemble model's 57.99%. For the same task on the Oromo language, an ensemble of AfroXLMR trained on the multilingual dataset, AfriBerta trained on the multilingual dataset, and the adapter model produced the best F1 score during our training process. This resulted in our team ranking 10th out of 29.

Conclusion
In this paper, we presented our submission for the AfriSenti-SemEval Shared Task 12 of SemEval-2023, focusing on sentiment classification in monolingual, multilingual, and zero-shot settings for low-resource African languages. We explored Afro-centric, language-specific, and general pre-trained language models for fine-tuning. Based on the test results, Afro-centric language models showed better performance for most languages in monolingual sentiment classification; however, language-specific pre-trained language models performed better than Afro-centric ones for Algerian Arabic and Moroccan Darija. For multilingual sentiment classification, Afro-centric language models showed promising results. For the zero-shot task, we see that adapters show promising results for related languages, while Afro-centric language models show better performance for unrelated languages.
We would like to further our research on adapters for low-resource languages, as they show promising results in the zero-shot setting for related languages.