SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)

We present the first Africentric SemEval shared task, Sentiment Analysis for African Languages (AfriSenti-SemEval). The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023. AfriSenti-SemEval is a sentiment classification challenge in 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá (Muhammad et al., 2023), using data labeled with three sentiment classes. We present three subtasks: (1) Task A: monolingual classification, which received 44 submissions; (2) Task B: multilingual classification, which received 32 submissions; and (3) Task C: zero-shot classification, which received 34 submissions. The best performance for Tasks A and B was achieved by the NLNDE team with 71.31 and 75.06 weighted F1, respectively. UCAS-IIE-NLP achieved the best average score for Task C with 58.15 weighted F1. We describe the approaches adopted by the top 10 systems.


Introduction
Sentiment analysis is a prominent sub-field of Natural Language Processing that focuses on the automatic identification of sentiments or opinions expressed through online content, such as social media posts, blogs, or reviews (Liu, 2020; Mohammad, 2021; Nakov et al., 2016). Example applications are the computational analysis of emotions in language, which has been applied to literary analysis and culturomics (Mohammad, 2011; Reagan et al., 2016; Hamilton et al., 2016); commercial use (e.g., tracking opinions towards products); and research in psychology and social science (Dodds et al., 2015; Mohammad et al., 2016). Despite the tremendous amount of work on sentiment analysis over the last two decades, little work has been conducted on under-represented languages in general and African languages in particular.
Africa has a long and rich linguistic history, experiencing language contact, language expansion, development of trade languages, language shift, and language death on several occasions. The continent is incredibly linguistically diverse and home to over 2000 languages, including 75 languages with at least one million speakers each. Africa also has a rich tradition of storytelling, poems, songs, and literature (Carter-Black, 2007; Banks-Wallace, 2002). Yet, it is only in recent years that there has been nascent interest in NLP research for African languages, including Named Entity Recognition (NER; Adelani et al., 2021, 2022c; Jibril and Tantug, 2023), Machine Translation (MT; Nekoto et al., 2020; Abdulmumin et al., 2022; Adelani et al., 2022b; Belay et al., 2022), and Language Identification (LID; Adebara et al., 2022a). However, African sentiment analysis has not yet received comparable attention. Similarly, although sentiment analysis is a common task at SemEval (see example tasks in Figure 2), previous tasks have mainly focused on high-resource languages.
To this end, we present AfriSenti-SemEval, a shared task in the 2023 edition of the Semantic Evaluation workshop (Ojha et al., 2023). AfriSenti-SemEval targets sentiment analysis in low-resource African languages. We provide researchers interested in African NLP with 110K sentiment-labeled tweets that were collected using the Twitter API. These tweets are in 14 languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families (Afro-Asiatic, English Creole, Indo-European, and Niger-Congo). The annotations were conducted by native speakers of the respective languages. Besides making the annotated dataset public, we also share sentiment lexicons for most of the languages.
AfriSenti-SemEval 2023 consists of 15 tracks from three sub-tasks on the 14 collected datasets, as illustrated in Figure 1. We received submissions from 44 teams, with 29 submitting a system description paper. The top-ranked teams for the different sub-tasks used pre-trained language models (PLMs). In particular, AfroXLMR (Alabi et al., 2022), an Africa-centric model, was the best-performing model in both Task A (monolingual) and Task B (multilingual), with average weighted F1 scores of 71.30% and 75.06%, respectively. For Task C, sentiment lexicons were used to build a lexicon-based multilingual BERT, which performed best in this setting with an average weighted F1 of 58.15%.
Recent work in sentiment analysis has focused on sub-tasks that tackle new challenges, including aspect-based, multimodal (Liang et al., 2022), explainable (Cambria et al., 2022), and multilingual sentiment analysis (Muhammad et al., 2022). On the other hand, standard sentiment analysis sub-tasks such as polarity classification (positive, negative, neutral) are widely considered saturated and almost solved (Poria et al., 2020), with an accuracy of 97.5% in certain domains (Raffel et al., 2020; Jiang et al., 2020). However, while this may be true for high-resource languages in relatively clean, long-form text domains such as movie reviews, noisy user-generated data in low-resource languages still presents a challenge (Yimam et al., 2020). Additionally, African languages exhibit new challenges for sentiment analysis, such as dealing with tone, code-switching, and digraphia (Adebara and Abdul-Mageed, 2022). Thus, further research is necessary to assess the efficacy of existing NLP techniques and to develop solutions for language-specific challenges in African contexts. SemEval, with its widespread recognition and popularity, is an ideal venue for a shared task on sentiment analysis in African languages.
The SemEval competition has become the de facto venue for sentiment analysis shared tasks, featuring at least one task per year, as shown in Figure 2. Some tasks focused on three-way sentiment classification (positive, negative, or neutral), while others explored more fine-grained aspect-based sentiment analysis (ABSA; Rosenthal et al., 2014, 2015; Nakov et al., 2016; Patwa et al., 2020). Additionally, there are other closely related tasks, including the Affect in Tweets task, which involves inferring the perceived emotional state of a person from their tweet (Mohammad et al., 2018), and stance detection (Mohammad et al., 2016), which refers to the automatic identification of the stance of an author towards a target from text, where the stance can be in favor, against, or neutral. Finally, structured sentiment analysis requires participants to predict the sentiment graphs present in a text. Each sentiment graph comprises a sentiment holder, a target, an expression, and a polarity (Barnes et al., 2022).

Task Description and Settings
The AfriSenti-SemEval shared task consists of three sub-tasks: A) monolingual sentiment classification, B) multilingual sentiment classification, and C) zero-shot sentiment classification. As shown in Figure 1, each sub-task also includes one or more tracks depending on the languages involved. Participants were free to participate in one or more sub-tasks and one or more tracks for each chosen subtask.

Task A: Monolingual Sentiment Classification
Given a training set in a language, determine the polarity (positive/negative/neutral) of tweets in the same language. If a tweet conveys both positive and negative sentiments, the stronger sentiment should be chosen. This sub-task involves 12 tracks (all languages except Oromo and Tigrinya), with one track per language.

Task B: Multilingual Sentiment Classification
Given combined training data in the 12 languages of Task A, determine the polarity of tweets in a combined test set of the same languages. This sub-task has one track.

Task C: Zero-Shot Sentiment Classification
Given unlabelled tweets in two African languages (Tigrinya and Oromo), use any of the training datasets of Task A to determine the sentiment of a tweet in the two target languages. This sub-task has two tracks (Tigrinya and Oromo).

Pilot Dataset
We released the pilot datasets for our SemEval shared task one month before the start of the shared task. The pilot datasets allowed the participants to have a better understanding of the shared task (i.e., the datasets, the languages involved, and the labels).

Task Settings
The AfriSenti-SemEval shared task consisted of two phases: (1) the development phase and (2) the evaluation phase. In the development phase, we released a training set with gold labels and a development set without gold labels. Participants trained their models on the training set, tested them on the development set, and submitted their predictions on the CodaLab competition page for evaluation. The task offers a prize to the best-performing team in each of the three sub-tasks (A, B, and C) in each of the following leagues: (1) African League, for teams with at least one African member, to encourage African participation; (2) Students League, for Master's and undergraduate students only; and (3) Worldwide League, open to all participants.

Dataset and Lexicon
The AfriSenti collection covers 14 African languages, each with unique linguistic characteristics, writing systems, and language families, as shown in Table 1. The dataset covers four of the five African sub-regions and includes the top three languages with the largest numbers of speakers in Africa (Swahili, Amharic, and Hausa). The datasets include tweets collected using location-based and vocabulary-based (i.e., stopwords, sentiment lexicons, or language-specific terms) heuristics. Figure 3 shows the label distribution for the datasets.

Early work on sentiment analysis in SemEval showed that sentiment lexicons can be leveraged and combined with training data and machine learning algorithms to obtain marked improvements in accuracy (Mohammad et al., 2013; Kiritchenko et al., 2014). Therefore, we also provide manually annotated sentiment lexicons in African languages. For languages that do not have manually curated lexicons, we translated existing lexicons into the target languages. Table 1 provides details on the sentiment lexicons in AfriSenti and indicates whether they were manually created or translated.

Evaluation
All three tasks in the AfriSenti-SemEval shared task required participants to perform three-way sentiment (negative, neutral, positive) classification. To evaluate the performance of the submitted systems, we used the weighted F1 score as the evaluation metric. Weighted F1 computes the F1 score for each label and averages them, weighting each label's score by the number of true instances of that label; this adjusts the 'macro' average to account for label imbalance. We also provided the evaluation script to the participants to ensure consistency in the evaluation process.
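As a concrete reference, the metric can be computed in a few lines of Python. This is a plain re-implementation for illustration; scikit-learn's `f1_score(y_true, y_pred, average='weighted')` computes the same quantity.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-label F1 scores averaged, weighted by each label's support
    (its number of true instances)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in set(y_true) | set(y_pred):
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[label] / total) * f1
    return score

y_true = ["positive", "positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "neutral", "negative", "positive"]
print(round(weighted_f1(y_true, y_pred), 4))  # -> 0.8333
```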
We also created baseline systems for all three sub-tasks using multilingual pre-trained language models (PLMs). The baseline systems are: (1) monolingual baseline models based on multilingual PLMs for the 12 AfriSenti languages with training data; (2) models with multilingual training on all 12 languages, and evaluation on the combined test data of all 12 languages; and (3) zero-shot transfer of models to Oromo (orm) and Tigrinya (tir) from any of the 12 languages with available training data. Our best baseline model is shown in Table 2 and is based on fine-tuning AfroXLMR-large (Alabi et al., 2022) on all three sub-tasks. For more information on the baseline experimental results, please refer to the AfriSenti dataset paper (Muhammad et al., 2023).

Table 2: Top 10 submissions for Tasks A, B, and C. We only ranked systems with corresponding paper submissions. See Table 6 in Appendix C for paper information and teams' affiliations. * shows the baseline result from the AfriSenti dataset paper (Muhammad et al., 2023).

Participating Systems and Results
The AfriSenti-SemEval competition had 213 registered participants on the CodaLab competition website. Of these, 44 teams submitted their systems during the evaluation phase. Out of the 44 submissions, 29 submitted system-description papers.
As participants could take part in one or more tasks, certain tasks received more submissions than others. Specifically, Task A (monolingual classification) had the highest number of participants with 44 submissions, followed by Task C (zero-shot classification) with 34 submissions and Task B (multilingual classification) with 33 submissions. The majority of the teams participated in all tracks of each task, with 24 teams participating in at least 13 out of 15 tracks. For example, team NLNDE participated in all 12 tracks in Task A, one track in Task B, and two tracks in Task C. To rank the best-performing teams in each task and provide a comparison for future work, we rank each of the top-10 teams that participated in all tracks in a given task based on their average performance, as shown in Table 2. Table 3, Table 4, and Table 5 present the overall results of participating systems for Task A (monolingual), Task B (multilingual), and Task C (zero-shot), respectively. Table 6 in Appendix C presents information regarding the teams and their affiliations. In the following sections, we describe the best systems in each sub-task.

Subtask A: Monolingual Sentiment Classification Systems
We describe the top-10 teams that submitted system description papers as highlighted in Table 2.
NLNDE (Wang et al., 2023) used language adaptive pre-training (LAPT) and task adaptive pre-training (TAPT) as additional pre-training steps on AfroXLMR-large. The LAPT approach involved continued pre-training of the PLM on the monolingual portion of the Leipzig Corpus Collection (Goldhahn et al., 2012) (covering Wikipedia, Community, Web, and News corpora) for the target language. TAPT involved continued pre-training on the AfriSenti training data of the target language. By applying LAPT followed by TAPT, they achieved significant improvements over fine-tuning AfroXLMR-large directly. NLNDE ranked first in 7 out of 12 languages and first overall in sub-task A.
PALI (Jin et al., 2023)

UCAS-IIE-NLP (Hu et al., 2023) used a lexicon-based multilingual transformer model based on AfroXLMR-base to facilitate language adaptation and sentiment-aware representation learning. Additionally, they applied a supervised adversarial contrastive learning strategy to improve the sentiment-spread representations and enhance model generalization. On average, their approach performed worse than the AfriSenti baseline, likely because they used AfroXLMR-base rather than the large version. Interestingly, they achieved much better results than the baseline on Amharic, Xitsonga, and Yorùbá, by over 3 F1 points.
HausaNLP (Salahudeen et al., 2023) used two BERT-based models: AfroXLMR-large and an Arabic BERT (Inoue et al., 2021) fine-tuned on a sentiment corpus. They used AfroXLMR-large for all languages except the Arabic dialects. On average across the 12 languages, the HausaNLP system ranked lower than the AfriSenti paper baseline.
UIO (Rønningstad, 2023)  Portuguese. In general, their best systems ranked lower than the AfriSenti baseline.

Apart from the top-10 teams that submitted papers, other teams worked on only one or a few languages and achieved excellent rankings. For instance, KINLP (Nzeyimana, 2023) only attempted the task for Kinyarwanda. Their approach was based on KinyaBERT (Nzeyimana and Niyongabo Rubungo, 2022), a Kinyarwanda PLM that incorporates morphological features of the language during pre-training. KINLP ranked second for Kinyarwanda. Bhattacharya_Lab (Hughes et al., 2023) only worked on Nigerian Pidgin and Yorùbá. They pre-trained a RoBERTa-style (Liu et al., 2019) transformer architecture jointly on the two languages using the AfriBERTa training corpus and AfriSenti data. Bhattacharya_Lab ranked first for Nigerian Pidgin and fifth for Yorùbá.

Subtask B: Multilingual Sentiment Classification Systems
Most teams used the same model as in sub-task A for sub-task B with minor changes. We highlight here the teams that used strategies beyond jointly training a PLM on the concatenation of all 12 sub-task A languages. We describe the top-10 teams with corresponding system description papers, as shown in Table 2; the complete results for this sub-task are shown in Table 4.
NLNDE (Wang et al., 2023) For each target language, they first chose the best source languages for multilingual training to prevent harmful interference from dissimilar languages. To select the source language set, they performed forward and backward source language selection, similar to feature selection approaches (Tsamardinos and Aliferis, 2003; Borboudakis and Tsamardinos, 2019). Forward selection starts with an empty set of languages and adds languages to it, while backward selection starts with the complete set of languages and then excludes languages from it. For example, the best source languages for multilingual training for Hausa using forward selection are Kinyarwanda, Twi, Algerian Arabic, and Nigerian Pidgin. For Yorùbá, the best source languages according to backward selection are Kinyarwanda, Xitsonga, Twi, and Algerian Arabic. Rather than a single model, NLNDE thus used multiple models for this task; the target language of a tweet determined which model was applied. NLNDE ranked first in this sub-task.
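The two selection procedures can be sketched as greedy searches over source-language sets. The `score` function below is a hypothetical stand-in: in NLNDE's actual setup it would train a multilingual model on the chosen source languages and return dev-set weighted F1 on the target language.

```python
def forward_select(candidates, target, score):
    """Start from an empty source set; repeatedly add the language that
    most improves the target-language dev score, until no addition helps."""
    selected, best = [], score([], target)
    while True:
        gains = {l: score(selected + [l], target)
                 for l in candidates if l not in selected}
        if not gains or max(gains.values()) <= best:
            return selected, best
        choice = max(gains, key=gains.get)
        selected.append(choice)
        best = gains[choice]

def backward_select(candidates, target, score):
    """Start from the full source set; repeatedly drop the language whose
    removal most improves the target-language dev score."""
    selected = list(candidates)
    best = score(selected, target)
    while len(selected) > 1:
        scores = {l: score([s for s in selected if s != l], target)
                  for l in selected}
        if max(scores.values()) <= best:
            break
        drop = max(scores, key=scores.get)
        selected.remove(drop)
        best = scores[drop]
    return selected, best

# Toy additive scorer for demonstration only (real scores come from
# training runs and are not additive); language codes are illustrative.
GAIN = {"kin": 0.05, "twi": 0.03, "arq": 0.02, "pcm": 0.01, "por": -0.02}
def toy_score(sources, target):
    return 0.60 + sum(GAIN[s] for s in sources)

fwd, f1 = forward_select(list(GAIN), "hau", toy_score)
print(sorted(fwd), round(f1, 2))  # -> ['arq', 'kin', 'pcm', 'twi'] 0.71
```

With this toy scorer, forward and backward selection converge on the same set; on real dev scores the two directions can disagree, which is why NLNDE report different source sets per direction.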
DN (Homskiy and Maloyan, 2023) fine-tuned AfroXLMR-large on the 12 languages available in the training data. They performed additional pre-processing on the tweets before training, i.e., they removed links, hashtags, and @mentions, which boosted the performance of their system over systems trained on a single multilingual model on all 12 languages without such pre-processing. DN ranked third in this sub-task.
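A minimal sketch of this kind of tweet cleaning, assuming simple regular-expression rules (the team's exact pre-processing steps may differ):

```python
import re

def clean_tweet(text):
    """Strip links, #hashtags, and @mentions, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # @mentions and #hashtags
    return re.sub(r"\s+", " ", text).strip()

# Illustrative Hausa example tweet
print(clean_tweet("@user Ina son wannan waka! #Arewa https://t.co/xyz"))
# -> Ina son wannan waka!
```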
GMNLP, unlike in sub-task A, did not use phylogeny-based adapter fine-tuning for this sub-task due to the absence of language ID information. They only performed task adapter training.
HausaNLP used the same approach as in sub-task A. They used AfroXLMR-large for multilingual training, which had previously been fine-tuned on MasakhaNER 2.0 (Adelani et al., 2022c).
NLP-LISAC used the same approach described in sub-task A. They chose the mDeBERTaV3 PLM to fine-tune on the multilingual corpus.
The other teams, i.e., UM6P, Masakhane-AfriSenti, UCAS-IIE-NLP, and ABCD Team, used the same approach as for sub-task A. The only difference was that they trained a PLM on all 12 languages instead of training a monolingual sentiment model.

Subtask C: Zero-Shot Sentiment Classification Systems
We provide an overview of the top-10 submissions with system description papers in Table 2 and show the complete results for this sub-task in Table 5.
UCAS-IIE-NLP used the same approach described in sub-task A. They used additional lexicon information for zero-shot transfer to both Oromo and Tigrinya. UCAS-IIE-NLP ranked first for sub-task C, first for Oromo, and second for Tigrinya. Surprisingly, their best performance is below the AfriSenti baseline for Oromo (−1.28 F1), which is based on choosing the best source languages for zero-shot transfer. Muhammad et al. (2023) identified Hausa and Amharic as the best source languages for Oromo, and Hausa and Yorùbá as the best source languages for Tigrinya. Co-training on the two languages led to better performance.

Table 5: Task C results. The ranking is based on the average of the scores. Partial submissions are not included in the final ranking. (NR: No Ranking.)
NLNDE used the same approach as in sub-task B. They used forward and backward language selection to decide the best source languages to transfer from. For Oromo, the best source languages they identified were Kinyarwanda, Hausa, Yorùbá, and Xitsonga using forward selection, and Yorùbá, Mozambican Portuguese, and Xitsonga using backward selection. Similarly, for Tigrinya, they identified Hausa, Kinyarwanda, Amharic, Moroccan Arabic, and Mozambican Portuguese using forward selection, and Mozambican Portuguese, Yorùbá, and Hausa using backward selection. The languages selected were similar to those identified by Muhammad et al. (2023). NLNDE ranked second in sub-task C and first for Tigrinya.
Masakhane-AfriSenti used the multilingual model they introduced in sub-task B based on AfroXLMR-base and AfriBERTa. They also tried adapter-based training based on MAD-X (Pfeiffer et al., 2020). Their final result is based on an ensemble of the three methods.
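The paper does not state how the three methods' outputs were combined; a common choice is per-tweet majority voting over the models' predicted labels, which can be sketched as:

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine label predictions from several models by majority vote;
    ties go to the earliest model whose label reaches the top count."""
    combined = []
    for labels in zip(*model_predictions):
        counts = Counter(labels)
        top = max(counts.values())
        combined.append(next(l for l in labels if counts[l] == top))
    return combined

model_a = ["positive", "negative", "neutral"]
model_b = ["positive", "neutral",  "neutral"]
model_c = ["negative", "negative", "positive"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['positive', 'negative', 'neutral']
```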
FIT BUT (Aparovich et al., 2023) used AfroXLMR-small with additional adversarial training, but they only achieved average performance. This is probably due to the use of a small PLM for training.
Other teams like NLP-LISAC, UM6P, DN, GMNLP, and ABCD adopted approaches similar to those of sub-task B: they trained on all multilingual datasets and performed zero-shot evaluation on Oromo and Tigrinya.

Discussion
We summarize some of the approaches that led to the best results in different sub-tasks.
Sub-task A All of the top-10 best-performing teams with system description papers employed multilingual pre-trained models, especially Afrocentric models. For example, eight of the ten teams make use of AfroXLMR, one of the best-performing PLMs for African languages. AfroXLMR-large with additional pre-training often led to the best results, while multilingual PLMs like mDeBERTaV3 and LaBSE led to competitive results. A few teams used other PLMs trained specifically on Arabic varieties, such as DziriBERT. Some teams also reported significant language-specific improvements using further domain- and task-specific pre-training. For instance, the NLNDE team, which ranked first, used both language and task adaptive pre-training. UIO and Masakhane-AfriSenti also demonstrated the benefit of domain adaptive pre-training. In addition, PALI and Masakhane-AfriSenti showed that using a PLM that has already been trained on sentiment classification can help. Interestingly, teams using an ensemble of different fine-tuned PLMs tended to perform worse, which highlights that the quality of individual models is important.
Sub-task B Most teams used a single multilingual PLM and fine-tuned it on all languages. In fact, most of the best-ranking teams used AfroXLMR-large, as it performed well on sub-task A. The best-performing team for this task, NLNDE, chose the most appropriate source languages to co-train on for each target language before performing multilingual training, highlighting the importance of the choice of source languages.
Sub-task C UCAS-IIE-NLP ranked first and used a lexicon-based multilingual BERT. This shows the usefulness of leveraging sentiment lexicons as side information in building language models. However, their best performance was below the AfriSenti paper baseline for Oromo (−1.28 F1).
The top-performing teams in each subtask were not affiliated with African institutions. They developed the best models despite a lack of language expertise. This highlights both the generality of existing models and adaptation paradigms as well as the need for a more collaborative approach to building more effective and inclusive solutions for Africa-centric sentiment analysis.

Conclusion
We presented SemEval-2023 Task 12: Sentiment Analysis for African Languages, the first SemEval shared task that focuses on sentiment analysis for African languages. The task included monolingual classification (in Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Swahili, Twi, Xitsonga, and Yorùbá), multilingual classification, and zero-shot classification (in Oromo and Tigrinya). We described the task settings, datasets, and baselines.
We discussed the main findings of the 44 participating teams that submitted systems, based on their system description papers (29 papers) as well as our observations and analysis of some common errors. Overall, the best-ranking teams used pre-trained language models (PLMs), with Africa-centric models such as AfroXLMR performing best in the Task A (monolingual) and Task B (multilingual) classification tasks, with average weighted F1 scores of 71.3% and 75.06%, respectively. For Task C (zero-shot), the top team used a lexicon-based multilingual BERT and achieved an average weighted F1 of 58.15%. These scores indicate that there is still room for improvement in polarity classification in low-resource settings.
By sharing our insights, we aim to encourage researchers to work on under-resourced and understudied African languages and help them improve the performance of current sentiment analysis systems. In the future, we will extend our task to more languages by building additional datasets.

Ethics Statement
People often express sentiment in unique and interesting ways; thus, there is a large amount of person-to-person variation. Therefore, any automatic method for sentiment analysis will achieve different results on data from different people, different domains, etc. We do not recommend the use of automatic methods of sentiment analysis (based on individual instances of text) to make important decisions that can impact an individual. Instead, it is often better to use automatic sentiment analysis to determine broad trends of sentiment across large amounts of data. Sentiment analysis, like many other AI technologies, can be used not just for beneficial purposes but also to cause harm, such as identifying and suppressing dissent. There are several such ethical considerations that should be accounted for when developing and deploying sentiment analysis systems. We refer to Mohammad (2022, 2023) for a comprehensive discussion of ethical considerations relevant to sentiment and emotion analysis.

A Algorithms Used
Algorithms used by the participants for data preprocessing and for building the classification systems are shown in Figure 4.

B Tools Used
Tools used by the participants to implement their systems are shown in Figure 5.

C Participating Teams

Table 6 shows the participating teams, the tasks they made submissions for, their system description paper, and their affiliations.