AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

Africa is home to over 2000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the related challenges of curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).


Introduction
Africa has a long and rich linguistic history, experiencing language contact, language expansion, the development of trade languages, language shift, and language death on several occasions. The continent is incredibly linguistically diverse and home to over 2000 languages. This includes 75 languages with at least one million speakers each. Africa has a rich tradition of storytelling, poems, songs, and literature (Carter-Black, 2007; Banks-Wallace, 2002), while recent years have seen a proliferation of communication in digital and social media. Code-switching is common in these new forms of communication, where speakers alternate between two or more languages in the context of a single conversation (Santy et al., 2021; Angel et al., 2020; Thara and Poornachandran, 2018). However, despite this linguistic richness, African languages have been comparatively under-represented in natural language processing (NLP) research.
An influential sub-area of NLP deals with sentiment, valence, emotions, and affect in language (Liu, 2020). Computational analysis of emotion states in language and the creation of systems that predict these states from utterances have applications in literary analysis and culturonomics. An overview of the AfriSenti datasets is shown in Table 1. AfriSenti is an extension of NaijaSenti (Muhammad et al., 2022), a sentiment corpus in four major Nigerian languages: Hausa, Igbo, Nigerian Pidgin, and Yorùbá.
The datasets are used in the first Afrocentric SemEval shared task, SemEval 2023 Task 12: Sentiment analysis for African languages (AfriSenti-SemEval). AfriSenti allows the research community to build sentiment analysis systems for various African languages and enables the study of sentiment and contemporary language use in African languages. We publicly release the corpora, which provide further opportunities to investigate the difficulty of sentiment analysis for African languages.
Our contributions are: (1) the creation of the largest Twitter dataset for sentiment analysis in African languages by annotating 10 new datasets and curating four existing ones (Muhammad et al., 2022), (2) the discussion of the data collection and annotation process in 14 low-resource African languages, (3) the release of sentiment lexicons for all languages, and (4) the presentation of classification baseline results using our datasets.
Recent work in sentiment analysis has focused on sub-tasks that tackle new challenges, including aspect-based, multimodal (Liang et al., 2022), explainable (neuro-symbolic) (Cambria et al., 2022), and multilingual sentiment analysis (Muhammad et al., 2022). On the other hand, standard sentiment analysis sub-tasks such as polarity classification (positive, negative, neutral) are widely considered saturated and solved (Poria et al., 2020), with an accuracy of 97.5% in certain domains (Raffel et al., 2020; Jiang et al., 2020). However, while this may be true for high-resource languages in relatively clean, long-form text domains such as movie reviews, noisy user-generated data in under-represented languages still presents a challenge (Yimam et al., 2020). Additionally, African languages present new challenges for sentiment analysis, such as dealing with tone, code-switching, and digraphia (Adebara and Abdul-Mageed, 2022). Existing work in sentiment analysis for African languages has therefore mainly focused on polarity classification (Mataoui et al., 2016; El Abdouli et al., 2017; Moudjari et al., 2020; Yimam et al., 2020; Muhammad et al., 2022; Martin et al., 2021). With AfriSenti, we present the largest and most multilingual dataset for sentiment analysis in African languages.

Overview of the AfriSenti Datasets
AfriSenti covers 14 African languages, each with unique linguistic characteristics and writing systems, which are shown in Table 2. As shown in Figure 2, the dataset includes six languages of the Afroasiatic family, six languages of the Niger-Congo family, one from the English Creole family, and one from the Indo-European family.
Writing Systems Scripts serve not only as a means of transcribing spoken language, but also as powerful cultural symbols that reflect people's identity (Sterponi and Lai, 2014). For instance, the Bamun script is deeply connected to the identity of Bamun speakers in Cameroon, while the Geez/Ethiopic script (for Amharic and Tigrinya) evokes the strength and significance of Ethiopian culture (Sterponi and Lai, 2014). Similarly, the Ajami script, a variant of the Arabic script used in various African languages such as Hausa, serves as a reminder of the rich African cultural heritage of the Hausa community (Gee, 2005).
African languages, with a few exceptions, use the Latin script, written from left to right, or the Arabic script, written from right to left (Gee, 2005; Meshesha and Jawahar, 2008), with the Latin script being the most widely used in Africa (Eberhard et al., 2020). Ten of the fourteen languages in AfriSenti are written in the Latin script, two in the Arabic script, and two in the Ethiopic (or Geez) script. On social media, people may write Moroccan Arabic (Darija) and Algerian Arabic (Darja) in both Latin and Arabic characters for various reasons, including access to technology (Arabic keyboards were not easily accessible on commonly used devices for many years), code-switching, and other phenomena. This makes Algerian and Moroccan Arabic digraphic, i.e., their texts can be written in multiple scripts on social media. Similarly, Amharic is digraphic and is written in both the Latin and Geez scripts (Belay et al., 2021). This constitutes an additional challenge to the processing of these languages in NLP.

Geographic Representation AfriSenti covers the majority of African sub-regions. Many African languages are spoken in neighbouring countries within the same sub-region. For instance, variations of Hausa are spoken in Nigeria, Ghana, and Cameroon, while Swahili variants are widely spoken in East African countries, including Kenya, Tanzania, and Uganda. AfriSenti also includes datasets in the three languages with the highest numbers of speakers in Africa (Swahili, Amharic, and Hausa). We show the geographic distribution of the languages in AfriSenti in Figure 1. An overview of the new and existing datasets is given in Table 3. For the existing datasets whose test sets are public, we created new test sets to further evaluate performance in the AfriSenti-SemEval shared task.

Data Collection and Processing
Twitter's Limited Support for African Languages Since many people share their opinions on Twitter, the platform is widely used to study sentiment analysis (Muhammad et al., 2022). However, the Twitter API's support for African languages is limited, which makes it difficult for researchers to collect data. Specifically, the Twitter language API currently supports only Amharic out of the more than 2000 African languages. This disparity in language coverage highlights the need for further research and development in NLP for low-resource languages.

Tweet Collection
We used the Twitter Academic API to collect tweets. However, as the API does not provide language identification for tweets in African languages, we used location-based and vocabulary-based heuristics to collect the datasets.

Location-based data collection
For all languages except Algerian Arabic and Afaan Oromo, we used a location-based collection approach to filter results. Tweets were collected based on the names of the countries where the majority of the target language speakers are located. For Afaan Oromo, tweets were collected globally due to the small amount of data collected from Ethiopia.

Vocabulary-based Data Collection
As different languages are spoken within the same region in Africa (Amfo and Anderson, 2019), the location-based approach did not help in all cases. For instance, searching for tweets from "Lagos" (Nigeria) returned tweets in multiple languages, such as Yorùbá, Igbo, Hausa, Pidgin, English, etc.
To address these challenges, we combined the location-based approach with vocabulary-based collection strategies. These included the use of stopwords, sentiment lexicons, and a language detection tool. For languages that use the Geez script, we relied on the Ethiopic Twitter Dataset for Amharic (ETD-AM), which includes tweets collected since 2014 (Yimam et al., 2019).
Data collection using stopwords Most African languages do not have curated stopword lists (Emezue et al., 2022). Therefore, we created stopword lists for some AfriSenti languages and used them to collect data. We used corpora from different domains, i.e., news data and religious texts, to rank words based on their frequency (Adelani et al., 2021). We took the top 100 words, deleted domain-specific words (e.g., the word God in religious texts), and created lists based on the top 50 words that appeared across domains.
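The frequency-based procedure above can be sketched as follows; the whitespace tokenizer and the exact cross-domain intersection rule are illustrative assumptions on our part, while the top-100/top-50 cutoffs follow the text:

```python
# Hedged sketch of the frequency-based stopword extraction described above.
# Tokenization and the cross-domain intersection rule are assumptions.
from collections import Counter

def top_words(texts, k=100):
    """Rank words in one domain's corpus by frequency; keep the top k."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(k)]

def stopword_candidates(domain_corpora, per_domain=100, final_k=50):
    """Keep only words that rank highly in *every* domain, so that
    domain-specific words (e.g., 'God' in religious text) drop out."""
    ranked = [top_words(texts, per_domain) for texts in domain_corpora]
    shared = set(ranked[0]).intersection(*ranked[1:])
    # preserve the first domain's frequency order for the final cut
    return [w for w in ranked[0] if w in shared][:final_k]
```

As in the paper, verification of the resulting list by native speakers remains essential before using it for collection.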
We also used a word co-occurrence-based approach to extract stopwords (Liang et al., 2009) using text sources from different domains. We lower-cased the text, removed punctuation symbols and numbers, constructed a co-occurrence graph, and filtered out the words that occurred most often. Native speakers verified the generated lists before use. This approach worked best for Xitsonga.

Data collection using sentiment lexicons As data collection based on stopwords sometimes results in tweets that are inadequate for sentiment analysis (e.g., too many neutral tweets), we also used a sentiment lexicon, i.e., a dictionary of positive and negative words, for tweet collection. This allows for a balanced collection across sentiment classes (positive/negative/neutral). For Moroccan Darija, we used the emotion word list curated by Outchakoucht and Es-Samaali (2021). Table 4 provides details on the sentiment lexicons in AfriSenti and indicates whether they were manually created or translated.
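The co-occurrence-based stopword extraction described above can be sketched as follows; treating each short text as the co-occurrence window and ranking words by graph degree are assumptions, as the paper does not specify these details:

```python
# Hedged sketch of the co-occurrence-based stopword extraction.
# Window size (the whole text) and degree-based ranking are assumptions.
import re
from collections import defaultdict

def cooccurrence_stopwords(texts, top_k=50):
    """Lower-case, strip punctuation and numbers, build a word
    co-occurrence graph, and return the top_k most-connected words."""
    neighbours = defaultdict(set)
    for text in texts:
        words = re.sub(r"[^\w\s]|\d", " ", text.lower()).split()
        for w in words:
            neighbours[w].update(x for x in words if x != w)
    # words that co-occur with the most distinct words come first
    ranked = sorted(neighbours, key=lambda w: len(neighbours[w]), reverse=True)
    return ranked[:top_k]
```

As with the frequency-based lists, the output is only a candidate list that native speakers would verify before use.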
Data collection using mixed lists of words Besides stopwords and sentiment lexicons, native speakers provided lists of language-specific terms, including generic words. For instance, this strategy helped us collect Algerian Arabic tweets, where the generic terms included equivalents of words such as "the crowd" and names of Algerian cities.

Language Detection
As we mainly used heuristics for data collection, the results included tweets in languages other than the target one. For instance, when collecting tweets using lists of Amharic words, some returned tweets were in Tigrinya, due to Amharic-Tigrinya code-mixing. Similarly, we applied an additional manual filtering step for the Tunisian, Moroccan, and Modern Standard Arabic tweets that were returned when searching for Algerian Arabic ones due to overlapping terms.
Hence, we used different techniques for language detection as a post-processing step.
Language detection using existing tools Few African languages have pre-existing language detection tools (Keet, 2021). We used Google CLD3 and the Pycld2 library for the supported AfriSenti languages (Amharic, Oromo, and Tigrinya).
Manual language detection For languages without a pre-existing tool, detection was conducted by native speakers. For instance, annotators who are native speakers of Twi and Xitsonga manually labeled 2,000 tweets in these languages. In addition, as native speakers collected the Algerian Arabic tweets, they removed all tweets expressed in another language or Arabic variety.
Language detection using pre-trained language models To reduce the effort spent on language detection, we also used a pre-trained language model fine-tuned on 2,000 manually annotated tweets (Caswell et al., 2020) to identify Twi and Xitsonga.
Despite our efforts to detect the right languages, it is worth mentioning that as multilingualism is common in African societies, the final dataset contains many code-mixed tweets.

Tweet Anonymization and Pre-processing
We anonymized the tweets by replacing all @mentions with @user and removing all URLs. For the Nigerian language test sets, we further lower-cased the tweets (Muhammad et al., 2022).
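A minimal sketch of this anonymization step; the exact regular expressions are not given in the paper, so these patterns are illustrative:

```python
# Hedged sketch of tweet anonymization: @-mentions -> @user, URLs removed,
# optional lower-casing (applied to the Nigerian-language test sets).
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def anonymize(tweet: str, lowercase: bool = False) -> str:
    tweet = MENTION.sub("@user", tweet)
    tweet = URL.sub(" ", tweet)
    tweet = re.sub(r"\s+", " ", tweet).strip()  # tidy leftover whitespace
    return tweet.lower() if lowercase else tweet
```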

Data Annotation Challenges
Tweet samples were randomly selected based on the different collection strategies. Then, with the exception of the Ethiopian languages, each tweet was annotated by three native speakers. We followed the sentiment annotation guidelines of Mohammad (2016) and used majority voting to determine the final sentiment label for each tweet (Muhammad et al., 2022). We discarded the cases where all annotators disagreed. The datasets of the three Ethiopian languages (Amharic, Tigrinya, and Oromo) were annotated by two independent annotators and then curated by a third, more experienced individual who decided on the final gold labels.
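For the three-annotator languages, the aggregation scheme described above reduces to a few lines (a sketch; the label names are illustrative):

```python
# Majority vote over three annotations; a three-way disagreement
# yields None, i.e., the tweet is discarded (as described above).
from collections import Counter

def aggregate(labels):
    top, count = Counter(labels).most_common(1)[0]
    return top if count >= 2 else None
```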
Prabhakaran et al. (2021) showed that a majority vote conceals systematic disagreements between annotators resulting from their socio-cultural backgrounds and experiences. Therefore, we release all the individual labels to the research community. We report the free marginal multi-rater kappa scores (Randolph, 2005) in Table 5, since chance-adjusted scores such as Fleiss-κ can be low despite high agreement due to imbalanced label distributions (Randolph, 2005; Falotico and Quatto, 2015; Matheson, 2019). We obtained moderate to good levels of agreement (0.40-0.75) across all languages, except for Oromo, where we obtained a low agreement score due to the annotation challenges that we discuss in Section 5.

Table 6 shows the number of tweets in each of the 14 datasets. The Hausa collection of tweets is the largest AfriSenti dataset and the Xitsonga dataset is the smallest one. Figure 3 shows the distribution of the labeled classes in the datasets. We observe that the distribution for some languages, such as ha, is fairly balanced, while in others, such as pcm, the proportion of tweets in each class varies widely. Sentiment annotation for African languages presents some challenges (Muhammad et al., 2022) that we highlight in the following.
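For reference, the free-marginal multi-rater kappa reported in Table 5 has a simple closed form; this sketch assumes a per-item matrix `counts[i][j]` giving how many raters assigned category `j` to item `i`:

```python
# Randolph's (2005) free-marginal kappa: chance agreement is fixed at
# 1/k regardless of the observed label distribution, which is why it
# stays informative under the imbalanced classes noted above.
def free_marginal_kappa(counts, n_categories):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # observed agreement: proportion of agreeing rater pairs per item
    p_o = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    p_e = 1.0 / n_categories
    return (p_o - p_e) / (1 - p_e)
```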
Twi A significant portion of tweets in Twi were ambiguous, making it difficult to accurately categorize their sentiment. Some tweets contained symbols that are not in the Twi alphabet, a frequent occurrence due to the lack of support for certain Twi letters on keyboards (Scannell, 2011). For example, the letter "ɔ" is often replaced by the English letter "c", and "ɛ" by the digit "3".
Additionally, tweets are more often annotated as negative (cf. Figure 3). This is due to some common expressions that can be seen as offensive depending on the context. For instance, "Tweaa" was once considered an insult but has become a playful expression through trolling, and "gyae gyimii" is commonly used by young people to say "stop" while its literal meaning is "stop fooling".
Mozambican Portuguese and Xitsonga One of the significant challenges for the Mozambican Portuguese and Xitsonga data annotators was the presence of code-mixed and sarcastic tweets. Code-mixing made it challenging for annotators to determine the intended meaning of a tweet, as it involved multiple languages spoken in Mozambique that some annotators did not understand. Similarly, the presence of two variants of Xitsonga spoken in Mozambique (Changana and Ronga) added to the complexity of the annotation task. Additionally, sarcasm was a source of disagreement among annotators, leading to the exclusion of many tweets from the final dataset.
Ethiopian languages For Oromo and Tigrinya, challenges included finding annotators and the lack of a reliable Internet connection and access to personal computers. Although we trained the Oromo annotators, we observed severe problems in the quality of the annotated data which led to a low agreement score.
Algerian Arabic For Algerian Arabic, the main challenge was the use of sarcasm. When this caused a disagreement among the annotators, the tweet was further labeled by two additional annotators. If all the annotators did not agree on one final label, we discarded it. As Twitter is also commonly used to discuss controversial topics in the region, we removed offensive tweets.

Setup
For our baseline experiments, we considered three settings: (1) monolingual baseline models based on multilingual pre-trained language models for the 12 AfriSenti languages with training data, (2) multilingual training on all 12 languages and evaluation on a combined test set of all 12 languages, and (3) zero-shot transfer to Oromo (orm) and Tigrinya (tir) from each of the 12 languages with available training data.
Monolingual baseline models We fine-tune massively multilingual pre-trained language models (PLMs) trained on 100 languages from around the world as well as Africa-centric PLMs trained exclusively on languages spoken in Africa. For the massively multilingual PLMs, we selected two representative models: XLM-R-{base & large} (Conneau et al., 2020) and mDeBERTaV3 (He et al., 2021). For the Africa-centric models, we make use of AfriBERTa-large (Ogueji et al., 2021a) and AfroXLMR-{base & large} (Alabi et al., 2022), an XLM-R model adapted to African languages. AfriBERTa was pre-trained from scratch on 11 African languages, including nine of the AfriSenti languages, while AfroXLMR supports 10 of the AfriSenti languages. Additionally, we fine-tune XLM-T (Barbieri et al., 2022a), an XLM-R model adapted to the multilingual Twitter domain, supporting over 30 languages but fewer African languages due to a lack of coverage by Twitter's language API (cf. §4).

Table 7 shows the results of the monolingual baseline models on AfriSenti. AfriBERTa obtained the worst performance on average (61.7), especially for languages it was not pre-trained on, e.g., < 50 for the Arabic dialects. However, it achieved good results for languages it was pre-trained on, such as hau, ibo, swa, and yor. XLM-R-base led to performance comparable to AfriBERTa on average: worse for most African languages, but better for the Arabic dialects and pt-MZ. On the other hand, AfroXLMR-base and mDeBERTaV3 achieve similar performance, although AfroXLMR-base performs slightly better for kin and pcm compared to other models. Overall, considering models with up to 270M parameters, XLM-T achieves the best performance, which highlights the importance of domain-specific pre-training. XLM-T performs particularly well on the Arabic and Portuguese dialects, i.e., arq, ary, and pt-MZ, where it outperforms AfriBERTa by 21.8, 14.2, and 13.0 F1 points and AfroXLMR-base by 4.0, 5.9, and 4.7 F1 points, respectively.
AfroXLMR-large achieves the best overall performance and improves over XLM-T by 2.5 F1 points, which highlights the benefit of scaling for large PLMs. Scaling is of limited use for XLM-R-large, however, as it was not pre-trained on many of the African languages. Overall, our results demonstrate the importance of both language- and domain-specific pre-training as well as the benefits of scale for appropriately pre-trained models.

Table 8 shows the performance of multilingual models that were fine-tuned on the combined training data and evaluated on the combined test data of all languages. As before, AfroXLMR-large achieves the best performance, outperforming AfroXLMR-base, XLM-R-large, and XLM-T-base by more than 2.5 F1 points.

Experimental Results
Finally, Table 9 shows the zero-shot cross-lingual transfer performance of models trained on different source languages with available training data and evaluated on the test-only languages orm and tir. The best source languages are Hausa or Amharic for orm, and Hausa or Yorùbá for tir. Hausa even outperforms a multilingually trained model. The impressive transfer performance between Hausa and Oromo may be because both are from the same language family and share the same Latin script. In addition, Hausa has the largest training dataset in AfriSenti. Both linguistic similarity and the size of the source-language data have been shown to correlate with successful cross-lingual transfer (Lin et al., 2019). However, it is unclear why Yorùbá performs particularly well for tir despite the difference in script. One hypothesis is that Yorùbá may be a good source language in general, as shown in Adelani et al. (2022), where Yorùbá is the second-best source language for named entity recognition in African languages.

Conclusion and Future Work
We presented AfriSenti, a collection of Twitter sentiment datasets annotated by native speakers in 14 African languages, used in the first Afro-centric SemEval shared task, SemEval 2023 Task 12: Sentiment analysis for African languages (AfriSenti-SemEval). We reported the challenges faced during data collection and annotation, as well as experimental results using state-of-the-art pre-trained language models. We release the datasets and data resources to the research community. AfriSenti opens up new avenues for sentiment analysis research in under-represented languages. In the future, we plan to extend AfriSenti to more African languages and different sentiment analysis sub-tasks.

A Focus Languages
Afaan Oromo Afaan Oromo is spoken by more than 37 million speakers and is written in the Latin script (Eberhard et al., 2020). It is the most widely spoken language in Ethiopia and the third most widely spoken language in Africa, after Arabic and Hausa. In the Horn of Africa, including Ethiopia, Kenya, and Somalia, there are over 45 million native Afaan Oromo speakers.
Algerian Arabic/Darja Algerian Arabic (Darja) is the Arabic "dialect" spoken in Algeria. It varies across Algerian regions (Bougrine et al., 2017) and is mastered by almost all Algerians (more than 40 million people). Its vocabulary is mostly Arabic, but it also contains Berber (Amazigh), French, Andalusian Arabic, Turkish, and Spanish influences and loanwords (Elimam, 2009; Haspelmath and Tadmor, 2009; Harrat et al., 2016).
Amharic Amharic is an Ethio-Semitic, Afro-Asiatic language. It is spoken in Ethiopia, Israel, and the United States (Eberhard et al., 2020). It has about 57 million speakers, 32 million of whom are native speakers, and uses the Ge'ez (Fidel) script for writing.
Kinyarwanda Kinyarwanda is a language spoken in Central and East Africa. It is the official language of Rwanda but is also spoken in Uganda, the D.R.C., Burundi, and Tanzania, with over 13 million speakers in total. It is one of the major Bantu languages and is mutually intelligible with Kirundi. Kinyarwanda uses the Latin alphabet, with 24 of the letters used in English (excluding x and q).
Moroccan Darija Moroccan Arabic (Darija) is the dialect of Arabic spoken in Morocco. It is a mixture of Classical Arabic, Berber, and French, with some Spanish and Portuguese influences. According to the 2014 general census (http://rgphentableaux.hcp.ma/Default1/), 92% of the Moroccan population speak Darija. This dialect retains many characteristics that make it unique among other dialects: its phonology and syntax are quite different from other forms of spoken Arabic. However, Darija is not widely used outside Morocco, and it may therefore be difficult to find resources either online or in print, which makes it severely low-resourced, like the majority of African vernaculars. Its written form has only started appearing on social media, using either the Arabic script or a mix of numbers and the Latin alphabet.
Mozambican Portuguese The Portuguese spoken in Mozambique is called Mozambican Portuguese, commonly referred to as the Portuguese of Mozambique. It differs from other Portuguese variants in a few ways, such as the lexicon, which incorporates many African terms and expressions that are common in Mozambique and other Portuguesespeaking nations. Additionally, it has a distinctive accent and rhythm that are affected by the Mozambican languages used locally in its pronunciation. Some grammar forms and structures in Mozambican Portuguese are different from those used in European Portuguese. Additionally, there are loan terms that were acquired from Mozambican and other African languages.
Tigrinya Also spelt Tigrigna, Tigrinya is a Semitic language spoken in the Tigray region of Ethiopia and in Eritrea. The language uses the Geez script with some additional Tigrinya letters and is closely related to Geez and Amharic. It has around 10 million speakers, 6.4 million of whom are found in the Ethiopian Tigray region.
Xitsonga/Tsonga Xitsonga is a Bantu language originally from Mozambique but also spoken in different southern African countries. In Mozambique, the same language is referred to as Changana. It is part of the Tswa-Ronga language group, which also includes Tshwa and Ronga. These three languages are mutually intelligible, meaning that speakers of one can understand the other two languages in the group. In addition to Mozambique, Xitsonga is also spoken in South Africa, Eswatini (formerly Swaziland), and Zimbabwe. In Mozambique, the following dialectal variants of Changana are recognized: Xihlanganu, Xidzonga, Xin'walungu, Xibila, and Xihlengwe. According to Omniglot, there are about 8.9 million speakers of Xitsonga, including 5.68 million in South Africa (in 2013) and 3.1 million speakers in Mozambique (in