NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification

Africa has over 2,000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labelled corpora for African languages. However, these corpora are often available in only a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, NollySenti, based on Nollywood movie reviews for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. By leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain with cross-lingual adaptation from the English language. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same target language. To further mitigate the domain difference, we leverage machine translation from English to the other Nigerian languages, which leads to a further improvement of 7% over the cross-lingual evaluation. While machine translation to low-resource languages is often of low quality, our analysis shows that sentiment-related words are often preserved.


Introduction
Nigeria is the sixth most populous country in the world 1 and the most populous in Africa, with over 500 languages (Eberhard et al., 2021). These languages are spoken by millions of speakers, and the four most widely spoken indigenous languages (Hausa, Igbo, Nigerian-Pidgin (Naija), and Yorùbá) have more than 25 million speakers, yet they are still under-represented in NLP research (Adebara and Abdul-Mageed, 2022; van Esch et al., 2022). The development of NLP for Nigerian languages and other African languages is often limited by a lack of labelled datasets (Adelani et al., 2021b; Joshi et al., 2020). While there has been some progress in recent years (Eiselen, 2016; Adelani et al., 2022b; NLLB-Team et al., 2022; Muhammad et al., 2023; Adelani et al., 2023), most benchmark datasets for African languages are only available in a single domain and may not transfer well to other target domains of interest (Adelani et al., 2021a).
One of the most popular NLP tasks is sentiment analysis. In many high-resource languages like English, sentiment analysis datasets are available across several domains, such as social media posts/tweets (Rosenthal et al., 2017), product reviews (Zhang et al., 2015; He and McAuley, 2016), and movie reviews (Pang and Lee, 2005; Maas et al., 2011). However, for Nigerian languages, the only available dataset is NaijaSenti (Muhammad et al., 2022), a Twitter sentiment classification dataset for the four most widely spoken Nigerian languages. It is unclear how well it transfers to other domains.
In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create the first sentiment classification dataset for Nollywood movie reviews, known as NollySenti, a dataset for five widely spoken Nigerian languages (English, Hausa, Igbo, Nigerian-Pidgin, and Yorùbá). Nollywood is the home of Nigerian movies that depict the Nigerian people and reflect the diversity across Nigerian cultures. We chose this domain because Nollywood is the second-largest movie and film industry in the world by annual output 2 , and because Nollywood reviews are available on several online websites. However, most of these online reviews are only in English. To cover more languages, we asked professional translators to translate about 1,000-1,500 reviews from English to four Nigerian languages, similar to Winata et al. (2023). Thus, NollySenti is a parallel multilingual sentiment corpus for five Nigerian languages that can be used both for sentiment classification and for the evaluation of machine translation (MT) models in the user-generated text domain, which is often scarce for low-resource languages.
Additionally, we provide several supervised and transfer learning experiments using classical machine learning methods and pre-trained language models. By leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain to the Movie domain with cross-lingual adaptation from the English language. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from the Twitter domain in the same target language. To further mitigate the domain difference, we leverage MT from English to the other Nigerian languages, which leads to a further improvement of 7% over the cross-lingual evaluation. While MT to low-resource languages is often of low quality, through human evaluation we show that most of the translated sentences preserve the sentiment of the original English reviews. For reproducibility, we have released our datasets and code on GitHub at https://github.com/IyanuSh/NollySenti.

Related Work
African sentiment datasets There are only a few sentiment classification datasets for African languages, such as an Amharic dataset (Yimam et al., 2020) and NaijaSenti (Muhammad et al., 2022) for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá. Recently, Muhammad et al. (2023) expanded the sentiment classification dataset to 14 African languages. However, all these datasets belong to the social media or Twitter domain. In this work, we create a new dataset for the Movie domain based on human translation from English to Nigerian languages, similar to the NusaX parallel sentiment corpus for 10 Indonesian languages (Winata et al., 2023).
MT for sentiment classification In the absence of training data, MT models can be used to translate texts from a high-resource language like English to other languages, but they often introduce errors that may lead to poor performance (Refaee and Rieser, 2015; Poncelas et al., 2020). However, they have considerable potential, especially when translating between high-resource languages such as European languages, and especially when combined with English (Balahur and Turchi, 2012, 2013). In this paper, we extend MT for sentiment classification to four low-resource Nigerian languages. This paper is an extension of the YOSM paper (Shode et al., 2022), a Yorùbá movie sentiment corpus.

Focus Languages
We focus on four Nigerian languages from three different language families, spoken by 30M-120M people.
Hausa belongs to the Afro-Asiatic/Chadic language family with over 77 million speakers (Eberhard et al., 2021). It is native to Nigeria, Niger, Chad, Cameroon, Benin, Ghana, Togo, and Sudan; however, the largest population of speakers resides in northern Nigeria. Hausa is morphologically agglutinative and tonal, with two tones: low and high. It is written in two major scripts: Ajami (an Arabic-based script) and the Boko script (Latin-based), which is the most widely used. The Boko script makes use of all the Latin letters except "p, q, v, and x", and adds the letters "ɓ, ɗ, ƙ, ʼy, kw, ƙw, gw, ky, ƙy, gy, sh, and ts".
Igbo belongs to the Volta-Niger sub-group of the Niger-Congo language family with over 31 million speakers (Eberhard et al., 2021). It is native to South-Eastern Nigeria, and is also spoken in Cameroon and Equatorial Guinea in Central Africa. Igbo is morphologically agglutinative and tonal, with two tones: high and low. The language utilizes 34 Latin-based letters, excluding "c, q, and x" but including the additional letters "ch, gb, gh, gw, kp, kw, nw, ny, ọ, ṅ, ụ, and sh".
Nigerian-Pidgin, also known as Naija, belongs to the English Creole Atlantic Krio language family, with over 4 million native speakers and more than 116 million second-language speakers. It is an English-based pidgin that has become a creole, since it is used as a first language in certain ethnic communities (Mazzoli, 2021). It serves as a lingua franca, facilitating communication between several ethnic groups. Naija uses the same 26 letters as English and has an analytic sentence morphology.
Yorùbá belongs to the Volta-Niger branch of the Niger-Congo language family with over 50 million speakers (Eberhard et al., 2021), making it the third most spoken indigenous African language. Yorùbá is native to South-Western Nigeria, Benin, and Togo, and is also spoken in other parts of West Africa such as Sierra Leone, Côte d'Ivoire, and The Gambia, as well as in the Americas (Cuba, Brazil, and some Caribbean countries). Yorùbá is an isolating language in terms of sentence morphology and is tonal, with three lexical tones (high, mid, and low) that are usually marked by diacritics on vowels and syllabic nasals. The Yorùbá orthography comprises 25 Latin-based letters, excluding "c, q, v, x, and z" but including the additional letters "gb, ẹ, ṣ, and ọ".

NollySenti creation
Unlike Hollywood movies, which are heavily reviewed with hundreds of thousands of reviews all over the internet, there are fewer reviews of Nigerian movies despite their popularity. Furthermore, there is no online platform dedicated to writing or collecting movie reviews in the four indigenous Nigerian languages; we only found reviews in English. Here, we describe the data source for the Nollywood reviews and how we created parallel review datasets for four Nigerian languages. Table 1 shows the data sources for the NollySenti review dataset. We collected 1,018 positive reviews (POS) and 882 negative reviews (NEG). These reviews were accompanied by ratings and were sourced from three popular online movie review platforms (IMDB, Rotten Tomatoes, and Letterboxd), as well as four Nigerian websites such as Cinemapointer and Nollyrated. Our annotation focused on classifying the reviews based on the rating that the reviewer gave the movie: we defined ratings between 0-4 as NEG and ratings between 7-10 as POS.
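The rating-to-label scheme above can be sketched in a few lines of Python. Note that the handling of ratings in the 5-6 band is an assumption: the paper only defines the 0-4 and 7-10 ranges, so such reviews are returned as unlabelled here.

```python
def rating_to_label(rating):
    """Map a 0-10 reviewer rating to a NollySenti sentiment label:
    0-4 -> NEG, 7-10 -> POS. Ratings of 5-6 fall outside both bands
    defined in the paper and are returned as None (an assumption)."""
    if 0 <= rating <= 4:
        return "NEG"
    if 7 <= rating <= 10:
        return "POS"
    return None


# Example: a 2-star review is NEG, a 9-star review is POS.
assert rating_to_label(2) == "NEG"
assert rating_to_label(9) == "POS"
```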

Human Translation
We hire professional translators in Nigeria and ask them to translate 1,010 reviews randomly chosen from the 1,900 English reviews. Thus, we have a parallel review dataset in English and the other Nigerian languages, with the corresponding ratings. For quality control, we ask one native speaker per language to manually verify the quality of over 100 randomly selected translated sentences; we confirm that they are good translations and not the output of Google Translate (GT). 4 All translators were properly remunerated according to the country rate 5 . In total, we translated 500 POS reviews and 510 NEG reviews. We decided to add 10 more NEG reviews since they are often shorter, sometimes a single word (e.g., "disappointing").

Experimental Setup
Data Split Table 2 shows the split of the data into Train, Dev, and Test. The split is 410/100/500 for hau, ibo, and pcm. To experiment with the benefit of adding more reviews, we translated 490 additional reviews for yor, giving a split of 900/100/500; for eng, the split is 1,300/100/500. We use the same reviews for Dev and Test across all languages. For the transfer learning and machine translation experiments, we make use of all the English training reviews (i.e., 1,300). We use a larger test set (i.e., 500 reviews) for hau, ibo, and pcm because the focus of our analysis is zero-shot transfer; we follow a similar data split to XCOPA (Ponti et al., 2020), COPA-HR (Ljubesic and Lauc, 2021), and the NusaX datasets. The small number of training examples in NollySenti provides an opportunity for researchers to develop more data-efficient cross-lingual methods for under-resourced languages, since this is a more realistic scenario.

Baseline Models
Here, we train sentiment models using classical machine learning methods, namely logistic regression and Support Vector Machines (SVM), and fine-tune several pre-trained language models (PLMs). Unlike classical ML methods, PLMs can be used for cross-lingual transfer and often achieve better results (Devlin et al., 2019; Winata et al., 2023). We fine-tune the following PLMs: mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), mDeBERTaV3 (He et al., 2021), AfriBERTa (Ogueji et al., 2021), and AfroXLMR (Alabi et al., 2022). The last two PLMs have been pre-trained on or adapted to all the focus languages. For XLM-R and AfroXLMR, we make use of the base versions. The classical ML methods were implemented using Scikit-Learn (Pedregosa et al., 2011). Appendix B provides more details.

Baseline Results
Table 3 provides the baseline results for logistic regression, SVM, and several PLMs. All baselines achieve over 80% accuracy on average. However, in all settings (i.e., across all languages and training sizes, N = 400, 900, and 1,300), PLMs exceed the performance of the classical machine learning methods by over 5-7%. In general, we find that Africa-centric PLMs (AfriBERTa-large and AfroXLMR-base) achieve better accuracy than massively multilingual PLMs pre-trained on around 100 languages. Overall, AfriBERTa achieves the best result on average, but is slightly worse for English and Nigerian-Pidgin (an English-based creole language) since it was not pre-trained on English.
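As an illustration, a classical baseline of the kind described above can be sketched with scikit-learn. The toy reviews, the TF-IDF settings, and the use of plain logistic regression are assumptions for the sketch, not the exact NollySenti configuration.

```python
# Minimal sketch of a classical sentiment baseline: TF-IDF features
# feeding a logistic regression classifier (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews for illustration only (not NollySenti data).
train_texts = [
    "a wonderful and moving film",
    "brilliant acting, enjoyed it",
    "boring plot and weak acting",
    "a disappointing waste of time",
]
train_labels = ["POS", "POS", "NEG", "NEG"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["enjoyed this wonderful film"])[0])  # POS
```

An SVM baseline is obtained by swapping `LogisticRegression()` for `sklearn.svm.LinearSVC()` in the same pipeline.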

Zero-shot Evaluation Results
We make use of AfriBERTa for the zero-shot evaluation since it gave the best result in Table 3 (see avg. excl. eng). Table 4 shows the zero-shot evaluation results.

Performance of Cross-domain adaptation
We obtain an impressive zero-shot result by evaluating a Twitter sentiment model on movie reviews (73.8 accuracy on average). All languages score above 70 except yor.
Performance of Cross-lingual adaptation We evaluate two sentiment models, trained on either imdb or NollySenti (eng) English reviews. Our results show that adaptation from imdb performs similarly to cross-domain adaptation, while NollySenti (eng) exceeds it by over +6%. The imdb model is probably worse, despite the larger training size, due to a slight domain difference between Hollywood and Nollywood reviews, possibly caused by differences in writing style and vocabulary among English dialects (Blodgett et al., 2016). An example of a review with multiple indigenous named entities and a NEG sentiment is "'Gbarada' is a typical Idumota 'Yoruba film' with all the craziness that comes with that sub-section of Nollywood.", which may not occur frequently in Hollywood reviews. Another observation is that the performance on pcm is unsurprisingly good in both setups (84.0 to 86.2) because it is an English-based creole.

Machine Translation improves adaptation
To mitigate the domain difference, we find that automatically translating N=410 reviews using a pre-trained MT model improves the average zero-shot performance by over +4%. With additional machine-translated reviews (N=1,300), the average performance improves further by +3%. Combining all translated sentences with the English reviews does not seem to help. Our result is quite competitive with the supervised baseline (−1.9%). As an additional experiment, we use MT to translate 25k IMDB reviews; the result is slightly worse than NollySenti (lang), which further confirms the slight domain difference between the two datasets.
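The translate-train setup above can be sketched as follows. Here `translate_fn` is a hypothetical stand-in for any pre-trained MT model (e.g., an eng-to-yor system); the key property exploited is that sentiment labels carry over unchanged from the English source reviews.

```python
def build_translate_train_set(english_reviews, translate_fn):
    """Create synthetic target-language training data by machine-translating
    English reviews while keeping their sentiment labels unchanged.

    english_reviews: list of (text, label) pairs, e.g. ("Great movie!", "POS")
    translate_fn:    any callable mapping text -> translated text (a stand-in
                     for a real pre-trained MT model)
    """
    return [(translate_fn(text), label) for text, label in english_reviews]


# Demo with a fake "MT model" that only tags the text (illustration only).
fake_mt = lambda text: "[yor] " + text
train_eng = [("A wonderful story", "POS"), ("Disappointing", "NEG")]
train_yor = build_translate_train_set(train_eng, fake_mt)
# The synthetic target-language set keeps the original label distribution.
assert [label for _, label in train_yor] == ["POS", "NEG"]
```

The resulting synthetic pairs can then be fed to any of the baselines above in place of (or in addition to) human-translated training data.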
Sentiment is often preserved in MT-translated reviews Table 5 shows that despite the low BLEU scores (< 15) for hau, ibo, and yor, native speakers of these languages (two per language) rated the machine-translated reviews much better than average in terms of content preservation or adequacy (3.8 to 4.6 for all languages on a Likert scale of 1-5). Not only do the MT models preserve content; native speakers also rated their outputs as preserving sentiment at least 90% of the time, even for some translated texts with low adequacy ratings. Appendix C provides more details on the human evaluation and examples.

Conclusion
In this paper, we focused on the task of sentiment classification for cross-domain adaptation. We developed a new dataset, NollySenti, for five Nigerian languages. Our results show the potential of both transfer learning and MT for developing sentiment classification models for low-resource languages. As future work, we would like to extend the creation of movie sentiment corpora to more African languages.

Limitations
One limitation of our work is that it requires machine translation models of reasonably good quality to generate synthetic reviews for sentiment classification. While our approach seems to work well for some low-resource languages like yor, with a BLEU score of 3.53, it may not generalize to other tasks such as question answering, where translation errors may be more critical.

Ethics Statement
We believe our work will benefit the speakers of the languages under study and the Nollywood industry. We look forward to seeing how this dataset can be used to improve the processes of the Nollywood industry and provide data analytics on movies. We acknowledge that there may be some bias introduced by manually translating the dataset from English, but we do not see any potential harm in releasing this dataset. While the texts were crawled online, they do not contain personally identifying information.

A Focus Languages
We focus on four Nigerian languages from three different language families. Hausa (hau) is from the Afro-Asiatic/Chadic family, spoken by over 77 million (M) people. Igbo (ibo) and Yorùbá (yor) are both from the Niger-Congo/Volta-Niger family, spoken by 30M and 46M people respectively, while Nigerian-Pidgin (pcm) is from the English Creole family, spoken by over 120M people. Nigerian-Pidgin is ranked the 14th most spoken language in the world (https://www.ethnologue.com/guides/ethnologue200). All the languages make use of the Latin script. Except for Nigerian-Pidgin, they are all tonal languages. Also, Igbo and Yorùbá make extensive use of diacritics in texts, which are essential for the correct pronunciation of words and for reducing ambiguity in understanding their meanings.

B Hyper-parameters for PLMs
For fine-tuning PLMs, we make use of HuggingFace transformers (Wolf et al., 2019). We use a maximum sequence length of 200, a batch size of 32, 20 epochs, and a learning rate of 5e-5 for all PLMs.

C Human Evaluation
To verify the performance of the MT models, we hire at least two native speakers of each Nigerian indigenous language: three native Igbo speakers, four native Yorùbá speakers, four native speakers of Nigerian Pidgin, and two native Hausa speakers. The annotators were individually given 100 randomly selected translated reviews in Excel sheets and asked to report the adequacy and sentiment preservation (1 if the sentiment is preserved, 0 otherwise) of the MT outputs. Alongside the sheets, the annotators were given an annotation guideline to guide them during the course of the annotation. Aside from being of Nigerian descent and native speakers of the selected languages, the annotators hold at least a bachelor's degree, which qualifies them to efficiently read, write, and comprehend the annotation materials and the data to be annotated.
To measure the consistency of our annotators, we repeated 5 of the 100 examples; our annotators were consistent in their annotations. We also measure the inter-annotator agreement between the two annotators per task. For adequacy, the annotators achieved Krippendorff's alpha scores of 0.675, 0.443, 0.41, and 0.65 for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, respectively. Similarly, for sentiment preservation, the Krippendorff's alpha scores were 1.0, 0.93, 0.48, and 0.52 for Hausa, Igbo, Nigerian-Pidgin, and Yorùbá, respectively. In general, annotators rated the translated texts with an adequacy between 3.8 and 4.6. As shown in Table 5, Nigerian-Pidgin (4.6) achieved the best adequacy result because of its closeness to English, while Igbo was rated with the lowest adequacy score (3.8). Overall, all annotators rated the translated sentences as preserving sentiment at least 90% of the time, i.e., 90 out of 100 translations preserve the original sentiment of the English sentence.
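For reference, Krippendorff's alpha for nominal data (the agreement measure used above) can be computed with a short script. This is a minimal sketch for complete two-annotator data, not the exact tool used in the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(annotations):
    """Krippendorff's alpha for nominal data.

    annotations: list of per-item label tuples, one label per annotator,
    e.g. [(1, 1), (0, 1), ...] for the binary sentiment-preservation task.
    Assumes complete data (every annotator labels every item)."""
    # Build the coincidence matrix: each ordered pair of values within an
    # item contributes 1/(m-1), where m is the number of labels for the item.
    coincidence = Counter()
    for labels in annotations:
        m = len(labels)
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    n = sum(coincidence.values())
    totals = Counter()
    for (a, _b), w in coincidence.items():
        totals[a] += w
    # Observed vs. expected disagreement over off-diagonal cells.
    observed = sum(w for (a, b), w in coincidence.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 - observed / expected


# Perfect agreement (e.g. Hausa sentiment preservation) gives alpha = 1.0.
assert krippendorff_alpha_nominal([(1, 1), (0, 0), (1, 1)]) == 1.0
```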

C.1 Qualitative analysis
The human evaluation serves to manually verify the quality of over 100 randomly selected translated sentences. The reports from the annotators were also aggregated to support our claim that sentiment is usually preserved in MT outputs. The examples listed in Table 6 were extracted during the annotation process and illustrate noticeable mistakes in the MT outputs. The annotators were asked to give a rating on a 1-5 scale indicating whether a randomly selected machine-translated review is adequately translated, and a binary 0-1 rating indicating whether the sentiment of the original review is retained in the translation.
The examples listed in Table 6 show how incorrect or incomplete translations can change the idea and meaning of a movie review originally written in English, which could eventually lead to losing the sentiment of the review. Also, as shown in Table 6, the sentiment of some reviews is preserved regardless of the incorrect or missing translations, and the idea or meaning of the review is not totally lost.

Table 6: Examples of translation mistakes observed and their impact on the sentiment (target language: pcm, Nigerian-Pidgin). The gray color in the table identifies the sentiment portion of the review. The rows include:

- Incorrect and incomplete translation, sentiment not preserved. English: "Temi Otedola's performance was truly stunning. I thoroughly enjoyed the layers that the story had and the way that each key piece of information was revealed." Only the second sentence ("I thoroughly enjoyed the layers that the story had and the way that each key piece of information was revealed.") is translated.

- Incorrect translation, sentiment preserved. English: "Nice cross-country movie. The only thing that I don't like about this movie is the way there was little or no interaction with the Nigerian or Indian environment. Beautiful romantic movie." MT output: "The only thing wey I no like about this film na because e no too get interaction with Nigerian or Indian people." (back-translation: "The only thing that I don't like about this movie is the way there was little or no interaction with the Nigerian or Indian people.")

- Incorrect and incomplete translation, sentiment not preserved. English: "A flawed first feature film, but it shows a great deal of promise." MT output: "Fear first feature film, but e show plenti promise." (back-translation: "Fear was featured in the film firstly but it shows a great deal of promise.")

- English: "Spot On!!! Definitely African movie of the year, enjoyed every minute of the 2hours 30minutes." MT output: "Na almost every minute of the 2hours 30minutes wey dem take play for Africa film dem dey play." (back-translation: "It is almost every minute of the 2hours 30minutes that they play African movie they play.")

C.2 Annotation Guideline
We provide the annotation guideline on GitHub 8 .