With Prejudice to None: A Few-Shot, Multilingual Transfer Learning Approach to Detect Social Bias in Low Resource Languages



Introduction
Social media has become one of the most important sources of information for users in recent years (Twenge and Campbell, 2019). The rise in the use of social media platforms (Kemp, 2020), combined with the ease of access, allows an unrestricted flow of content containing social biases and stereotypes on such sites. Furthermore, inequalities in social media use exist across countries and regions as a result of various societal norms, cultures, and histories. Thus, prejudice and societal biases vary across cultures.
Figure 1: Examples from our dataset. Each post is annotated with multiple categorical labels. In addition, annotators are asked to write the rationale behind the social bias in the post.
Motivation: There has been considerable focus in recent years on how to identify social bias, either in data or in models. This is because it is crucial that the systems we create do not encourage pre-existing prejudices. While there has been some study of this topic, it has largely been limited to high-resource languages like English (Dev et al., 2022; Röttger et al., 2022).
Although social bias is inextricably linked to the cultural and linguistic characteristics of a language, non-English datasets (Lauscher et al., 2020; Kurpicz-Briki, 2020; Liang et al., 2020) are limited, hindering the development of social bias identification in other languages. In this work, we extend the bias detection task from English to non-English languages. The results show that large datasets are not always required to develop efficient methods for identifying social bias in these resource-constrained languages.
We address the identification of social bias in the Hindi language. With 602 million active speakers, Hindi is the world's third most spoken language. Despite the fact that a sizable proportion of these speakers choose to communicate online in Devanagari (the script for Hindi), there has been essentially no research on social bias detection in Hindi. Kumar et al. (2021a) focus on social bias detection in code-mixed and transliterated Hindi, and Bhatt et al. (2022) release a social bias benchmarking dataset using Indian English. Although both of these datasets focus on social biases in the Indian context, neither of them is in the Hindi language.
In this paper, we present our manually annotated dataset for social bias detection in Hindi. We have annotated ∼9K online social media posts with multiple categorical labels. We empirically investigate the impact of languages from two different language families on the downstream task of social bias detection. Further, we show that translating all datasets into English does not perform well for bias detection.
Our contributions are:
• Identification of social bias in text across four languages (Hindi, English, Italian, and Korean) using multilingual transfer learning.
• A new social bias detection dataset in Hindi with ∼9K instances, along with an accompanying annotation guideline, which will be a valuable resource for researchers studying social bias detection in low-resource languages.
• Baseline experiments as useful benchmarks for future research on social bias detection in Hindi and other languages.
The rest of the paper is organized as follows: related works are discussed in Section 2. Section 3 gives insight into our dataset, terminologies, and annotation process. The methodologies and experiments are discussed in Section 4. A detailed error analysis is presented in Section 5, followed by the concluding remarks and a discussion of future work in Section 6.

Related Works
The presence of social bias in language representations is mostly caused by undesired and skewed associations within the training data. Given the growing social effect of NLP applications, studying these undesired relationships is paramount (Bender and Friedman, 2018; Crawford, 2017). The initial attempts to tackle this issue focused on measuring and mitigating gender biases in word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2017; Garg et al., 2018; Sun et al., 2019). Additionally, multiple works have explored solutions to identify race and religion bias in word embeddings (Manzini et al., 2019). Many subsequent works (May et al., 2019; Zhao et al., 2019; Kurita et al., 2019) have also focused on contextualized language representations from models like BERT for bias detection.
More recently, many datasets (Nangia et al., 2020; Sap et al., 2020) have been created to measure social biases related to gender, race, profession, religion, age, and so on in language models. Blodgett et al. (2021) report that these datasets lack clear definitions and have ambiguities and inconsistencies in annotations. Researchers have also investigated the presence of biases in models trained for various NLP tasks like machine translation (Stanovsky et al., 2019; Savoldi et al., 2021), question answering (Li et al., 2020), and coreference resolution (Webster et al., 2018).
There have been many notable efforts towards identifying data bias in the problem of hate speech and offensive language detection (Waseem and Hovy, 2016; Davidson et al., 2019; Sap et al., 2019; Mozafari et al., 2020). Borkan et al. (2019) discuss unintended bias in hate speech detection models for identity terms like Islam, lesbian, bisexual, etc. Recent studies have also investigated the usefulness of counterfactual data augmentation (Dixon et al., 2018; Nozza et al., 2019; Sahoo et al., 2022; de Vassimon Manela et al., 2021) to reduce the effect of unintended bias in these tasks.
However, most bias detection and mitigation research is in English and has focused on western culture. A few recent works have explored the issue of social bias in languages such as Arabic, Italian, Spanish, French, and Korean (Lauscher et al., 2020; Sanguinetti et al., 2020; Zhou et al., 2019; Kurpicz-Briki, 2020; Moon et al., 2020). There are very few research works tackling this challenge in the Indian context. Pujari et al. (2019) explore binary gender bias in Hindi, and Gupta et al. (2021) investigate gender bias in Hindi-English machine translation using different fairness metrics. Sambasivan et al. (2021) analyze and discuss multiple dimensions of algorithmic fairness in India. Through a detailed qualitative study, the authors suggest seven potential dimensions of algorithmic unfairness in India: Caste, Gender, Religion, Ability, Class, Sexual Orientation, and Ethnicity. Gangula et al. (2019) made available a dataset to identify bias towards political parties in Telugu. Kumar et al. (2021b) released a multilingual dataset in four languages: Hindi, Bangla, Meitei, and Indian English.
Multilingual models acquire cross-lingual knowledge through the sharing of layers that allow for the alignment of representations across languages (Wang et al., 2019). In general, low-resource languages in a multilingual framework benefit from the existence of other languages (Liu et al., 2020). Lees et al. (2020) explore multilingual transfer learning between English and Italian for hate speech and stereotype detection tasks. We study the effectiveness of multilingual transfer learning (also in the few-shot setting) for four different languages.
In the following section, we provide a comprehensive discussion of our annotated dataset, including its details, the definition of each categorical label with relevant examples, and an overview of the annotation process.

Hindi Social Bias Dataset
We have constructed this dataset with the intent of exploring identity-related social prejudices and stereotyping in Hindi on social media platforms such as Instagram, Facebook, Twitter, etc. A major part of the dataset was initially developed as part of CONSTRAINT-2021 (Bhardwaj et al., 2020), which focuses on hostility detection in low-resource regional languages. We augment this dataset with 994 more Twitter posts scraped using the Twitter API. We scraped the tweets using keywords relating to the Indian peninsula, such as adivasi, dalit, dowry, child labor, casteism, farmer-protest, muslim, islam, hindu, hinduism, article15, article370, jain, poverty, and sikh. For scraping, we use both English and the corresponding Hindi keywords. After collecting all the tweets, we use language identification models to filter Hindi tweets. We then filter the tweets based on their like and retweet counts: tweets with a minimum of 100 likes and retweets are used for annotation. The tweets were collected between January 1, 2021 and November 30, 2022.
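A minimal sketch of the filtering step described above is shown below. The paper does not name the language-identification model or the exact engagement rule, so `langdetect` and a combined likes-plus-retweets threshold are assumptions for illustration.

```python
# Hedged sketch of the tweet-filtering step: keep only Hindi tweets with
# sufficient engagement. The language-identification tool (`langdetect`) and
# the combined-engagement reading of "minimum 100 likes and retweets" are
# assumptions, not the authors' exact setup.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

KEYWORDS = ["adivasi", "dalit", "dowry", "child labor", "casteism",
            "farmer-protest", "muslim", "islam", "hindu", "hinduism",
            "article15", "article370", "jain", "poverty", "sikh"]

def keep_tweet(tweet: dict, min_engagement: int = 100) -> bool:
    """Keep a scraped tweet only if it is identified as Hindi and is
    sufficiently popular (assumed: likes + retweets >= 100)."""
    try:
        if detect(tweet["text"]) != "hi":       # Hindi (Devanagari) only
            return False
    except LangDetectException:                  # empty / undetectable text
        return False
    return tweet["like_count"] + tweet["retweet_count"] >= min_engagement

# Usage: filtered = [t for t in scraped_tweets if keep_tweet(t)]
```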
We follow the hierarchical annotation scheme of Zampieri et al. (2019) and Singh et al. (2022) and annotate each post for (1) the presence of social bias in it, (2) the category of social bias, (3) specific identity group(s) if the post holds prejudice against/towards them, and (4) the sentiment of the post. We also ask the annotators to ground their judgment by mentioning the rationale behind the social bias specified in the post.

Terminologies
Social Bias: People typically have predetermined opinions or prejudices towards others who do not belong to their social group. A social bias is a preference for or against individuals or groups based on their social identities such as religion, race, political affiliation, profession, etc. (Hammersley and Gomm, 1997). In this work, we focus on religion bias, political bias, occupation bias, and person-directed statements. The definitions of each category are as follows:
• Religion Bias: Bias towards/against persons or groups based on their religion or religious beliefs (Muralidhar, 2021), such as Christianity, Islam, Hinduism, etc. The religion against which the prejudice is aimed is the target word.
• Political Bias: A preconceived statement directed towards or against persons or groups based on their political beliefs. In India, the major political parties with various philosophies are the BJP, Congress, and Shiv Sena, among others.
• Personal Attack: Social media posts that contain biased remarks made against renowned personalities. Personal attacks also include verbal abuse, insults, or threats directed at the individual (Vidgen et al., 2021).
• Personal Favor: Favoritism towards famous individuals. There are many instances of bias towards politicians and celebrities from the entertainment industry in our dataset.
• Occupation Bias: Prejudice towards individuals on the basis of their professions. It also covers preconceived beliefs against any vocation.
• Caste Bias: Caste bias demonstrates societal injustice by highlighting caste inequalities (Sambasivan et al., 2021).
• Other Biases: We also annotate for other biases like race bias, class bias, etc. Discrimination based on economic background is referred to as class prejudice. Race bias refers to favoritism for a group of people on the basis of their dialect, color, or region. We do not evaluate these categories individually due to their marginal presence in the dataset.

Annotation Process
Given the complexity of the task, we decided to engage three specialized annotators with an understanding of Indian history, culture, and politics, rather than crowd-sourcing. Given a social media post, each annotator determines if there is prejudice against any identity, such as religion, race, or person-directed statements. If the post is labeled as biased, annotators have to annotate the bias categories and targets. For biased posts, annotators further mention the rationale behind the underlying bias in the form of free text. Finally, for each post, the annotators are also asked to provide the label for sentiment (Positive/Negative). The hierarchical annotation approach is depicted in Figure 1.
Acknowledging the difficulty of the task, we provide a detailed guideline and questionnaire set. The questionnaire set contains multiple two-choice (yes/no) questions for each categorical variable of our task. The Inter-Annotator Agreement (IAA) was calculated using Krippendorff's alpha (Krippendorff, 2011). The IAA (α) for the bias label and sentiment are 0.662 and 0.72 respectively, which shows good agreement among annotators.
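A minimal sketch of how the reported agreement could be computed with the `krippendorff` Python package follows; the paper only names the measure, not the implementation, so the package choice and the toy data are assumptions.

```python
# Hedged sketch: Krippendorff's alpha over the three annotators' binary
# bias labels, using the `krippendorff` package (an assumed tool choice).
import numpy as np
import krippendorff

# Rows = annotators, columns = posts; values are binary bias labels
# (np.nan marks a missing annotation). Toy data for illustration only.
reliability_data = np.array([
    [1, 0, 1, 1, 0, np.nan],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha (bias label): {alpha:.3f}")
```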
To obtain the gold labels, annotation discrepancies were resolved through adjudication. Figure 2 shows some sample annotations from the dataset. We discuss more details of the annotation process and guidelines in Appendix C.

Annotator Demographics and Treatment
All three annotators were trained and selected through extensive one-on-one discussions. We paid a reasonable salary to all of them for the annotations. They went through a few days of initial training during which they annotated many examples that were then validated by an expert, and any incorrect annotations were discussed with them. As there are potential negative side effects of annotating such biased and sensitive posts, we held regular discussion sessions with them to make sure they were not excessively exposed to harmful content. All the annotators were Indian females aged between 27 and 42. One of the annotators has a master's degree in computer applications; the other two have master's degrees in linguistics. The expert was an Indian female with a postgraduate degree in sociology.

Data Statistics
The final dataset contains 9154 instances, of which 2300 posts are labelled as biased and 2203 posts as positive. We divided the dataset into a 70:10:20 split for the train, validation, and test sets, ensuring a uniform distribution of each bias category in each set. The training set contains 6388 posts, whereas the validation and test sets include 901 and 1863 posts, respectively.
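The sketch below illustrates one way to produce such a category-stratified 70:10:20 split with scikit-learn; the tooling and function names are assumptions, not the authors' exact procedure.

```python
# Hedged sketch of a 70:10:20 stratified split that keeps the distribution
# of bias categories uniform across train/validation/test sets.
from sklearn.model_selection import train_test_split

def split_dataset(posts, categories, seed=42):
    """posts: list of texts; categories: bias-category label per post."""
    # First carve out the 20% test set, stratified by category.
    train_val_x, test_x, train_val_y, test_y = train_test_split(
        posts, categories, test_size=0.20, stratify=categories,
        random_state=seed)
    # Then split the remaining 80% into 70% train / 10% validation
    # (10% of the whole corpus = 1/8 of the remaining 80%).
    train_x, val_x, train_y, val_y = train_test_split(
        train_val_x, train_val_y, test_size=0.125, stratify=train_val_y,
        random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```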
The majority of the biased instances come from the religion, political, and personal attack categories. This is possibly because the major source of the dataset was BBC Hindi, i.e., social media posts related to various news articles. The two most frequent targets in the political category are the BJP and the Congress, which reflects the political affiliations among Indians. Similarly, most posts with religion bias target Hindus and Muslims, reflecting the Hindu-Muslim strife in India.

Experiments
This section describes all the experimental configurations. The training methodology is detailed in Appendix A. Our experiments focus on predicting the presence of bias and its categories at the sentence level. We investigate multilingual transfer learning to measure the extent of task generalization across languages.

Dataset: To explore multilingual transfer learning, we use our annotated dataset in Hindi and publicly available English (Nadeem et al., 2020), Italian (Sanguinetti et al., 2020), and Korean (Moon et al., 2020) datasets. The English dataset (StereoSet) has posts collected after curating a set of target terms from Wikidata triplets. For each target term, there are three associated sentences corresponding to stereotypical, anti-stereotypical, and unrelated associations. We disregard the anti-stereotype associations as many of them lack relevancy and veracity (Blodgett et al., 2021). The dataset was created to assess bias in four categories: gender, race, religion, and profession. As the entire dataset is not publicly available, we use the portion that is. The Korean dataset was constructed using comments from entertainment and news platforms. The dataset has two major bias labels: gender bias and other bias, which takes into account prejudice towards various attributes such as political affiliation, age, and religion. The Italian dataset is an expansion of an Italian hate-speech dataset that has been annotated for the existence of stereotypes towards Muslims, Roma, and immigrants. Table 1 depicts the distribution of the train, test, and validation sets across all four datasets.

Baselines
Along with random class and majority class baselines, we use Logistic Regression (LR) and Support Vector Machines (SVM) (Hearst et al., 1998) as baselines. For SVM, we experimented with different kernels and C-values. A linear kernel with a C-value of 5 performs best for the binary bias prediction task.
For the LR and SVM baselines, we experimented with TF-IDF features (for bigrams and trigrams) and features from a transformer-based model (XLM-RoBERTa). In both SVM and LR, the class weight parameter is set to balanced, allowing the model to discover appropriate weights for the imbalanced classes.
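A minimal sketch of the TF-IDF variants of these baselines is shown below, assuming scikit-learn as the implementation; the specific pipeline layout is illustrative, while the kernel, C-value, n-gram range, and class weighting follow the description above.

```python
# Hedged sketch of the LR and SVM baselines with TF-IDF bigram/trigram
# features, linear kernel, C = 5, and balanced class weights.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

svm_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(2, 3))),   # bigrams and trigrams
    ("clf", SVC(kernel="linear", C=5, class_weight="balanced")),
])

lr_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(2, 3))),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Usage: svm_baseline.fit(train_texts, train_labels)
#        preds = svm_baseline.predict(test_texts)
```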

Mono-lingual models
We investigated two well-known multilingual pre-trained language models, m-BERT (Devlin et al., 2018) and XLM-R (Conneau et al., 2019). As we aim to compare and study four different languages in a single framework, we did not use any language-specific transformer models like KoBERT, indic-BERT, etc. We fine-tune each model on the supervised datasets and use a fully connected layer on top of each language model to obtain two outputs (for binary bias prediction). We refer to this as monolingual fine-tuning. The best performance is reported based on the macro-F1 score on the test set with tuned hyperparameters (refer to Appendix B). Table 2 shows the results of all the fine-tuned multilingual models when tested on in-domain data (e.g., testing Hindi data on the model trained using the Hindi train set). XLM-R outperforms mBERT for all four languages. As a result, we only use XLM-R for all other experiments. We refer to these monolingual models as XLM_L, where L can be one among ENG, HI, KOR, and IT.
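The following is a minimal sketch of monolingual fine-tuning with Hugging Face Transformers (an assumed tooling choice); `hindi_train` and `hindi_val` are placeholder `datasets.Dataset` objects with "text" and "label" columns, and the specific learning rate and batch size are one point from the search space reported in the appendix.

```python
# Hedged sketch of monolingual fine-tuning: XLM-R plus a fully connected
# classification head with two outputs (binary bias prediction).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)   # linear head -> 2 outputs

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(output_dir="xlm_hi", num_train_epochs=5,
                         learning_rate=2e-5, per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=hindi_train.map(tokenize, batched=True),
                  eval_dataset=hindi_val.map(tokenize, batched=True),
                  tokenizer=tokenizer)
trainer.train()    # produces the monolingual model XLM_HI
```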

Multi-lingual transfer learning
We investigate multilingual transfer learning (MTL) to determine how successfully training can be transferred from one language to another. In Table 4, we show the results of zero-shot bias detection (direct inference) on the target language, as well as the performance improvements after sequential fine-tuning of the model on the target language. In the sequential fine-tuning step, we continue fine-tuning the source-language models using the target language. We refer to these multilingual models as XLM_S_L, where both S and L can be one among ENG, HI, KOR, and IT; S and L cannot be the same.
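A minimal sketch of the zero-shot and sequential fine-tuning steps follows; the checkpoint path and dataset names are illustrative placeholders, and a macro-F1 `compute_metrics` function (omitted for brevity) would be supplied to the trainer in practice.

```python
# Hedged sketch of multilingual transfer learning (XLM_S_L): load the
# checkpoint fine-tuned on the source language S, evaluate it zero-shot on
# the target language L, then continue fine-tuning on L's training data.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Source-language model, e.g. XLM_ENG fine-tuned on the English dataset
# ("xlm_eng/checkpoint-best" is a placeholder path).
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm_eng/checkpoint-best", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlm_eng_hi", num_train_epochs=5,
                           learning_rate=2e-5),
    train_dataset=hindi_train.map(tokenize, batched=True),
    eval_dataset=hindi_test.map(tokenize, batched=True),
    tokenizer=tokenizer)

print(trainer.evaluate())   # zero-shot (direct inference) on the target language
trainer.train()             # sequential fine-tuning -> XLM_ENG_HI
print(trainer.evaluate())   # performance after sequential fine-tuning
```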

MTL based on Translation
For this study, we translate all the non-English datasets into English using the Google Translate API. As abundant resources (datasets and models) are already available for English, a common approach is to translate the text into English and then perform classification. We investigate the effectiveness of this approach for bias detection using the Hindi, Italian, and Korean datasets. Similar to the previous approach, we perform both zero-shot inference and sequential fine-tuning on the translated datasets and report the results in Table 5.

Table 4: Comparison of monolingual fine-tuning vs. multilingual fine-tuning for all datasets. The four source-language models, XLM_ENG, XLM_HI, XLM_IT and XLM_KOR, are the XLM-R models fine-tuned on the English, Hindi, Italian and Korean datasets, respectively. The last four columns correspond to sequential fine-tuning of all datasets using the source-language models. Best F1-scores are shown in bold.
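A minimal sketch of the translation step is given below, using the Google Cloud Translation client as one possible interface to the Google Translate API the paper mentions; credential setup is omitted and the function name is an illustrative assumption.

```python
# Hedged sketch of translating non-English posts to English before
# classification, via the Google Cloud Translation v2 client.
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_english(texts, source_lang="hi"):
    """Translate a list of non-English posts (e.g. Hindi) to English."""
    results = client.translate(texts, source_language=source_lang,
                               target_language="en")
    return [r["translatedText"] for r in results]

# The translated posts are then used like the English dataset: zero-shot
# inference with XLM_ENG, or sequential fine-tuning on the translated
# training set.
```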

Few-shot MTL
The creation of datasets for culture-specific social bias detection is very time-consuming and expensive. To tackle this issue, we explore multilingual transfer learning in a few-shot setting. We evaluate the performance using the XLM_L and XLM_ENG_L models. For fine-tuning, we use a few examples (represented as N) from the training sets of the Hindi (HI), Italian (IT), and Korean (KOR) languages. We use the following values of N: 25, 50, 100, 200, 400, 800, 1600. We randomly sample an equal number of instances from both the neutral and bias classes of the three datasets. We repeat this experiment five times for each N to report (Table 6) the mean and standard deviation of macro F1-scores and to plot the 95% confidence interval in Figure 3.
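The few-shot sampling protocol can be summarized in the sketch below; `fine_tune_and_evaluate` is a placeholder for the fine-tuning routine described earlier, and drawing N//2 examples per class is an assumption about how "equal numbers" is realized when N is odd.

```python
# Hedged sketch of the few-shot protocol: for each N, sample equal numbers of
# neutral and biased training instances, fine-tune, and aggregate macro F1
# over five repetitions.
import random
import statistics

def sample_few_shot(train_set, n, seed):
    """train_set: list of (text, label) pairs with binary labels {0, 1}."""
    rng = random.Random(seed)
    neutral = [ex for ex in train_set if ex[1] == 0]
    biased  = [ex for ex in train_set if ex[1] == 1]
    # N // 2 from each class (approximately equal when N is odd).
    return rng.sample(neutral, n // 2) + rng.sample(biased, n // 2)

def few_shot_scores(train_set, test_set, n, repeats=5):
    scores = [fine_tune_and_evaluate(sample_few_shot(train_set, n, seed=r),
                                     test_set)
              for r in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)

for n in [25, 50, 100, 200, 400, 800, 1600]:
    mean_f1, std_f1 = few_shot_scores(hindi_train, hindi_test, n)
    print(f"N={n}: macro-F1 {mean_f1:.1f} ± {std_f1:.1f}")
```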

Results and Analysis
Table 2 shows how both the mBERT and XLM-R models performed on the Hindi, English, Italian, and Korean datasets. XLM-R performs better on all four datasets (macro-F1 of 77.9, 95.8, 75.2, and 74.1 respectively on the Hindi, English, Italian, and Korean datasets). The English dataset used for experimental purposes is balanced between the bias and neutral classes and was created following templates; it is not scraped from any social media platform. On the other hand, the Korean, Italian, and Hindi datasets are scraped from the respective social media platforms; they have more natural and longer sentences compared to the English dataset. Due to this, there are human errors (grammatical, spelling, syntactical, and pragmatic errors) and convoluted constructions in the datasets other than the English dataset. Both models perform best on the English dataset, possibly because the English dataset is less complex and balanced compared to the other datasets. As XLM-R consistently performs better for all languages, we use XLM-R for all other experiments.
Table 3 shows the comparison among all baseline models on the Hindi dataset for bias prediction. For both LR and SVM, the features extracted from the XLM-R model work better than TF-IDF features. However, fine-tuning the XLM-R model on the Hindi dataset gives the best performance of 77.9 macro-F1.
Table 4 shows the performance of all the multilingual transfer learning experiments. In the zero-shot setting (direct inference), there is very poor knowledge transfer between English and any other dataset. However, Hindi performs decently on the XLM_IT and XLM_KOR models in the zero-shot setting. This is because the English dataset is template-based, while the other three datasets are annotated from social media comments. We also show that all the models perform better after fine-tuning them with the training set of the target language. The hypothesis is that the source-language models trained on monolingual data provide a better initialization for multilingual fine-tuning. Multilingual fine-tuning using Hindi data performs better than monolingual fine-tuning (macro F1 of 77.9) for every source-language model. The Korean dataset performs best when the ENG model is used as the base model. The Italian dataset performs best when the HI model is used as the base model, and vice versa. In general, multilingual fine-tuning outperforms monolingual fine-tuning across languages for bias detection.
Hindi also performs well on the English base model due to a good overlap of religion, occupation, and race biases in both datasets. The Korean dataset mostly has gender-bias instances, along with other biases like political affiliation, religion, race, etc. Only the English dataset has significant instances of gender bias in addition to Korean. This improves the performance of the Korean dataset when English is used as the base model. Due to category overlap between the two datasets, such as political affiliation, religion, and race, the Korean dataset also performs well (macro F1 of 74.5) when Hindi is used as the base model. When measured using the XLM-R model, the average perplexities of the English, Hindi, Italian, and Korean datasets are 77.05, 85.02, 103.23, and 145.57, respectively. From the perplexity scores and the performance of the monolingual models, it is evident that the Korean dataset is complex in nature, and the gain in performance of the multilingual model for the Korean dataset can be attributed to the learning from the source language (English or Hindi).
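Since XLM-R is a masked language model, one way to obtain such per-dataset perplexities is the pseudo-perplexity formulation (mask each token in turn and score it); the paper does not specify its exact procedure, so the sketch below is an assumption.

```python
# Hedged sketch: pseudo-perplexity of a sentence under XLM-R, one possible
# way to compute the dataset-level perplexities reported above.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def pseudo_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    positions = range(1, len(ids) - 1)          # skip <s> and </s>
    if len(positions) == 0:
        return float("nan")
    nll = 0.0
    for pos in positions:
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id   # mask one token at a time
        logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[ids[pos]].item()
    return math.exp(nll / len(positions))

# Dataset-level perplexity is then the average over all posts in the dataset.
```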
The Italian dataset has a higher percentage of religion biases than race biases, and both the Italian and Hindi datasets were gathered from their respective social media platforms. The English dataset was derived from a wiki corpus, whereas the Korean dataset was derived from news articles. As a result, the Hindi and Italian datasets help each other.
The results of the multilingual few-shot experiments are shown in Figure 3 and Table 6 (Appendix). When fine-tuned on XLM_L, we can attain an F1-score of at least 70 utilising ∼350 training instances for the Hindi dataset. However, we need only ∼150 instances to achieve a similar F1 score when fine-tuning the XLM_ENG_L model. This behaviour is also observed for the Korean and Italian datasets. Furthermore, for all values of N, multilingual few-shot fine-tuning (sequential fine-tuning) performs better than monolingual fine-tuning for all three languages. When the target-language data is limited, there is considerable benefit accruing from an initial round of fine-tuning using English data.
However, the improvement in model performance diminishes as the amount of target-language fine-tuning data grows: the marginal gains of increasing N decline sharply across models and target-language test sets.
For example, for Hindi, XLM_L improves by 32 macro-F1 points from N = 25 to N = 400, and by just 4 points from N = 400 to N = 1600. A similar trend is observed for the XLM_ENG_L model.
Why can translations not be used directly? One fundamental question is whether we can utilise publicly accessible English datasets to forecast bias in datasets in other languages by translating them into English. Table 5 shows that zero-shot inference using an English translation of the Hindi dataset yields a macro-F1 of 56.6, much lower than the highest F1 score of 80.8. The trend is similar for the Italian and Korean datasets. One leading cause might be the loss of meaning following translation. Another possibility is that present translation algorithms are incapable of correctly interpreting the region-specific slur phrases mentioned in the original text. Interestingly, even when the translation is correct, the translated English phrase may occur with low frequency, attenuating its influence. The findings from the XLM_L-Tr and XLM_ENG_L-Tr combinations in Table 5 support both of these interpretations. Even after fine-tuning the XLM or ENG base model, the results are still subpar on certain datasets. Also, mismatches in bias categories can contribute to poor generalisation.

Conclusion and Future Work
We present a comprehensive dataset of ∼9K Hindi posts with multiple annotations: social bias and its categories, the sentiment of the post, the target group, and the rationale for the bias in the post. We demonstrate the capability of multilingual transfer learning using our dataset and publicly available English, Italian, and Korean datasets. Multilingual fine-tuning (sequential fine-tuning) is found to be effective for the Hindi, Italian, and Korean datasets, in the sense of reducing data requirements for a given performance level, or increasing the performance level for a fixed amount of data. Our results show that irrespective of the language family (we have dealt with the Indo-European and Altaic families here), the bias detection task benefits from multilingual sequential fine-tuning. Using few-shot experiments, we show that only a small amount of target-language fine-tuning data is required to achieve strong performance, and that initial fine-tuning on English data can reduce the data requirement. We report benchmarks on our dataset for the bias detection task in four languages. We plan to investigate the effect of multi-task training, for example, bias-and-sentiment, bias-and-explanation, and so on.

Acknowledgements
We would like to thank the anonymous reviewers as well as the ACL action editors. Their insightful comments helped us improve the current version of the paper. Additionally, we would like to thank Manisha, Rashmi, and Sandhya Singh for their contributions to data annotation and useful comments.

Ethics Statement
Our work aims at capturing various social biases in Hindi social media posts and demonstrates the annotation quality on biases in one of the existing datasets. We briefly discuss the annotation guidelines given to the annotators for the task. Studies of social biases also come with ethical concerns about risks in deployment (Ullmann and Tomalin, 2020). As such biased news articles or social media posts can potentially harm any user or community, it is necessary to conduct this kind of research to detect them. If done with precautions, such research can be quite helpful in automatically flagging users and news firms creating such content.
Researchers working on the problem of social bias detection in any form of text would benefit from the dataset we have collated and from the inferences we drew from multiple training strategies.

Limitations
The most notable limitation of our work is the lack of external context. Consideration of external contexts that may be relevant for the classification task, such as the profile bio, user gender, post history, and the current and past political scenarios of the concerned region, might prove beneficial for results in this field. Our research currently focuses on only six types of social biases rather than all conceivable forms of prejudice. We also focused on Hindi, English, Korean, and Italian in our study, and the Hindi dataset is primarily from the Indian context. This limited scope can be extended, building on the experiments presented here, to serve a wider range of audiences by covering bias-annotated datasets pertaining to other low-resource languages. We show the effectiveness of few-shot transfer learning using language models with relatively fewer parameters compared to recent state-of-the-art language models.

Appendix A: Training Methodology
Let $D_{train} = \{(x_i, y_i)\}_{i=1}^{N_{train}}$ be a training dataset with $N_{train}$ examples, where $y_i$ is the ground-truth label for the $i$-th training instance. Further, let $D_{test} = \{(x_j, y_j)\}_{j=1}^{N_{test}}$ be the test dataset. Given a sequence of words $x = \{w_t\}_{t=1}^{T}$ and the corresponding target $y$, where $T$ is the length of sequence $x$, we encode the input instance using model $M$. For logistic regression and SVM, the encoding $e$ is the TF-IDF vector corresponding to the input $x$. For transformer-based models, we first tokenize the input $x$ into subword tokens $s = \{s_t\}_{t=1}^{T'}$, where $T'$ is the number of subword tokens corresponding to the input $x$.
We then feed "[CLS] $s$ [SEP]" as input to the transformer encoder and obtain a $d_h$-dimensional hidden representation $h$ for each input instance. Here, $h$ is the embedding corresponding to the [CLS] token of the final layer of the transformer. For the training set, the hidden representations can be written as $H = \{h_i\}_{i=1}^{N_{train}}$.
$$\hat{y} = \mathrm{softmax}(W h + b)$$
The final hidden representation $h$ is fed into a linear layer, followed by a softmax function, to generate the predicted label distribution $\hat{y} \in \mathbb{R}^{C}$ for the bias detection or bias category detection task. $C$ is two for the bias detection task and six for bias category detection. $W \in \mathbb{R}^{C \times d_h}$ and $b \in \mathbb{R}^{C}$ are trainable parameters, along with the internal parameters of the transformer. We use the cross-entropy loss between the ground-truth label $y_i$ and the predicted label $\hat{y}_i$ for each instance $i$ to train the classifier.
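A minimal PyTorch sketch of this classification head follows: the [CLS] representation $h$ is passed through a linear layer $Wh + b$, and the model is trained with cross-entropy loss (the softmax is folded into the loss). Class and variable names are illustrative, not the authors' code.

```python
# Hedged sketch of the classification head described in Appendix A.
import torch
import torch.nn as nn
from transformers import AutoModel

class BiasClassifier(nn.Module):
    def __init__(self, num_labels=2, encoder_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(0.1)
        # W in R^{C x d_h}, b in R^C; C = 2 for bias detection, 6 for categories.
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0]          # [CLS]-token embedding
        logits = self.linear(self.dropout(h))    # W h + b
        return logits                            # softmax applied inside the loss

loss_fn = nn.CrossEntropyLoss()                  # cross-entropy between y and y-hat
# loss = loss_fn(model(input_ids, attention_mask), labels)
```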

Figure 2: Examples of social media posts in our dataset, along with annotations.

Figure 3: Macro F1 scores on the test sets of the three target languages Hindi, Korean, and Italian for different values of N, the number of training examples in the few-shot setting. The label XLM_L represents the monolingual fine-tuning of XLM-R with the data of a target language L (Hindi/Korean/Italian; call this L-pretraining). XLM_ENG_L, on the other hand, represents sequential fine-tuning, first with ENG data and then with L data. Notice the impact of sequential pre-training: given a desired F1-score, the data requirement reduces compared to L-pretraining, and given a fixed amount of training data, the F1-score is pushed up. F1 scores for all values of N are listed in Appendix D (Table 6).
Appendix B: Hyperparameter Details
We fine-tune all the multilingual models for five epochs. A maximum token length of 128 is used. We also use a dropout layer in our model. We use the Adam optimizer and experiment with different learning rates (1e-05, 2e-05, 3e-05, 4e-05, 5e-05) and different batch sizes (8, 16, 32); epsilon = 1e-08, weight decay = 0.01, and clipnorm = 1.0 were used.
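The hyperparameter grid above can be expressed compactly as in the sketch below; the grid-construction harness and Hugging Face `TrainingArguments` mapping are assumptions for illustration.

```python
# Hedged sketch of the hyperparameter grid: Adam, learning rates 1e-5..5e-5,
# batch sizes 8/16/32, epsilon 1e-8, weight decay 0.01, gradient clipping 1.0,
# 5 epochs, max token length 128 (applied at tokenization time).
from itertools import product
from transformers import TrainingArguments

LEARNING_RATES = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5]
BATCH_SIZES = [8, 16, 32]

def make_args(lr, batch_size, run_dir):
    return TrainingArguments(
        output_dir=run_dir,
        num_train_epochs=5,
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        adam_epsilon=1e-8,
        weight_decay=0.01,
        max_grad_norm=1.0,      # clipnorm
    )

configs = [make_args(lr, bs, f"runs/lr{lr}_bs{bs}")
           for lr, bs in product(LEARNING_RATES, BATCH_SIZES)]
```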

Table 1: Statistics of all datasets used for experiments. The percentage of biased instances in each training dataset is shown in brackets.

Table 2: Results of different multilingual models on in-domain datasets. The first column reflects the language used to train and test the models. The top performances among models are in bold.

Table 3: Comparison of all baseline models on the Hindi dataset for the bias prediction task.