Towards more equitable question answering systems: How much more data do you need?

Question answering (QA) in English has been widely explored, but multilingual datasets are relatively new, with several methods attempting to bridge the gap between high- and low-resourced languages using data augmentation through translation and cross-lingual transfer. In this project we take a step back and study which approaches allow us to make the most of existing resources in order to produce QA systems in many languages. Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs. In addition, we make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with the goal of increasing the language coverage of QA datasets and systems.


Introduction
Automatic question answering (QA) systems are showing increasing promise that they can fulfil the information needs of everyday users via information-seeking interactions with virtual assistants. The research community, having realized the obvious needs and potential positive impact, has produced several datasets on information-seeking QA. The effort initially focused solely on English, with datasets like WikiQA (Yang et al., 2015), MS MARCO (Nguyen et al., 2016), SQuAD (Rajpurkar et al., 2016), QuAC (Choi et al., 2018), CoQA (Reddy et al., 2019), and Natural Questions (NQ) (Kwiatkowski et al., 2019), among others. More recently, heeding calls for linguistic and typological diversity in natural language processing research (Joshi et al., 2020), larger efforts have produced datasets in multiple languages, such as TyDi QA (Clark et al., 2020), XQuAD (Artetxe et al., 2020), or MLQA (Lewis et al., 2020).1 Despite these efforts, the linguistic and typological coverage of question answering datasets is far behind the world's diversity. For example, while TyDi QA includes 11 languages (less than 0.2% of the world's approximately 6,500 languages (Hammarström, 2015)) from 9 language families, its typological diversity is 0.41, evaluated in a [0,1] range with the measure defined by Ponti et al. (2020); MLQA provides data in 7 languages from 4 families, for a typological diversity of 0.32. The total population coverage of TyDi QA, based on population estimates from Glottolog (Nordhoff and Hammarström, 2012), is less than 20% of the world's population (the TyDi QA languages total around 1.45 billion speakers).

1 Code and data for reproducing our experiments are available at https://github.com/NavidRajabi/EMQA.
Obviously, the ideal solution to this issue would be to collect enough data in every language. Unfortunately, this ideal seems unattainable at the moment. In this work, we perform extensive analysis to investigate the next-best solution: using the existing resources, large multilingual pre-trained models, data augmentation, and cross-lingual learning to improve performance with just a few or no training examples. Specifically:
• we study how much worse a multilingual few-shot training setting performs compared to training on large training datasets,
• we show how data augmentation through translation can reduce the performance gap in the few-shot setting, and
• we study the effect of different fixed-budget allocations for training data creation across languages, making suggestions for future dataset creators.

Problem Description and Settings
We focus on the task of simplified minimal answer span selection over a gold passage: The inputs to the model include the full text of an article (the passage or context) and the text of a question (query). The goal is to return the start and end byte indices of the minimal span that completely answers the question.
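As a concrete illustration of the task's input/output format, the following is a minimal sketch of locating the byte indices of a gold answer inside a passage (the function name and example are ours, not from the released code; note that the indices are byte offsets, which differ from character offsets for non-ASCII scripts):

```python
def find_answer_span(context: str, answer: str) -> tuple[int, int]:
    """Return (start, end) byte indices of the minimal span in `context`
    that exactly matches `answer`, or (-1, -1) if no match exists."""
    context_bytes = context.encode("utf-8")
    answer_bytes = answer.encode("utf-8")
    start = context_bytes.find(answer_bytes)
    if start == -1:
        return (-1, -1)
    return (start, start + len(answer_bytes))

# Byte (not character) indices matter for non-ASCII text: "ä" is 2 bytes.
context = "Helsinki on Suomen pääkaupunki."
start, end = find_answer_span(context, "pääkaupunki")
span = context.encode("utf-8")[start:end].decode("utf-8")
```

A model for this task then only needs to predict the `start` and `end` positions, rather than generate answer text.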
Our models follow the current state of the art in extractive question answering, relying on large multilingually pre-trained language models (in our case, multilingual BERT (Devlin et al., 2019)) and the task-tuning strategy of Alberti et al. (2019), which outperforms approaches like DocumentQA (Clark and Gardner, 2018) or decomposable attention (Parikh et al., 2016). In all cases, we treat the official TyDi QA development set as our test set, since the official test set is not public. We provide concrete details (model cards, hyperparameters, etc.) on our model and training/fine-tuning regime in Appendix A.
To simulate the scenario of data-scarce adaptation of such a model to unseen languages, we treat the TyDi QA languages as our unseen test languages. We assume that we have access to (a) other QA datasets in more resource-rich languages (in particular, the SQuAD dataset, which provides training data in English), and (b) translation models between the languages of existing datasets (again, English) and our target "unseen" languages.
In the experiment sections, we first focus on few- and zero-shot experiments (§3) and then study the effects of language selection and budget-restricted decisions on training data creation (§4).
Evaluation We report F1 score on the test set of each language, as well as a macro-average excluding English (avg_L). In addition, to measure the expected impact on actual systems' users, we follow
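For reference, the macro-average excluding English can be computed as below (a simple stdlib sketch; the per-language F1 values are made-up placeholders, not results from our experiments):

```python
from statistics import mean

def macro_average_f1(f1_by_language: dict[str, float],
                     exclude: tuple = ("english",)) -> float:
    """Unweighted mean of per-language F1 scores, skipping excluded languages."""
    scores = [f1 for lang, f1 in f1_by_language.items() if lang not in exclude]
    return mean(scores)

# Hypothetical per-language F1 scores, for illustration only.
f1_scores = {"english": 75.0, "finnish": 70.0, "russian": 68.0, "telugu": 62.0}
avg_l = macro_average_f1(f1_scores)  # averages only the three non-English scores
```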

Is Few-Shot a Viable Solution?
We first set out to explore the effect of the amount of available data on downstream performance. Starting with baselines relying solely on the English-only SQuAD dataset, we implement a few-shot setting for fine-tuning on the target languages of TyDi QA. To our knowledge, this is the first study of its type on the TyDi QA benchmark.
The straightforward baseline simply provides zero-shot results on TyDi QA after training only on English. Table 1 provides our (improved) reproduction of the baseline experiments of the TyDi QA paper (Clark et al., 2020). The skyline results (bottom of Table 1) reflect the presumably best possible results under our current modeling approach, which trains jointly on all languages using all available TyDi QA training data. We note that for most languages the gap between the baseline and the skyline is more than 20 percentage points, with the exception of English where, unsurprisingly, there is a difference of only 3.3 percentage points. Among the other languages, the performance gap is smallest for Russian (rus) at 10.9 percentage points, and largest for Telugu (tel) at 34 points.
We first study a monolingual few-shot setting. That is, we fine-tune the model trained on the English SQuAD dataset with only a small amount of data (10, 20, or 50 training instances) in the test language. Due to space limitations, we only present results with 50 examples per language in Table 1, but the full experiments are available in Appendix C. We observe that even just 50 additional training instances are enough for significant improvements, which are consistent across all languages. For example, the improvement in Finnish (fin) exceeds 15 percentage points and closes more than 60% of the performance gap between the baseline and the skyline.
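The gap-closure fractions reported throughout can be computed as follows (the helper function is ours; the numbers are an illustrative 25-point gap, not exact values from Table 1):

```python
def gap_closed(baseline: float, fewshot: float, skyline: float) -> float:
    """Fraction of the baseline-to-skyline performance gap recovered
    by the few-shot model (1.0 means the skyline is fully matched)."""
    return (fewshot - baseline) / (skyline - baseline)

# Hypothetical illustration: a 25-point gap, of which few-shot recovers 15.5 points.
fraction = gap_closed(baseline=55.0, fewshot=70.5, skyline=80.0)
```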
We now turn to a multilingual few-shot setting. Exactly as before, we assume a scenario where we only have access to a small amount of data in each language, but now we fine-tune using that small amount of data in all languages. For example, 10 training instances in each language result in training with 90 training examples over the 9 test languages. A sample of our experimental results is presented in Table 1 under "multilingual few-shot," with complete results in Appendix C.
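Building the multilingual few-shot training pool amounts to sampling a fixed number of instances per language and pooling them, which can be sketched as below (a stdlib sketch with made-up data; the function and variable names are ours):

```python
import random

def build_fewshot_pool(train_by_language: dict, k: int, seed: int = 0) -> list:
    """Sample k training instances from each language and pool them
    into a single shuffled multilingual training set."""
    rng = random.Random(seed)
    pool = []
    for lang, examples in sorted(train_by_language.items()):
        pool.extend(rng.sample(examples, k))
    rng.shuffle(pool)
    return pool

# With 9 languages and k=10, the pool has 90 examples, as in the text.
data = {f"lang{i}": [f"ex{i}_{j}" for j in range(100)] for i in range(9)}
pool = build_fewshot_pool(data, k=10)
```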
By simply adding 50 instances from each language, we obtain an F1 score of 67.9, an improvement of almost 7 percentage points over the zero-shot baseline that reduces the zero-to-full gap by 43.4%. We note that the total of 450 training instances represents less than 1% of the full TyDi QA training set! Doubling the amount of data to 100 examples per language further increases downstream performance to an overall average F1 score of 71.7. Going further and adding 500 training instances per language (for a total of 4,500 examples) leads to even larger improvements, for an average F1 score of 76.7. That is, using less than 10% of the available training data, we can reduce the average F1 performance gap by more than 82%. For a few languages the gap reduction is even more notable, e.g., more than 92% for Finnish.

Results
Data Augmentation through Translation Generating translations of an English dataset to train systems in other languages has a long history and has been successful in the QA context as well (Yarowsky et al., 2001; Xue et al., 2020, inter alia). We follow the same approach, translating all SQuAD paragraphs, questions, and answers to all TyDi QA languages using Google Translate. For each language, we keep only the question-answer pairs where the translated answer has an exact match in the translated paragraph (between 20% and 50% of the pairs, depending on the language); the matched text becomes the target span. Details of the resulting dataset (which we refer to as tSQuAD) are in Table 3 in Appendix B. A second approach translates the question of a training instance into another language, but keeps the answer and context in the original language. The result is a modified training set (which we name mSQuAD) that requires better cross-lingual modeling, as the questions and contexts are in different languages. Both approaches improve over the zero-shot baseline, with F1 scores of 61.4 (+3) and 66.7 (+8), respectively. Notably, though, they are not as effective as few-shot training even with just 50 instances per language. This further strengthens the discussion of Clark et al. (2020) on the qualitative differences between the SQuAD and TyDi QA datasets. Nevertheless, combining tSQuAD (or mSQuAD) with a few examples from the TyDi QA dataset leads to our best-performing methods. In particular, augmentation through translation leads to a 1-2 percentage point improvement over the multilingual few-shot approach (cf. 76.7 to 78.1/78.7 F1 score in Table 1).

Table 2: A more egalitarian budget allocation leads to better and more equitable performance across languages (avg±std: higher average, lower standard deviation), reducing the gap (∆_l) between the best- and worst-performing languages.
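The exact-match filtering used to build tSQuAD can be sketched as follows (`translate` is a stand-in for the Google Translate call, here mocked with an identity function; all names and the toy data are ours):

```python
def build_tsquad(instances, translate, target_lang):
    """Translate each instance and keep it only if the translated answer
    appears verbatim in the translated context; record the matched span."""
    kept = []
    for inst in instances:
        context = translate(inst["context"], target_lang)
        question = translate(inst["question"], target_lang)
        answer = translate(inst["answer"], target_lang)
        start = context.find(answer)
        if start == -1:
            continue  # translation broke context/answer alignment; drop it
        kept.append({
            "context": context,
            "question": question,
            "answer": answer,
            "answer_start": start,
        })
    return kept

# Toy illustration with an identity "translator"; the second instance's
# answer does not exact-match the context (case mismatch) and is dropped.
toy = [{"context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answer": "Paris"},
       {"context": "Berlin is the capital of Germany.",
        "question": "What is the capital of Germany?",
        "answer": "BERLIN"}]
tsquad = build_tsquad(toy, translate=lambda text, lang: text, target_lang="fi")
```

The mSQuAD variant would apply `translate` only to the question, leaving the context and answer untouched.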

How to Spend the Annotation Budget?
In the previous section we showed that the combination of data augmentation techniques with a few new annotations can reach almost 98% of the performance one would obtain by training on 10x more data. In this section we explore how one should allocate a fixed annotation budget in order to achieve not only higher average but also more equitable performance across languages. Keeping our budget fixed at 4,500 instances, we study 3 scenarios. The first is monolingual allocation, where the whole budget is consumed by collecting training examples in a single language. We repeat the study over all 8 languages of our test set, randomly sampling training instances from the TyDi QA training set. Second, we study a tri-lingual budget allocation scheme, where we split the budget equally across 3 languages, for 1,500 training instances per language. We repeat this experiment 7 times, each time randomly selecting 3 languages. Last, the third and most egalitarian scenario splits the budget equally across all 8 languages, matching our previously analyzed few-shot scenario where we only have 500 additional training examples per language. In all experiments, we use our best-performing approach from the previous section, also utilizing tSQuAD for pre-training.
Our findings are summarized in Table 2. For the repeated monolingual and tri-lingual scenarios we report average performance across our experiment repetitions (full results in Appendix E). We can conclusively claim that a uniform budget allocation leads to not only better average performance, but also to more equitable performance. We report two straightforward measures for the equitability of the average accuracy across languages. First, we report the standard deviation of the accuracy across languages; the lower the standard deviation, the more equitable the performance. We also report the difference between the best and the worst performing language for each experiment, as well as the averages for the languages that are seen and unseen during fine-tuning.
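Both equitability measures are straightforward to compute from per-language scores, as in this stdlib sketch (the two score dictionaries are made-up placeholders contrasting a skewed and a uniform allocation, not values from Table 2):

```python
from statistics import mean, stdev

def equitability(f1_by_language: dict) -> dict:
    """Average, standard deviation, and best-worst gap of per-language F1."""
    scores = list(f1_by_language.values())
    return {
        "avg": mean(scores),
        "std": stdev(scores),              # lower = more equitable
        "gap": max(scores) - min(scores),  # best-to-worst spread
    }

# Hypothetical scores: a skewed allocation vs. a uniform one.
skewed = {"fin": 82.0, "rus": 80.0, "tel": 50.0, "swa": 55.0}
uniform = {"fin": 78.0, "rus": 77.0, "tel": 70.0, "swa": 72.0}
```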
Having no budget for additional annotation (essentially, attempting the task in zero-shot fashion) leads to the most inequitable performance. The monolingual scenario typically leads to the highest accuracy when evaluating on the same language as the new training examples (the ideal section of Table 2), but the zero-shot performance on all other languages is generally significantly worse, leading to inequity. The tri-lingual scenarios follow similar patterns, with performance close to state-of-the-art for the four languages (three plus English) that have been included in the fine-tuning process, but with the rest of the languages lagging behind: the difference between seen and unseen languages is on average 10.4 points. In our experiments we randomly sampled (without replacement) three of the seven languages, but one could potentially use heuristics or a meta-model like that of Xia et al. (2020) to find or suggest the best subset of candidate languages for transfer learning; we leave such an investigation for future work.
Encouragingly, the uniform budget allocation scenario leads to higher average performance, while also reducing the gap between the worst and best performing languages from around 30 percentage points to less than 12 points (a 60% reduction). Note that an 8x larger budget (the ideal scenario), with 4,500 instances per language, would further improve downstream accuracy and equitability. Note also that in this case, where some resources are available, simple multilingual fine-tuning might not be the best approach for some languages, e.g. compared to monolingual fine-tuning or meta-learning approaches (Wang et al., 2020; Muller et al., 2021, inter alia). We leave an investigation of such settings for future work.

Discussion
We show that data augmentation through translation along with few-shot fine-tuning on new languages with a uniform budget allocation leads to a performance close to 98% of an approach using 10x more data, while producing more equitable models than other budget-constrained alternatives.
The implications of our findings become clear with a counterfactual exploration. The Gold Passage portion of the TyDi QA dataset includes around 87,000 annotated examples (50k for training across 9 languages and about 37k development and test samples). Consider the scenario where, given this annotation budget, we maintain the same evaluation standards, collecting 4k development and test examples per language, but only collect 500 training examples per language. In that case, we could have created a much more diverse resource covering at least 19 languages! Now consider the expectation of the downstream accuracy in our counterfactual scenario: uniform budget allocation over 19 languages would lead to an average accuracy (F1 score) of around 78% (similar to our experiments). Instead, under the (currently factual) scenario where we only have training data for 9 languages, the average accuracy for these 9 languages is around 80%, but the expected zero-shot average on the other 10 languages is 10 points worse; in that case, the overall average accuracy would be around 74%, 4 points lower than that of the egalitarian allocation scenario. Hence, as long as the ideal scenario of collecting a lot of data for a lot of languages remains infeasible, we suggest that the community put an additional focus on the linguistic diversity of our evaluation sets and use other techniques to address the lack of training data.
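The counterfactual arithmetic above can be checked directly from the budget figures in the text (the seen/unseen averages of 80 and 70 F1 are the approximations stated above, not exact measurements):

```python
# Total TyDi QA Gold Passage annotation budget, from the text.
total_budget = 87_000
dev_test_per_language = 4_000  # keep the same evaluation standards
train_per_language = 500       # uniform few-shot training allocation

cost_per_language = dev_test_per_language + train_per_language
languages_covered = total_budget // cost_per_language  # 19 languages

# Expected overall averages under the two scenarios discussed in the text:
egalitarian_avg = 78.0                     # uniform allocation over 19 languages
factual_avg = (9 * 80.0 + 10 * 70.0) / 19  # 9 seen at ~80 F1, 10 unseen 10 points lower
```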

A Experimental Settings
For our experiments, we used "bert-base-multilingual-uncased" (mBERT) (Hugging Face - mBERT, 2020), the main baseline in the TyDi QA paper (Clark et al., 2020). It is pre-trained on the 102 languages with the largest Wikipedias, using a masked language modeling (MLM) objective (Devlin et al., 2019). From preliminary experiments, we found that the best trade-off between F1 score and computational cost is achieved by training for 3 epochs with a batch size of 24 and a learning rate of 3e-5, so we applied these hyperparameter settings in all our experiments. The main script we used was the run_squad module of the Huggingface library (Wolf et al., 2020), which is widely used for fine-tuning transformers on multilingual question answering datasets.

B SQuAD Translation Details
We augmented the English SQuAD with translated SQuAD (tSQuAD) instances for each language. Here, the contexts, questions, and answers from SQuAD instances are translated into the target languages using Google Translate (via the google-trans-new API), and only the instances where an exact match of the translated answer is found in the translated context are kept for augmentation. The total number of instances per language after translation is listed in Table 3.

C Complete Few-Shot Experiments
Provided in Table 4.

D Mix-and-Match Experiments
Provided in Table 5.

E Budget Allocation Experiments
The complete results for our experiments are presented in Table 6.