AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

African languages have far less in-language content available digitally, making it challenging for question-answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems—those that retrieve answer content from other languages while serving people in their native language—offer a means of filling this gap. To this end, we create AFRIQA, the first cross-lingual QA dataset with a focus on African languages. AFRIQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AFRIQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods.


Introduction
Question Answering (QA) systems provide access to information (Kwiatkowski et al., 2019) and increase accessibility in a range of domains, from healthcare and health emergencies such as COVID-19 (Möller et al., 2020; Morales et al., 2021) to legal queries (Martinez-Gil, 2021) and financial questions (Chen et al., 2021). Many of these applications are particularly important in regions where information and services may be less accessible and where language technology may thus help to reduce the burden on the existing system. At the same time, many people prefer to access information in their local languages, or simply do not speak a language supported by current language technologies (Amano et al., 2016). To benefit the more than three billion speakers of under-represented languages around the world, it is thus crucial to enable the development of QA technology in local languages.

Dataset | QA? | CLIR? | Open Retrieval? | # Languages | # African Languages
XQA (Liu et al., 2019) | ✓ | ✓ | ✓ | 9 | Nil
XOR QA (Asai et al., 2021) | ✓ | ✓ | ✓ | 7 | Nil
XQuAD (Artetxe et al., 2020) | ✓ | ✗ | ✗ | 11 | Nil
MLQA (Lewis et al., 2020) | ✓ | ✗ | ✗ | 7 | Nil
MKQA (Longpre et al., 2021) | ✓ | ✗ | ✓ | 26 | Nil
TyDi QA (Clark et al., 2020) | ✓ | ✗ | ✓ | 11 | 1
AmQA (Abedissa et al., 2023) | ✓ | ✗ | ✗ | 1 | 1
KenSwQuAD (Wanjawa et al., 2023) | ✓ | ✗ | ✗ | 1 | 1
AFRIQA (Ours) | ✓ | ✓ | ✓ | 10 | 10 (see Table 3)

Table 1: Comparison of AFRIQA with other question answering datasets. The "QA", "CLIR", and "Open Retrieval" columns indicate whether the dataset supports question answering, cross-lingual retrieval, or open retrieval, respectively. "# Languages" shows the total number of languages in the dataset, and the final column counts the African languages present.
In this work, we lay the foundation for research on QA systems for one of the most linguistically diverse regions by creating AFRIQA, the first QA dataset for 10 African languages. AFRIQA focuses on open-retrieval QA where information-seeking questions are paired with retrieved documents in which annotators identify an answer if one is available (Kwiatkowski et al., 2019). (These questions are information-seeking in that they are written without seeing the answer, as is the case with real users of QA systems. We contrast this with the reading comprehension task, where the question-writer sees the answer passage prior to writing the question; that genre of questions tends to have higher lexical overlap with the answer passage and to elicit questions that may not be of broad interest.) As many African languages lack high-quality in-language content online, AFRIQA employs a cross-lingual setting (Asai et al., 2021) where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. To ensure the utility of this dataset, we carefully select a relevant source language (either English or French) based on its prevalence in the region corresponding to the query language. AFRIQA includes 12,000+ examples across 10 languages spoken in different parts of Africa. The majority of the dataset's questions are centered around entities and topics that are closely linked to Africa. This is an advantage over simply translating existing datasets into these languages. By building a dataset from the ground up that is specifically tailored to African languages and their corresponding cultures, we are able to ensure better contextual relevance and usefulness of this dataset.
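To make the cross-lingual setting concrete, a single AFRIQA example pairs a source-language question with its pivot-language translation, a retrieved pivot-language passage, and an answer in both languages. The field names and the example record below are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class XorQaExample:
    """One cross-lingual open-retrieval QA example (fields are illustrative)."""
    question: str                 # question in the African source language
    language: str                 # code of the source language, e.g. "yo"
    pivot_language: str           # "en" or "fr", chosen by regional prevalence
    translated_question: str      # human translation into the pivot language
    passage: Optional[str]        # retrieved pivot-language passage, if any
    answer_pivot: Optional[str]   # minimal answer span in the pivot language
    answer: Optional[str]         # answer translated back into the source language

# A hypothetical record (question and passage text invented for illustration).
example = XorQaExample(
    question="Ta ni gomina ipinle Oyo?",
    language="yo",
    pivot_language="en",
    translated_question="Who is the governor of Oyo State?",
    passage="Seyi Makinde is a Nigerian politician serving as governor of Oyo State.",
    answer_pivot="Seyi Makinde",
    answer="Seyi Makinde",
)
```

Unanswerable questions would simply carry `None` in the passage and answer fields.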
We conduct baseline experiments for each part of the open-retrieval QA pipeline using different translation systems, retrieval models, and multilingual reader models. We demonstrate that cross-lingual retrieval still has a large deficit compared to automatic translation and retrieval; we also show that a hybrid approach of sparse and dense retrieval improves over either technique in isolation. We highlight interesting aspects of the data and discuss annotation challenges that may inform future annotation efforts for QA. Overall, AFRIQA proves challenging for state-of-the-art QA models. We hope that AFRIQA encourages and enables the development and evaluation of more multilingual and equitable QA technology. The dataset will be released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
In summary, we make the following contributions: • We introduce the first cross-lingual question answering dataset with 12,000+ questions across 10 geographically diverse African languages. This dataset directly addresses the deficit of African languages in existing datasets.
• We conduct an analysis of the linguistic properties of the 10 languages, which is crucial to take into account when formulating questions in these languages.
• Finally, we conduct a comprehensive evaluation of the dataset for each part of the open-retrieval QA pipeline using various translation systems, retrieval models, and multilingual reader models.

AFRIQA
AFRIQA is a cross-lingual QA dataset that was created to promote the representation and coverage of under-resourced African languages in NLP research. We show examples of the data in Table 2. In §2.1, we provide an overview of the 10 languages, discussing their linguistic properties, while §2.2 and §2.3 describe the data collection procedure and the quality control measures put in place to ensure the quality of the dataset.

Discussion of Languages
African languages have unique typologies, grammatical structures, and phonology, many of them being tonal and morphologically rich (Adelani et al., 2022b).We provide a high-level overview of the linguistic properties of the ten languages in AFRIQA that are essential to consider when crafting questions for QA systems.

Data Collection Procedure
For each of the 10 languages in AFRIQA, a team of 2-6 native speakers was responsible for the data collection and annotation. Each team was led by a coordinator. The annotation pipeline consisted of 4 distinct stages: 1) question elicitation in an African language; 2) translation of questions into a pivot language; 3) answer labeling in the pivot language based on a set of candidate paragraphs; and 4) answer translation back to the source language. All data contributions were compensated financially.

Question Elicitation
The TyDi QA methodology (Clark et al., 2020) was followed to elicit locally relevant questions.
Team members were presented with prompts including the first 250 characters of the most popular Wikipedia articles in their languages, and asked to write factual or procedural questions for which the answers were not contained in the prompts.
Annotators were encouraged to follow their natural curiosity. This annotation process avoids excessive and artificial overlap between the question and answer passage, which can often arise in data collection efforts for non-information-seeking QA tasks such as reading comprehension. For languages like Fon and Bemba without a dedicated Wikipedia, relevant prompts from French and English Wikipedia were used to stimulate native-language question generation. For Swahili, unanswered TyDi QA questions were curated for correctness; these questions were included because the original TyDi QA team was unable to locate a suitable Swahili paragraph that answered them. Simple spreadsheets facilitated the question elicitation process.
Before moving on to the second stage, team coordinators reviewed elicited questions for grammatical correctness and suitability for the purposes of information-seeking QA.

Question Translation
Elicited questions were translated from the original African languages into pivot languages following Asai et al. (2021). English was used as the pivot language for all languages except Wolof and Fon, for which French was used. Where possible, questions elicited by one team member were allocated to a different team member for translation, to further ensure that only factual or procedural questions that are grammatically correct made it into the final dataset. This serves as an additional validation layer for the elicited questions.

Answer Retrieval
Using the translated questions as queries, Google Programmable Search Engine was used to retrieve Wikipedia paragraphs that are candidates to contain an answer in the corresponding pivot language. A Mechanical Turk interface was used to show candidate paragraphs to team members (although the Mechanical Turk interface was employed, all annotations were carried out by team members), who were then asked to identify 1) the paragraph that contains an answer and 2) the exact minimal span of the answer. In the case of polar questions, team members had to select "Yes" or "No" instead of a minimal span. In cases where candidate paragraphs did not contain the answer to the corresponding question, team members were instructed to select the "No gold paragraph" option.
As with question elicitation, team members went through a phase of training, which included a group meeting where guidelines were shared and annotators were walked through the labeling tool. Two rounds of in-tool labeling training were conducted.

Answer Translation
To obtain answers in the African languages, we translated the answers in the pivot languages to the corresponding African languages. We allocated the task of translating the answers labeled by one team member to a different team member in order to ensure accuracy. Translators were instructed to minimize the span of the translated answers. In cases where the selected answers were incorrect or annotators failed to select the minimal span, we either removed the question, corrected the answer, or re-annotated the question using the annotation tool.

Quality Control
We enforced rigorous quality control measures throughout the dataset generation process to ascertain its integrity, quality, and appropriateness. Our strategy involved recruiting native speakers as annotators and team coordinators. Each annotator underwent initial training in question elicitation via English prompts, with personalized feedback focusing on factual question generation and avoiding prompt-contained answers. Annotators were required to achieve at least 90% accuracy to proceed to native-language elicitation, with additional one-on-one training rounds provided as necessary.
Each language team comprised a minimum of three members, with the Fon and Kinyarwanda teams as exceptions, each having two members. This structure was designed to ensure that different team members handled question elicitation and translation, enhancing quality control. Non-factual or inappropriate questions were flagged during the translation and answer labeling phases, leading to their correction or removal. Team coordinators meticulously reviewed all question-and-answer pairs alongside their translations, while central managers checked translation consistency. Post-annotation controls helped rectify common issues such as answer-span length and incorrect answer selection.

Final Dataset
The statistics of the dataset are presented in Table 3, which includes information on the languages, their corresponding pivot languages, and the total number of questions collected for each language. The final dataset consists of a total of 12,239 questions across 10 different languages, with 8,892 corresponding question-answer pairs. We observed a high answer coverage rate, with only 27% of the total questions being unanswerable using Wikipedia. The unanswerable questions are mainly attributable to the lack of relevant information on Wikipedia, especially for Africa-related entities with sparse coverage. Despite this sparsity, we were able to find answers for over 60% of the questions in most of the languages in our collection.

Tasks and Baselines
As part of the evaluation for AFRIQA, we follow the methodology proposed in Asai et al. (2021) and assess performance on three different tasks: XOR-Retrieve, XOR-PivotLanguageSpan, and XOR-Full. Each task poses unique challenges for cross-lingual information retrieval and QA due to the low-resource nature of many African languages.

Translation Systems
A common approach to cross-lingual QA is to translate queries from the source language into a target language, which is then used to find an answer in a given passage. For our experiments, we explore the use of different translation systems as baselines for AFRIQA. We consider human translation, Google Translate, and open-source translation models such as NLLB (NLLB Team et al., 2022) and finetuned M2M-100 models (Adelani et al., 2022a) in zero-shot settings.
Google Translate. We use Google Translate because it provides out-of-the-box translation for 7 out of 10 languages in our dataset. Although Google Translate provides a strong translation baseline for many of the languages, we cannot guarantee the future reproducibility of these translations, as it is a product API that is constantly being updated. For our experiments, we use the translation system as of February 2023.

NLLB. NLLB is an open-source translation system trained on 100+ languages, covering all the languages in AFRIQA. At the time of its release, NLLB provided state-of-the-art translation quality in many languages. For our experiments, we use the 1.3B-parameter NLLB model.

Passage Retrieval (XOR-Retrieve)
We present two baseline retrieval systems: translate-retrieve and cross-lingual baselines. In the translate-retrieve baseline, we first translate the queries using the translation systems described in §4.1. The translated queries are used to retrieve relevant passages using the different retrieval systems outlined below. Alternatively, the cross-lingual baseline directly retrieves passages in the pivot language without the need for translation, using a multilingual dense retriever.
BM25. BM25 (Robertson and Zaragoza, 2009) is a classic term-frequency-based retrieval model that matches queries to relevant passages using the frequency of word occurrences in both queries and passages. We use the BM25 implementation provided by Pyserini (Lin et al., 2021) with default hyperparameters k1 = 0.9, b = 0.4 for all languages.

mDPR. We evaluate the performance of mDPR, a multilingual adaptation of the Dense Passage Retriever (DPR) model (Karpukhin et al., 2020) using multilingual BERT (mBERT). We finetuned mDPR on the MS MARCO passage ranking dataset (Bajaj et al., 2018) for our experiments. Retrieval is performed using the Faiss flat index implementation provided by Pyserini.
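As an illustration of the sparse baseline, the Okapi BM25 scoring function with the same hyperparameters (k1 = 0.9, b = 0.4) can be sketched from scratch; in practice we rely on Pyserini's Lucene-backed implementation, so this is only a minimal didactic version:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=0.9, b=0.4):
    """Score each document in `corpus_tokens` against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency for each distinct query term.
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * num / den
        scores.append(s)
    return scores

# Toy corpus: the first passage should match the query best.
docs = [
    "kwame nkrumah was the first president of ghana".split(),
    "the capital of ghana is accra".split(),
]
query = "first president of ghana".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The `b` term down-weights matches in long documents; `k1` controls term-frequency saturation.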
Sparse-Dense Hybrid. We also explore sparse-dense hybrid baselines, a combination of sparse (BM25) and dense (mDPR) retrievers. We use a linear combination of both systems' scores to generate a reranked list of passages for each question.

Answer Span Prediction
To benchmark models' answer selection capabilities on AFRIQA, we combine different translation, extractive, and generative QA approaches.
Generative QA on Gold Passages. To evaluate generative QA performance, we utilize mT5-base (Xue et al., 2021) finetuned on SQuAD 2.0 (Rajpurkar et al., 2016) and evaluate it using both translated and original queries. The model is provided with the query and the annotated gold passage, formatted using a template prompt, and generates the answer to the question.
Extractive QA on Retrieved Passages. We also evaluate extractive QA using passages returned by the various retrieval baselines outlined in §4.2. The model is trained to extract answer spans from each passage, along with a probability indicating the likelihood of each answer. The answer span with the highest probability is selected as the correct answer. We trained a multilingual DPR reader model, which was initialized from mBERT and finetuned on Natural Questions (Kwiatkowski et al., 2019).
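The span-selection step above (pick the highest-probability span across all retrieved passages) can be sketched as follows; the candidate structure is illustrative, standing in for the reader model's actual per-passage outputs:

```python
def select_answer(candidates):
    """Pick the highest-probability span across all retrieved passages.

    `candidates` maps passage id -> list of (span_text, probability) pairs,
    mirroring the extractive reader's per-passage output (shapes are
    illustrative, not the model's real API).
    """
    best_span, best_p = None, float("-inf")
    for spans in candidates.values():
        for text, p in spans:
            if p > best_p:
                best_span, best_p = text, p
    return best_span, best_p

# Hypothetical reader output for one question over two retrieved passages.
candidates = {
    "passage_1": [("Addis Ababa", 0.62), ("Ethiopia", 0.11)],
    "passage_2": [("Addis Ababa, Ethiopia", 0.48)],
}
answer, prob = select_answer(candidates)
```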

XOR-Retrieve Results
We present the retrieval results for recall@10 in Table 4 using different translation and retrieval systems. We also report performance with both original and human-translated queries. The table shows that hybrid retrieval using human translation yields the best results for all languages, with an average recall@10 of 73.9. In isolation, mDPR retrieval outperforms BM25 for all translation types. The table also enables us to compare the effectiveness of different translation systems in locating relevant passages for cross-lingual QA in African languages. This is illustrated in Figure 1, which shows retriever recall rates for different translation types at various cutoffs using mDPR.
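The recall@k metric reported here can be computed straightforwardly; the inputs below are toy data, and the implementation assumes a single gold passage per question, as in the evaluation:

```python
def recall_at_k(retrieved, gold_ids, k=10):
    """Fraction of questions whose gold passage appears in the top-k results.

    `retrieved` is a list (one entry per question) of ranked passage-id lists;
    `gold_ids` gives the single gold passage id per question.
    """
    hits = sum(1 for ranked, gold in zip(retrieved, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy example: three questions, ranked lists of passage ids.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
gold = ["b", "z", "g"]
score = recall_at_k(retrieved, gold, k=2)  # gold found in top-2 for questions 1 and 3
```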
We observe that human translation yields better accuracy than all other translation types, indicating that the current state-of-the-art machine translation systems still have a long way to go in accurately translating African languages. Google Translate shows better results for the languages where it is available, while the NLLB model provides better coverage. The cross-lingual retrieval model that retrieves passages using questions in their original language is the least effective of all the model types. This illustrates that the cross-lingual representations learned by current retrieval methods are not yet of sufficient quality to enable accurate retrieval across different languages.

XOR-PivotLanguageSpan Results
Gold Passage Answer Prediction. We first evaluate the generative QA setting using gold passages. We present F1 and Exact Match results using different methods to translate the query in Table 5. Human translation of the queries consistently outperforms using machine-translated queries, which in turn outperforms using queries in their original language.
Retrieved Passages Answer Prediction. We now evaluate performance using retrieved passages from §5.1. We present F1 and Exact Match results with different translation-retriever combinations in Table 7. We extract the answer spans from only the top-10 retrieved passages for each question using an extractive multilingual reader model (see §4.3). The model assigns a probability to each answer span, and we select the answer with the highest probability as the final answer.
Our results show that hybrid retrieval using human-translated queries achieves the best performance across all languages on average. Using human-translated queries generally outperforms using translations by both Google Translate and NLLB, regardless of the retriever system used. In terms of retrieval methods, mDPR generally performs better than BM25, with an average gain of 3 F1 points across different translation types. These results highlight the importance of carefully selecting translation-retriever combinations to achieve the best answer span prediction results in cross-lingual QA.
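The F1 and Exact Match scores used throughout can be computed SQuAD-style; this sketch applies the standard English normalization (lowercasing, punctuation and article stripping), which is an assumption here, since answer normalization for other languages may differ:

```python
import re
import string
from collections import Counter

def normalize_text(s):
    """Lowercase, strip punctuation, English articles, and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if prediction and gold are identical after normalization, else 0.0."""
    return float(normalize_text(pred) == normalize_text(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall over the answer strings."""
    p, g = normalize_text(pred).split(), normalize_text(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Nile", "Nile")     # articles are stripped, so this matches
f1 = token_f1("river Nile", "the Nile")  # partial token overlap
```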

XOR-Full Results
Each pipeline consists of components for question translation, passage retrieval, answer extraction, and answer translation. From Table 8, we observe that Google machine translation combined with mDPR is the most effective. This is followed by a pipeline combining NLLB translation with mDPR.
Related Work

Datasets for QA and information retrieval tasks have also been created for African languages. They are, however, very few and cater to individual languages (Abedissa et al., 2023; Wanjawa et al., 2023) or a small subset of languages spoken in individual countries (Daniel et al., 2019; Zhang et al., 2022). Given the region's large number of linguistically diverse and information-scarce languages, multilingual and cross-lingual datasets are encouraged to catalyze research efforts. To the best of our knowledge, there are no publicly available cross-lingual open-retrieval African language QA datasets.
Comparison to Other Resources. Multilingual QA datasets have paved the way for language models to learn across multiple languages simultaneously, with both reading comprehension (Lewis et al., 2020) and other QA datasets (Longpre et al., 2021; Clark et al., 2020) predominantly utilizing publicly available data sources such as Wikipedia, SQuAD, and the Natural Questions dataset. To address the information scarcity of the data sources typically used for low-resource languages, cross-lingual datasets (Liu et al., 2019; Asai et al., 2021) emerged that translate between low-resource and high-resource languages, providing access to a larger information retrieval pool and thus decreasing the fraction of unanswerable questions. Despite these efforts, however, the inclusion of African languages remains extremely rare, as shown in Table 1, which compares our dataset to other closely related QA datasets. TyDi QA features Swahili as the sole African language out of the 11 languages it covers.

Conclusion
In this work, we take a step toward bridging the information gap between native speakers of many African languages and the vast amount of digital information available on the web by creating AFRIQA, the first open-retrieval cross-lingual QA dataset focused on African languages, with 12,000+ questions. We anticipate that AFRIQA will help improve access to relevant information for speakers of African languages. By leveraging the power of cross-lingual QA, we hope to bridge the information gap and promote linguistic diversity and inclusivity in digital information access.

Limitations
Our research focuses on using English and French Wikipedia as the knowledge base for creating systems that can answer questions in 10 African languages. While Wikipedia is a comprehensive source of knowledge, it does not accurately reflect all societies and cultures (Callahan and Herring, 2011). There is a limited understanding of African contexts, with relatively few Wikipedia articles dedicated to Africa-related content. This could potentially limit the ability of a QA system to accurately find answers to questions related to African traditions, practices, or entities. Also, by focusing on English and French as pivot languages, we might introduce translation inaccuracies or ambiguities, which might impact the performance of a QA system.
Addressing these limitations requires concerted efforts to develop more localized knowledge bases or improve existing sources such as Wikipedia by updating or creating articles that reflect the diversity and richness of African societies.

A Preparing Wikipedia Passages
Wikipedia is a popular choice as a knowledge base for open-retrieval question-answering (QA) experiments, where articles are usually divided into fixed-length passages that are indexed and used for retrieval and reading comprehension, as seen in previous work (Karpukhin et al., 2020; Asai et al., 2021). However, Tamber et al. (2023) highlighted that splitting articles into fragmented and disjoint passages can negatively impact downstream reading comprehension performance. Instead, they proposed a sliding window segmentation approach to create passages from Wikipedia articles. In line with this methodology, we used the same approach to create passages for our cross-lingual question-answering experiments.
To create our passages, we downloaded the Wikipedia dumps dated May 01, 2022, for English Wikipedia and April 20, 2022, for French Wikipedia. We then applied the sliding window approach to generate fixed-length passages of 100 tokens each from these dumps. These passages serve as our knowledge base for retrieval and answer span extraction. By adopting the sliding window segmentation approach for creating Wikipedia passages, we aim to improve downstream reading comprehension performance. The fixed-length passages enable efficient indexing and retrieval of relevant information for a given question, while reducing the impact of the disjoint and fragmented information that may result from arbitrarily splitting articles.
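The sliding window segmentation can be sketched as follows; the 100-token window matches the description above, but the stride (the offset between consecutive windows, which determines the overlap) is an assumed parameter, since the text does not specify it:

```python
def sliding_window_passages(tokens, window=100, stride=50):
    """Segment an article's tokens into overlapping fixed-length passages.

    `window` is the passage length in tokens; `stride` (hypothetical here)
    controls how far each window advances, so consecutive passages overlap
    by `window - stride` tokens.
    """
    if len(tokens) <= window:
        return [tokens]
    passages = []
    start = 0
    while start < len(tokens):
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the article
        start += stride
    return passages

# A toy "article" of 250 tokens yields four overlapping 100-token passages.
article = [f"tok{i}" for i in range(250)]
chunks = sliding_window_passages(article, window=100, stride=50)
```

The overlap ensures that an answer span falling on a passage boundary is still contained whole in at least one passage.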

B Training and Evaluation Details

B.1 mDPR Reader
We train a multilingual DPR reader model using pretrained bert-base-multilingual-uncased as the model backbone. The model was trained to predict the correct answer span for a question given a set of relevant passages. We trained our model using the DPR retriever output on the training and development sets of Natural Questions, and evaluated on the test set of AFRIQA in a zero-shot manner. The model was trained on 4 A6000 Nvidia GPUs with a batch size of 16 and 2 gradient accumulation steps. We used an initial learning rate of 5e-5 and 500 warmup steps. The full list of training hyperparameters can be found in Table 9.

B.2 AfroXLM-R Reader
To extract answer spans from the gold passages, we train extractive reader models on the training sets of SQuAD 2.0 (Rajpurkar et al., 2016) and FQuAD (d'Hoffschmidt et al., 2020) using AfroXLM-R as a backbone. We evaluated the models on the test queries and the annotated gold passages. The models were trained for 5 epochs using a fixed learning rate of 3e-5 and a batch size of 16 on a single A100 Nvidia GPU.

B.3 mT5 Reader
We finetuned the multilingual pretrained text-to-text transformer mT5 (Xue et al., 2020) on the SQuAD 2.0 (Rajpurkar et al., 2016) dataset to generate answers from the gold passages. We trained the model for 5 epochs with a learning rate of 3e-5 and a batch size of 32 on a single A100 Nvidia GPU.

C Machine Translation BLEU Scores
Table 10 shows the BLEU scores of the different translation systems on the test set of AFRIQA, evaluated against the human-translated queries. Google Translate performs best on the languages it supports, while NLLB 1.3B achieves slightly poorer performance with broader language coverage. This further highlights the downstream effect of translation quality on retriever effectiveness, with human translations showing better accuracy than machine translation systems.
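For reference, BLEU with uniform weights over 1- to 4-grams and a brevity penalty can be sketched from scratch; reported scores would normally come from a standard tool such as sacreBLEU (with its own tokenization and smoothing), so this unsmoothed sentence-level version is only illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Plain BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(clipped / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

perfect = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")
```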

D.2 XOR-Full Results
Table 12 presents the Exact Match accuracy and BLEU scores for the XOR-Full task. The table contains downstream results of different translation-retriever pipelines used to extract the answer span and translate it back to the language of the question.

E Summary of Language Linguistic Properties
In Table 13, we provide a structured breakdown of the typologies, grammatical structures, and phonology of the 10 languages in AFRIQA.

Figure 1 :
Figure 1: Graph of retriever recall@k for different translation systems. The scores shown in this graph are from mDPR retrieval.

Table 2 :
Table showing selected questions, relevant passages, and answers in different languages from the dataset. It also includes the human-translated versions of both questions and answers. For the primary XOR QA task, systems are expected to find the relevant passage among all Wikipedia passages, not simply the gold passage shown above.

Table 3 :
Dataset information: This table contains key information about the AFRIQA dataset.

Table 4 :
Retrieval Recall@10: This table displays the retrieval recall results for various translation types on the test set of AFRIQA. It shows the percentage of questions for which the answer is contained in the top-10 retrieved passages. The last column represents cross-lingual retrieval, where we skip the translation step and use the original queries. We boldface the best-performing model for each language within the human translation oracle scenario and within the real-world automatic translation scenario.

Note on Table 4: For recall@k retrieval results, we assume that there is only one gold passage, despite the possibility of other retrieved passages containing the answer.

Table 6 :
Gold Passages Answer Prediction: Comparison of F1 and Exact Match accuracy scores for extractive answer span prediction on the test set using AfroXLMR-base (Alabi et al., 2022) as the backbone.

Table 7 :
F1 scores on pivot language answer generation using an extractive multilingual reader model with different query translation and retrieval methods.

Table 8 :
XOR-Full F1 results combining different translation and retriever systems.

Table 9 :
DPR Reader Training Configurations

Table 10 :
Translation BLEU Scores: BLEU scores of some translation systems on the test set for the answer translation task. Note that Google Translate is not yet available for all languages, due to their very low-resource nature.