MIA 2022 Shared Task Submission: Leveraging Entity Representations, Dense-Sparse Hybrids, and Fusion-in-Decoder for Cross-Lingual Question Answering

We describe our two-stage system for the Multilingual Information Access (MIA) 2022 Shared Task on Cross-Lingual Open-Retrieval Question Answering. The first stage consists of multilingual passage retrieval with a hybrid dense and sparse retrieval strategy. The second stage consists of a reader which outputs the answer from the top passages returned by the first stage. We show the efficacy of using entity representations, sparse retrieval signals to help dense retrieval, and Fusion-in-Decoder. On the development set, we obtain 43.46 F1 on XOR-TyDi QA and 21.99 F1 on MKQA, for an average F1 score of 32.73. On the test set, we obtain 40.93 F1 on XOR-TyDi QA and 22.29 F1 on MKQA, for an average F1 score of 31.61. We improve over the official baseline by over 4 F1 points on both the development and test sets.


Introduction
This paper describes our submission to the Multilingual Information Access (MIA) 2022 Shared Task on Cross-Lingual Open-Retrieval Question Answering. Cross-lingual open-retrieval question answering is the task of finding an answer to a knowledge-seeking question, in the same language as the question, from a collection of documents in many languages. The answer may not exist in a document in the same language as the question, so a system needs to find the answer across relevant documents in a different language. The shared task at Multilingual Information Access 2022 evaluates cross-lingual open-retrieval question answering systems using two datasets, XOR-TyDi QA (Asai et al., 2020) and MKQA (Longpre et al., 2020). We use a two-stage approach, similar to the CORA (Asai et al., 2021) baseline, where the first stage performs multilingual passage retrieval and the second stage performs cross-lingual answer generation. In the first stage, we leverage mLUKE (Ri et al., 2021), a pretrained language model that models entities, to train a dual encoder that encodes the question and passage separately (Karpukhin et al., 2020). During retrieval, we perform nearest neighbor search using the query vector on an index of encoded passage vectors. We merge these dense retrieval hits with BM25 sparse retrieval hits using an algorithm we call Sparse-Corroborate-Dense. Finally, we feed the ranked list of passages into a reader based on Fusion-in-Decoder (Izacard and Grave, 2020) and mT5 (Xue et al., 2020) to produce the final answer. We do not perform iterative training to repeat these steps multiple times.
Compared to official baseline 1, we improve the macro-averaged score by 4.1 F1 points. We perform analysis to show the effectiveness of using a multilingual language model with entity representations in pretraining, sparse signals to improve dense hits, and Fusion-in-Decoder.

Datasets
We use the official training data, consisting of 76635 English questions and answers from Natural Questions (Kwiatkowski et al., 2019) and 61360 questions and answers from XOR-TyDi QA (Asai et al., 2020), to train our dual encoder model. We do not train on the development data or on the subsets of the Natural Questions and TyDi QA (Clark et al., 2020) data that are used to create the MKQA and XOR-TyDi QA data. For training the reader, we leverage Wikipedia language links, as detailed in Section 3.3.
XOR-TyDi QA consists of annotated questions and short answers across seven typologically diverse languages. It can be broken down into two subsets: questions whose answer can be found in a passage in the same language as the question ("in-language"), which come from answerable questions in TyDi QA (Clark et al., 2020), and questions that are unanswerable from a passage in the same language as the question and whose answer can only be found in an English passage ("cross-lingual"), which are newly added answers in XOR-TyDi QA. A system should therefore be able to succeed at both monolingual retrieval and cross-lingual retrieval.
MKQA (Longpre et al., 2020) consists of 10K parallel questions and answers across 26 typologically diverse locales. The original questions are taken from Natural Questions (Kwiatkowski et al., 2019) in English and translated into 25 other locales. MKQA does not contain any training data and is used only for evaluation.

Passage Corpus
We directly use the passage corpus provided by the shared task, with the addition of Tamil (ta) and Tagalog (tl), which are not included in the baseline's passage data. Following the other languages, we use the 20190201 snapshot of the Wikipedia dumps. We follow the same preprocessing steps as the baseline passage data. We manually split the data into language-specific files, which are later used to build language-specific dense and sparse indices. Final passage retrieval results are aggregated across the different indices. The number of passages in each language is shown in Table 1.
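The per-language split can be sketched as follows; the JSONL layout and the `lang` field name are assumptions for illustration, not the shared task's exact schema:

```python
import json
from collections import defaultdict

def split_by_language(passages, out_dir):
    """Group passages by language and write one JSONL file per language.

    `passages` is an iterable of dicts; we assume (hypothetically) that
    each record carries a `lang` field alongside its title and text.
    Returns the number of passages written per language.
    """
    by_lang = defaultdict(list)
    for passage in passages:
        by_lang[passage["lang"]].append(passage)
    for lang, rows in by_lang.items():
        with open(f"{out_dir}/{lang}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {lang: len(rows) for lang, rows in by_lang.items()}
```

Each per-language file then feeds one dense index and one sparse index, so monolingual sparse search can be restricted to the question's language.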

System Architecture and Pipeline
Our system differs from the baseline in three ways. First, in the passage retrieval step, we replace mBERT (Devlin et al., 2019) with mLUKE (Ri et al., 2021). Second, we construct sparse indices from which we retrieve passages to augment the dense retriever's passages, inspired by Zhang et al. (2021) but using a different dense-sparse hybrid approach. Finally, we encode each question and passage pair independently, as opposed to concatenating all passages together, following the Fusion-in-Decoder (Izacard and Grave, 2020) approach.

mLUKE Dense Retriever
For dense retrieval, we initialize the dual encoder in DPR (Karpukhin et al., 2020) with mLUKE (Ri et al., 2021), a multilingual language model pretrained with entity information from Wikipedia dumps in 24 languages. We use the same training objective as DPR and likewise use the last layer's hidden state of the first input token as the representation for both the question and the passage. Only in-batch negatives are used to train the dual encoder. mLUKE is pretrained using both the masked language modeling (MLM) task (Vaswani et al., 2017; Devlin et al., 2019) and the masked entity prediction (MEP) task (Yamada et al., 2020). Note that entity representations are only used in the pretraining input; for simplicity, only word inputs are used when finetuning the dual encoder and at inference time. This resembles the mLUKE-W variant in the mLUKE paper (Ri et al., 2021), which still observed notable improvements over the baselines from having MEP only as an auxiliary pretraining task. We use the Hugging Face transformers (Wolf et al., 2020) versions of mluke-base and bert-base-multilingual-uncased. We build off the official DPR codebase to add the mLUKE encoder.
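The in-batch negatives objective can be sketched in a few lines of NumPy. This is an illustrative simplification only: real training backpropagates through the mLUKE question and passage encoders, which this sketch omits.

```python
import numpy as np

def in_batch_negative_loss(questions, passages):
    """DPR-style training objective with in-batch negatives.

    questions, passages: (B, d) arrays of encoded vectors, where
    passages[i] is the gold passage for questions[i] and every other
    passage in the batch acts as a negative. Returns the mean negative
    log-likelihood of the gold passage under a softmax over the batch.
    """
    scores = questions @ passages.T                      # (B, B) dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When question and gold-passage vectors align well, the diagonal scores dominate each row's softmax and the loss approaches zero.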

Dense-Sparse Hybrids
In order to retrieve passages effectively in a multilingual setting, the retrieval component needs to do well at both monolingual retrieval and cross-lingual retrieval. Monolingual retrieval is the setting where we want to retrieve passages in the same language as the question. For more than half of the questions in the XOR-TyDi QA dataset, for example, the answer is found in a passage in the same language as the question. Cross-lingual retrieval is the setting where we want to retrieve relevant passages in a different language from the question. We use both sparse retrieval (i.e., BM25) and dense retrieval together in our system. Experiments in Mr. TyDi (Zhang et al., 2021) indicate that BM25 outperforms DPR (Karpukhin et al., 2020) for the languages in XOR-TyDi QA in the monolingual retrieval setting, but that combining the sparse and DPR scores in sparse-dense hybrids performs even better. At the same time, sparse retrieval relies on lexical token matches and cannot do cross-lingual retrieval effectively without translating the query into the same language as the passage. To avoid the need for a machine translation system, we rely on multilingual dense passage retrieval for cross-lingual retrieval.
For dense retrieval, we use FAISS (Johnson et al., 2019) with IndexFlatIP. For sparse retrieval, we use Pyserini (Yang et al., 2017; Lin et al., 2021) with BM25 with default parameters. We build separate indices for each language for both the dense and sparse settings. For each query, where we want to return K passages, we search for the top K passages globally across the dense indices in all languages, and for the top K passages in the sparse index in the same language as the question.
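IndexFlatIP performs exact (brute-force) inner-product search; a NumPy sketch of the same operation, with toy vectors standing in for encoded passages, looks like this:

```python
import numpy as np

def search_flat_ip(index_vectors, query, k):
    """Exact inner-product search, mirroring what FAISS's IndexFlatIP does.

    index_vectors: (N, d) array of encoded passage vectors.
    Returns the indices and scores of the k passages whose vectors have
    the largest dot product with the query vector.
    """
    scores = index_vectors @ query     # (N,) inner products
    top = np.argsort(-scores)[:k]      # highest scores first
    return top, scores[top]
```

FAISS implements this search far more efficiently over millions of vectors, but the ranking semantics are the same.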
We combine results from dense retrieval and sparse retrieval using the following algorithm, which we call Sparse-Corroborate-Dense. Our final ranked list consists of three ordered slices. The first slice consists of passages that are present in both the dense and sparse retrieved lists, ranked in the same order as they appear in dense retrieval. The second slice consists of passages that are only in the dense ranked list and not in the sparse ranked list. The last slice consists of the top passages in the sparse ranked list. The number of passages from the sparse hits that are allowed to influence the final ranked list is no more than max_frac × K. We find this works better for cross-lingual retrieval than the score normalization and combination approach in Mr. TyDi (Zhang et al., 2021). Figure 1 illustrates this algorithm running with K = 5 and max_frac = 0.6 for a Bengali (bn) question. Please refer to Algorithm 1 for pseudocode of the algorithm.
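A minimal Python sketch of the three-slice construction follows. This is our reading of the description above rather than Algorithm 1 verbatim, and the doc IDs in the usage comment are hypothetical, chosen to reproduce the Figure 1 walkthrough:

```python
def sparse_corroborate_dense(dense, sparse, k, max_frac):
    """Merge dense and sparse ranked lists into a final top-k list.

    Slice 1: passages in both lists, in dense order (sparse corroborates
    dense). Slice 2: dense-only passages. Slice 3: top sparse-only
    passages. At most max_frac * k passages in the output may be
    sparse-influenced (slices 1 and 3 combined).
    """
    budget = int(max_frac * k)                 # max sparse-influenced slots
    sparse_set, dense_set = set(sparse), set(dense)
    in_both = [d for d in dense if d in sparse_set][:budget]
    dense_only = [d for d in dense if d not in sparse_set]
    spare = budget - len(in_both)              # slots left for sparse-only docs
    sparse_only = [s for s in sparse if s not in dense_set][:spare]
    n_dense_only = k - len(in_both) - len(sparse_only)
    return in_both + dense_only[:n_dense_only] + sparse_only

# Hypothetical doc IDs consistent with the Figure 1 walkthrough:
# sparse_corroborate_dense([5, 2, 3, 1, 4], [5, 2, 8, 9, 10], 5, 0.6)
# returns [5, 2, 3, 1, 8]
```

Docs 5 and 2 are corroborated by both retrievers and stay on top; doc 8 is sparse-only and consumes the one remaining sparse slot, displacing the lowest-ranked dense-only doc.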

Reader
Instead of concatenating the question and all the passages in the input to the encoder as in the baseline, which we call Fusion-in-Encoder, we use the Fusion-in-Decoder (FiD) approach (Izacard and Grave, 2020). In Fusion-in-Decoder, the encoder processes each of the retrieved passages independently, adding the special tokens question:, lang:, title:, and context: before the question, language, title, and text of each passage, while the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages. We use mT5 (Xue et al., 2020) as the encoder-decoder architecture, specifically the mt5-base variant in Hugging Face transformers (Wolf et al., 2020), using the Fusion-in-Decoder codebase. Independent processing of passages in the encoder allows the reader to scale linearly to a large number of contexts, while processing passages jointly in the decoder helps better aggregate evidence from multiple passages.
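The per-passage encoder inputs can be built as below; the exact spacing and field order follow our reading of the description above and may differ from the codebase's formatting:

```python
def build_fid_inputs(question, lang, passages):
    """Format one encoder input per retrieved passage, FiD-style.

    Each passage is a (title, text) pair. The encoder processes each
    returned string independently; the decoder later attends over the
    concatenation of all the resulting encoder representations.
    """
    return [
        f"question: {question} lang: {lang} title: {title} context: {text}"
        for title, text in passages
    ]
```

Because each string is encoded separately, encoder cost grows linearly with the number of passages rather than quadratically with total concatenated length.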
In order to semantically ground entities across different languages, we use Wikipedia language links to augment the data from the retriever when training the FiD-based reader, like the CORA baseline. First, for each question in the MIA training set that comes from Natural Questions, we use the answer to search for the corresponding Wikipedia page using the Wikipedia API. Generally, only answers that are entities will have a result. This returns the titles of the Wikipedia article in different languages, which we use as the answer in those languages. We use the DPR checkpoint trained with adversarial examples to retrieve English passages from the index. For each English question-answer pair, we find corresponding entries in the other languages being evaluated in the task and generate [Query_eng, Lang_target, Answer_target, Passages] tuples for training FiD. These tuples augment the original training data provided by the retriever.
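Given the language links for an answer entity, the augmented training tuples can be generated as below; the dict layout and field names are illustrative assumptions, not the training file's actual schema:

```python
def make_augmented_tuples(query_eng, langlinks, passages):
    """Build [Query_eng, Lang_target, Answer_target, Passages] tuples.

    `langlinks` maps a target language code to the Wikipedia article
    title in that language, which stands in for the translated answer.
    `passages` are English passages retrieved for the English question.
    """
    return [
        {
            "question": query_eng,   # English question
            "lang": lang,            # target answer language
            "answer": title,         # article title as the answer
            "passages": passages,    # shared English evidence
        }
        for lang, title in langlinks.items()
    ]
```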

Results
For training the dual encoder, we use the official training data without any hard negatives, with the same hyperparameters as the baseline dual encoder (Asai et al., 2021). For training Fusion-in-Decoder, we combine all of the retrieval results with sampled Wikipedia language link augmented passages such that each source makes up 50% of the training examples. For Wikipedia language link augmentation, we only sample English answers from the training set with links to 10 or more languages. We use the baseline retrieval results instead of mLUKE-retrieved results so that we can develop the retriever and reader in parallel. We use a learning rate of 0.00005 with a linear learning rate schedule and a weight decay of 0.01 using the AdamW optimizer. The context size (number of passages) in the final submission is 20 passages. Note that for retrieval we use K = 60 so that the same retrieval results can be reused across different context size experiments, but in the final submitted system we take the top 20 from this list for the reader. We use max_frac = 0.2 for Sparse-Corroborate-Dense. We use the best checkpoint on the development set for both components.

Main Results
We first report end-to-end results using our best system compared to the baseline in Table 2 for the development set and Table 3 for the test set. On the development set, we obtain a macro-averaged F1 score of 43.46 across all languages on XOR-TyDi QA, an improvement of 3.70 F1 points over the 39.76 obtained by baseline 1. We obtain a macro-averaged F1 score of 21.99 across all languages on MKQA, an improvement of 4.61 F1 points over the 17.38 obtained by baseline 1. On the test set, we observe fairly consistent results compared to the development set. On XOR-TyDi QA, our system and baseline 1 obtain 40.93 and 37.95 respectively, an improvement of 2.98 F1 points. On MKQA, our system and baseline 1 obtain 22.29 and 17.14 respectively, an improvement of 5.15 F1 points. On both the development and test sets, we outperform the baseline on all languages except Khmer (km) on MKQA. By qualitatively sampling passages retrieved for Khmer questions, we observe that our system frequently retrieves irrelevant passages for Khmer, giving the reader little chance to find the answer. mLUKE uses 24 languages for pretraining and does not include Khmer, making it difficult to align entities in Khmer. Furthermore, even when we use the baseline retrieval results, we still see a large drop in reader effectiveness for Khmer when switching to Fusion-in-Decoder, from row (iii) to row (ii) in Table 4. We only have 3101 rows of Khmer training data for our reader, all from Wikipedia language links, out of 275990 rows in total.
On the surprise languages Tagalog (tl) and Tamil (ta), we outperform the baseline by a large margin. Perhaps surprisingly, this large improvement cannot be attributed to the presence of Tagalog and Tamil passages in our corpus: in our best submission, for example, out of the 350 Tamil questions, only one has a retrieved Tamil passage among the top results fed to the reader. Instead, the system is able to generate correct answers from English passages.

Ablation Studies

We perform ablation studies on the development sets in Table 4. The largest gain comes from switching Fusion-in-Encoder to Fusion-in-Decoder, from row (iii) to row (ii), even though we did not increase the number of passages for Fusion-in-Decoder and kept it at 20 as in the final system. The second largest gain comes from switching mBERT to mLUKE, from row (ii) to row (i). Finally, the smallest gain comes from switching dense-only retrieval to Sparse-Corroborate-Dense, from row (i) to mLUKE + SP + FiD. We study each of the components in greater detail below.
Entity Representations To evaluate the passage retrieval component for XOR-TyDi QA, we measure MRR@60 and Recall@60. We pick 60 because it is near the maximum number of passages we can feed into Fusion-in-Decoder given GPU memory constraints. For each question, we use a heuristic to determine whether a passage is relevant. First, we find the universe set of answers for the question, which contains not only answers in the same language as the question but also possibly answers in English, using the English answer in the XOR-English Span task (Asai et al., 2020). We check whether a normalized answer is a substring of the passage text, and if so, we mark the passage as relevant.
Note that this is a proxy for measuring passage relevance, since answers may not necessarily be exact spans or substrings, and the same answer may appear as a substring of a non-relevant passage, but we found it to correlate well with end-to-end effectiveness. We see from Table 5 that overall, using mLUKE improves passage retrieval effectiveness. Qualitatively, we also find examples where the dual encoder trained with mLUKE can find passages cross-lingually containing the relevant entity whereas the one trained with mBERT cannot. In Figure 2, we see that mLUKE can retrieve a relevant English top passage for a Tamil question about the soccer players Messi and Ronaldo, whereas mBERT returns an English passage about another soccer player that is not relevant to the question.
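The relevance heuristic can be sketched as follows; the specific normalization (NFKC folding, lowercasing, punctuation and whitespace collapsing) is a plausible simplification, not necessarily the exact normalization we used:

```python
import re
import unicodedata

def normalize(text):
    # Hypothetical normalization: Unicode-fold, lowercase, strip
    # punctuation, and collapse whitespace.
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def is_relevant(passage_text, answers):
    """Mark a passage relevant if any normalized answer (in the
    question's language or in English) is a substring of the
    normalized passage text."""
    passage = normalize(passage_text)
    return any(normalize(a) in passage for a in answers)
```

As noted above, this can over-credit passages that mention the answer string incidentally, but it requires no gold passage annotations.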
Figure 3: Here we see a highly relevant passage found by sparse monolingual retrieval that is not found by dense retrieval, and a relevant passage found by dense retrieval cross-lingually that is not found by sparse retrieval.
Dense-Sparse Hybrids Next, in Table 6, we evaluate the benefits of using dense retrieval in conjunction with sparse retrieval, as opposed to using only dense retrieval. The dense retriever here is mLUKE. We see that dense retrieval always works better than sparse retrieval when used independently, and that the score combination approach used in Mr. TyDi (Zhang et al., 2021) does not outperform dense retrieval in recall, though it does improve MRR. We use Sparse-Corroborate-Dense, which piggybacks on the dense retrieval results but boosts the ranking of some passages in dense retrieval and adds passages not found by dense retrieval to the end of the top-K list. Compared to dense-only retrieval, it is better on both MRR and recall. When both dense and sparse retrieval find the same passage, it is a strong signal that the passage is relevant. Nonetheless, sparse retrieval can still find passages that dense retrieval cannot, and adding these to the candidate passage list passed to the reader can provide additional relevant evidence passages. In Figure 3, we see that sparse retrieval can find a highly relevant Telugu (te) passage related to World War I for a Telugu question that cannot be found by dense retrieval, and that dense retrieval can find a Swedish (sv) passage related to Bernard Montgomery cross-lingually for a Finnish (fi) question that cannot be found by sparse retrieval; the two complement each other.
Fusion-in-Decoder We want to understand the effect of increasing the number of passages sent to the reader by comparing the effectiveness of the reader with 20 passages versus 60 passages. Intuitively, there could be relevant passages found in positions 21-60, which should strengthen the evidence needed to output the final answer.
From Table 7 we observe that using more evidence passages consistently improves results, and this scaling advantage is key over Fusion-in-Encoder. However, due to time limitations, we only used the 20-passage setting for the final shared task submission.

Conclusion
We describe our submission for the MIA 2022 Shared Task and detail experiments we performed to improve specific components of the system. We find that using mLUKE (Ri et al., 2021), a multilingual language model that models entities during pretraining, combining dense and sparse results using Sparse-Corroborate-Dense, and using Fusion-in-Decoder are all helpful for improving the effectiveness of cross-lingual question answering over the baseline.

Figure 1:
Figure 1: An illustration of the Sparse-Corroborate-Dense algorithm, running with K = 5 and max_frac = 0.6 for a Bengali (bn) question. We first retrieve the 5 passages with the highest scores from the dense indices, and the top 5 passages from the Bengali sparse index. For the first output slice, we take passages present in both lists, ordered as in the dense list: doc 5 and doc 2. For the second slice, we add the top remaining passages from the dense list: doc 3 and doc 1. For the third slice, we take the top remaining passage from the sparse list: doc 8. The maximum number of sparse results allowed to influence the final list is 0.6 × 5 = 3: docs 5, 2, and 8.

Figure 2:
Figure 2: The top passage for a Tamil question retrieved by mBERT and mLUKE. We see that mLUKE is able to find an English passage related to the entities Messi and Ronaldo, but mBERT struggles, finding only a passage about an unrelated soccer player's goal scoring.

Table 1 :
Number of passages in the corpus for each language.

Table 4 :
Ablation studies on the development sets. mLUKE + SP + FiD is our submission with mLUKE + Sparse-Corroborate-Dense. (i) mLUKE + FiD relies only on dense retrieval, and we observe a slight decrease in F1 for most languages compared with our submission. (ii) mBERT + FiD changes the retriever to mBERT, and we observe a larger drop in F1 compared to mLUKE in row (i). (iii) mBERT + FiE changes Fusion-in-Decoder to Fusion-in-Encoder as in the baseline, and we see an even larger drop in F1 compared with row (ii).

Table 5 :
MRR@60 and Recall@60 of passage retrieval on the XOR-TyDi QA dev set for different pretrained language models.

Table 7 :
Exact Match (EM) and F1 scores for different numbers of passages on the development sets, using mLUKE-retrieved passages.