Cross-language Sentence Selection via Data Augmentation and Rationale Training

This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.


Introduction
Sentence-level query relevance prediction is important for downstream tasks such as query-focused summarization and open-domain question answering; accurately pinpointing sentences containing information that is relevant to the query is critical to generating a responsive summary/answer (e.g., Baumel et al., 2016, 2018). In this work, we focus on sentence-level query relevance prediction in a cross-lingual setting, where the query and sentence collection are in different languages and the sentence collection is drawn from a low-resource language. Our approach enables English speakers (e.g., journalists) to find relevant information expressed in local sources (e.g., local reaction to the pandemic and vaccines in Somalia).
While we can use machine translation (MT) to translate either the query or each sentence into a common language, and then use a monolingual Information Retrieval (IR) system to find relevant sentences, work on Probabilistic Structured Queries (PSQ) (Darwish and Oard, 2003) has shown that the performance of such MT+IR pipelines is hindered by errors in MT. As is well known, complete translation of the sentence collection is not necessary. Inspired by previous work (Vulić and Moens, 2015), we go a step further and propose a simple cross-lingual embedding-based model that avoids translation entirely and directly predicts the relevance of a query-sentence pair (where the query and sentence are in different languages).
For training, we treat a sentence as relevant to a query if there exists a translation equivalent of the query in the sentence. Our definition of relevance is most similar to the lexical-based relevance used in Gupta et al. (2007) and Baumel et al. (2018) but our query and sentence are from different languages. We frame the task as a problem of finding sentences that are relevant to an input query, and thus, we need relevance judgments for query-sentence pairs. Our focus, however, is on low-resource languages where we have no sentence-level relevance judgments with which to train our query-focused relevance model. We thus leverage noisy parallel sentence collections previously collected from the web. We use a simple data augmentation and negative sampling scheme to generate a labeled dataset of relevant and irrelevant pairs of queries and sentences from these noisy parallel corpora. With this synthetic training set in hand, we can learn a supervised cross-lingual embedding space.
While our approach is competitive with pipelines of MT-IR, it is still sensitive to noise in the parallel sentence data. We can mitigate the negative effects of this noise if we first train a phrase-based statistical MT (SMT) model on the same parallel sentence corpus and use the extracted word alignments as additional supervision. With these alignment hints, we demonstrate consistent and significant improvements over neural and statistical MT+IR (Niu et al., 2018;Koehn et al., 2007;Heafield, 2011), three strong cross-lingual embedding-based models (Bivec (Luong et al., 2015), SID-SGNS (Levy et al., 2017), MUSE (Lample et al., 2018)), a probabilistic occurrence model (Xu and Weischedel, 2000), and a multilingual pretrained model XLM-RoBERTa (Conneau et al., 2020). We refer to this secondary training objective as rationale training, inspired by previous work in text classification that supervises attention over rationales for classification decisions (Jain and Wallace, 2019).
To summarize, our contributions are as follows. We (i) propose a data augmentation and negative sampling scheme to create a synthetic training set of cross-lingual query-sentence pairs with binary relevance judgements, and (ii) demonstrate the effectiveness of a Supervised Embedding-based Cross-Lingual Relevance (SECLR) model trained on this data for low-resource sentence selection tasks on text and speech. Additionally, (iii) we propose a rationale training secondary objective to further improve SECLR performance, which we call SECLR-RT. Finally, (iv) we conduct training data ablation and hubness studies that show our method's applicability to even lower-resource settings and mitigation of hubness issues (Dinu and Baroni, 2015;Radovanović et al., 2010). These findings are validated by empirical results of experiments in a low-resource sentence selection task, with English queries over sentence collections of text and speech in Somali, Swahili, and Tagalog.

Related Work
Query-focused Sentence Selection Sentence-level query relevance prediction is important for various downstream NLP tasks such as query-focused summarization (Baumel et al., 2016, 2018; Feigenblat et al., 2017) and open-domain question answering (Chen et al., 2017; Dhingra et al., 2017; Kale et al., 2018). Such applications often depend on a sentence selection system to provide attention signals on which sentences to focus upon to generate a query-focused summary or answer a question.
Cross-language Sentence Selection A common approach to cross-language sentence selection is to use MT to first translate either the query or the sentence to the same language and then perform standard monolingual IR (Nie, 2010). The risk of this approach is that errors in translation cascade to the IR system.
As an alternative to generating full translations, PSQ (Darwish and Oard, 2003) uses word-alignments from SMT to obtain weighted query term counts in the passage collection. In other work, Xu and Weischedel (2000) use a 2-state hidden Markov model (HMM) to estimate the probability that a passage is relevant given the query.

Cross-lingual Word Embeddings Cross-lingual embedding methods perform cross-lingual relevance prediction by representing query and passage terms of different languages in a shared semantic space (Vulić and Moens, 2015; Litschko et al., 2018, 2019; Joulin et al., 2018). Both supervised approaches trained on parallel sentence corpora (Levy et al., 2017; Luong et al., 2015) and unsupervised approaches with no parallel data (Lample et al., 2018; Artetxe et al., 2018) have been proposed to train cross-lingual word embeddings.
Our approach differs from previous cross-lingual word embedding methods in two aspects. First, the focus of previous work has mostly been on learning a distributional word representation where translation across languages is primarily shaped by syntactic or shallow semantic similarity; it has not been tuned specifically for cross-language sentence selection tasks, which is the focus of our work.
Second, in contrast to previous supervised approaches that train embeddings directly on a parallel corpus or bilingual dictionary, our approach trains embeddings on an artificial labeled dataset augmented from a parallel corpus and directly represents relevance across languages. Our data augmentation scheme to build a relevance model is inspired by Boschee et al. (2019), but we achieve significant performance improvement by incorporating rationale information into the embedding training process and provide detailed comparisons of performance with other sentence selection approaches.
Trained Rationale Previous research has shown that models trained on classification tasks sometimes do not use the correct rationale when making predictions, where a rationale is a mechanism of the classification model that is expected to correspond to human intuitions about salient features for the decision function (Jain and Wallace, 2019). Research has also shown that incorporating human rationales to guide a model's attention distribution can potentially improve model performance on classification tasks (Bao et al., 2018). Trained rationales have also been used in neural MT (NMT); incorporating alignments from SMT to guide NMT attention yields improvements in translation accuracy (Chen et al., 2016).

Methods
We first describe our synthetic training set generation process, which converts a parallel sentence corpus for MT into cross-lingual query-sentence pairs with binary relevance judgements for training our SECLR model. Following that, we detail our SECLR model and finish with our method for rationale training with word alignments from SMT.

Training Set Generation Algorithm
Relevant query/sentence generation. Assume we have a parallel corpus of bilingual sentence pairs equivalent in meaning. Let (E, S) be one such sentence pair, where E is in the query language (in our case, English) and S is in the retrieval collection language (in our case, low-resource languages). For every unigram q in E that is not a stopword, we construct a positive relevant sample by viewing q as a query and S as a relevant sentence. Because sentences E and S are (approximately) equivalent in meaning, we know that there likely exists a translation equivalent of q in the sentence S and so we label the (q, S) pair as relevant (i.e. r = 1).
We generate the positive half of the training set by repeating the above process for every sentence pair in the parallel corpus. We limit model training to unigram queries since higher order ngrams appear fewer times and treating them independently reduces the risk of over-fitting. However, our model processes multi-word queries during evaluation, as described in Section 3.2.
Irrelevant query/sentence generation. Since learning with only positive examples is a challenging task, we opt to create negative examples, i.e. tuples (q, S', r = 0), via negative sampling. For each positive sample (q, S, r = 1), we randomly select another sentence pair (E', S') from the parallel corpus. We then check whether S' is relevant to q or not. Note that both the query q and sentence E' are in the same language, so checking whether q or a synonym can be found in E' is a monolingual task. If we can verify that there is no direct match or synonym equivalent of q in E' then by transitivity it is unlikely there exists a translation equivalent in S', making the pair (q, S') a negative example. To account for synonymy when we check for matches, we represent q and the words in E' with pretrained word embeddings. Let $w_q, w_{q'} \in \mathbb{R}^d$ be the embeddings associated with q and the words $q' \in E'$. We judge the pair (q, S') to be irrelevant (i.e. r = 0) if:
$$\max_{q' \in E'} \operatorname{cos\text{-}sim}(w_q, w_{q'}) \le \lambda_1$$
where $\lambda_1$ is a parameter. We manually tuned the relevance threshold $\lambda_1$ on a small development set of query-sentence pairs randomly generated by the algorithm, and set $\lambda_1 = 0.4$ to achieve the highest label accuracy on the development set. If (q, S') is not relevant we add (q, S', r = 0) to our synthetic training set, otherwise we re-sample (E', S') until a negative sample is found. We generate one negative sample for each positive sample to create a balanced dataset.
For example, if we want to generate a negative example for the positive example (q="meeting", S="ma runbaa madaxweyne gaas baaqday shirka copenhegan", r = 1), we randomly select another sentence pair (E'="many candidates competing elections one hopes winner", S'="musharraxiin tiro badan sidoo u tartamaysa doorashada wuxuuna mid kasta rajo qabaa guusha inay dhinaciisa ahaato") from the parallel corpus. To check whether q="meeting" is relevant to S', by transitivity it suffices to check whether q="meeting" or a synonym is present in E', a simpler monolingual task. If q is irrelevant to S', we add (q, S', r = 0) as a negative example.
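The generation procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tokenized-list corpus format, the `max_tries` cap on re-sampling, and the treatment of out-of-vocabulary queries are all assumptions made for illustration, and cosine similarity is assumed as the synonym test.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_irrelevant(q, english_words, emb, lam1=0.4):
    """True if neither q nor a close synonym of q occurs in the English side."""
    if q in english_words:
        return False  # direct match, so the paired sentence is likely relevant
    if q not in emb:
        return True   # assumption: without an embedding only exact match is checked
    sims = [cosine(emb[q], emb[w]) for w in english_words if w in emb]
    return max(sims, default=-1.0) <= lam1

def generate_samples(corpus, emb, stopwords, rng, lam1=0.4, max_tries=100):
    """Build (query, sentence, label) triples from (E, S) token-list pairs."""
    samples = []
    for e_words, s_words in corpus:
        for q in e_words:
            if q in stopwords:
                continue
            samples.append((q, s_words, 1))      # positive sample
            for _ in range(max_tries):           # hypothetical cap on re-sampling
                e_neg, s_neg = rng.choice(corpus)
                if is_irrelevant(q, e_neg, emb, lam1):
                    samples.append((q, s_neg, 0))  # one negative per positive
                    break
    return samples
```

A call such as `generate_samples(corpus, emb, stopwords, random.Random(0))` would then yield a roughly balanced set of triples, provided a negative can be found within `max_tries` draws.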

Cross-Lingual Relevance Model
We propose SECLR, a model that directly makes relevance classification judgments for queries and sentences of different languages without MT as an intermediate step by learning a cross-lingual embedding space between the two languages. Not only should translation of equivalent words in either language map to similar regions in the embedding space, but dot products between query and sentence words should be correlated with the probability of relevance. We assume the training set generation process (Section 3.1) provides us with a corpus of n query-sentence pairs along with their corresponding relevance judgements, i.e. $\mathcal{D} = \{(q_i, S_i, r_i)\}_{i=1}^{n}$.
We construct a bilingual vocabulary $V = V_Q \cup V_S$ and associate with it a matrix $W \in \mathbb{R}^{d \times |V|}$ where $w_x = W_{\cdot,x}$ is the word embedding associated with word $x \in V$.
When the query is a unigram q (which is true by design in our training data $\mathcal{D}$), we model the probability of relevance to a sentence S as:
$$p(r = 1 \mid q, S; W) = \sigma\left(\max_{s \in S} w_q^\top w_s\right)$$
where $\sigma$ denotes the logistic sigmoid function. In our evaluation setting, the query is very often a phrase $Q = [q_1, \ldots, q_{|Q|}]$. In this case, we require all query words to appear in a sentence in order for a sentence to be considered as relevant. Thus, we modify our relevance model to be:
$$p(r = 1 \mid Q, S; W) = \prod_{q \in Q} p(r = 1 \mid q, S; W)$$
Our only model parameter is the embedding matrix W which is initialized with pretrained monolingual word embeddings and learned via minimization of the cross entropy of the relevance classification task:
$$L_{rel} = -\log p(r \mid q, S; W)$$
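As an illustration of the relevance model, the following Python sketch scores a query against a sentence. It assumes that the unigram probability is a logistic sigmoid applied to the largest query-word/sentence-word dot product and that a phrase query multiplies the unigram probabilities; the dictionary-of-lists embedding table `W` and all function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def p_relevant_unigram(q, sentence, W):
    """Assumed form: sigmoid of the best-matching sentence word's dot product."""
    return sigmoid(max(dot(W[q], W[s]) for s in sentence))

def p_relevant_phrase(query, sentence, W):
    """All query words must match, so the unigram probabilities are multiplied."""
    p = 1.0
    for q in query:
        p *= p_relevant_unigram(q, sentence, W)
    return p

def relevance_loss(q, sentence, r, W):
    """Binary cross entropy of the relevance label r for a unigram query."""
    p = p_relevant_unigram(q, sentence, W)
    return -math.log(p) if r == 1 else -math.log(1.0 - p)
```

Because the only trainable parameters are the embeddings in `W`, a gradient step on `relevance_loss` would move the query word toward (or away from) its best-matching sentence word.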

Guided Alignment with Rationale Training
We can improve SECLR by incorporating additional alignment information as a secondary training objective, yielding SECLR-RT. Our intuition is that after training, the word $\hat{s} = \arg\max_{s \in S} w_s^\top w_q$ should correspond to a translation of q. However, it is possible that $\hat{s}$ simply co-occurs frequently with the true translation in our parallel data but its association is coincidental or irrelevant outside the training contexts. We use alignment information to correct for this. We run two SMT word alignment models, GIZA++ (Och and Ney, 2003) and Berkeley Aligner (Haghighi et al., 2009), on the original parallel sentence corpus. The two resulting alignments are concatenated as in Zbib et al. (2019) to estimate a unidirectional probabilistic word translation matrix $A \in [0, 1]^{|V_Q| \times |V_S|}$, such that A maps each word in the query language vocabulary to a list of document language words with different probabilities, i.e. $A_{q,s}$ is the probability of translating q to s and $\sum_{s \in V_S} A_{q,s} = 1$.
For each relevant training sample, i.e. (q, S, r = 1), we create a rationale distribution $\rho \in [0, 1]^{|S|}$ which is essentially a re-normalization of possible query translations found in S and represents our intuitions about which words $s \in S$ the query q should be most similar to in embedding space, i.e.
$$\rho_s = \frac{A_{q,s}}{\sum_{s' \in S} A_{q,s'}}$$
for $s \in S$. We similarly create a distribution under our model, $\alpha \in [0, 1]^{|S|}$, where
$$\alpha_s = \frac{\exp(w_q^\top w_s)}{\sum_{s' \in S} \exp(w_q^\top w_{s'})}$$
for $s \in S$. To encourage $\alpha$ to match $\rho$, we impose a Kullback-Leibler (KL) divergence penalty, denoted as:
$$L_{rat} = \mathrm{KL}(\rho \,\|\, \alpha)$$
to our overall loss function. The total loss for a single positive sample then will be a weighted sum of the relevance classification objective and the KL divergence penalty, i.e.
$$L = L_{rel} + \lambda_2 L_{rat}$$
where $\lambda_2$ is a relative weight between the classification loss and rationale similarity loss.
Note that we do not consider rationale loss for the following three types of samples: negative samples, positive samples where the query word is not found in the translation matrix, and positive samples where none of the translations of the query in the matrix are present in the source sentence.
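The rationale objective can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: the model-side distribution is assumed to be a softmax over query-sentence dot products, the translation matrix is stored as a hypothetical nested dictionary `A[q][s]`, and the function names are invented. Returning `None` from `rationale_distribution` covers the excluded cases where the query is missing from the matrix or none of its translations appears in the sentence.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rationale_distribution(q, sentence, A):
    """rho: translation probabilities of q re-normalized over the words of S."""
    weights = [A.get(q, {}).get(s, 0.0) for s in sentence]
    total = sum(weights)
    if total == 0.0:
        return None  # no rationale loss is applied to this sample
    return [w / total for w in weights]

def model_distribution(q, sentence, W):
    """alpha: softmax over dot products (assumed form of the model distribution)."""
    scores = [dot(W[q], W[s]) for s in sentence]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(rho, alpha):
    """KL(rho || alpha); terms with rho_s = 0 contribute nothing."""
    return sum(r * math.log(r / a) for r, a in zip(rho, alpha) if r > 0.0)

def total_loss(rel_loss, rho, alpha, lam2=3.0):
    """Classification loss plus the weighted rationale penalty for positives."""
    if rho is None:
        return rel_loss
    return rel_loss + lam2 * kl_divergence(rho, alpha)
```

The default `lam2=3.0` mirrors the value reported in the experiment settings; negative samples would simply bypass `total_loss` and use the classification loss alone.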

Dataset Generation from Parallel Corpus
The parallel sentence data for training our proposed method and all baselines includes the parallel data provided in the BUILD collections of both the MATERIAL and LORELEI (Christianson et al., 2018) programs for three low resource languages: Somali (SO), Swahili (SW), and Tagalog (TL) (each paired with English). Additionally, we include in our parallel corpus publicly available resources from OPUS (Tiedemann, 2012), and lexicons mined from Panlex (Kamholz et al., 2014) and Wiktionary. Statistics of these parallel corpora and augmented data are shown in Table 1 and

Query Sets and Evaluation Sets
We evaluate our sentence-selection model on English (EN) queries over three collections in SO, SW, and TL recently made available as part of the IARPA MATERIAL program. In contrast to our training data which is synthetic, our evaluation datasets are human-annotated for relevance between real-world multi-domain queries and documents. For each language there are three partitions (Analysis, Dev, and Eval), with the former two being smaller collections intended for system development, and the latter being a larger evaluation corpus. In our main experiments we do not use Analysis or Dev for development and so we report results for all three (the ground truth relevance judgements for the TL Eval collection have not been released yet so we do not report Eval for TL). See Table 3 for evaluation statistics. All queries are text. The speech documents are first transcribed with an ASR system (Ragni and Gales, 2018), and the 1-best ASR output is used in the sentence selection task. Examples of the evaluation datasets are shown in Appendix B. We refer readers to Rubino (2020) for further details about MATERIAL test collections used in this work. While our model and baselines work at the sentence-level, the MATERIAL relevance judgements are only at the document level. Following previous work on evaluation of passage retrieval, we aggregate our sentence-level relevance scores to obtain document-level scores (Kaszkiel and Zobel, 1997; Wade and Allan, 2005; Fan et al., 2018; Inel et al., 2018; Akkalyoncu Yilmaz et al., 2019). Given a document D = [S_1, . . . , S_|D|], which is a sequence of sentences, and a query Q, following Liu and Croft (2002) we assign a relevance score by:

Experiment Settings
We initialize English word embeddings with word2vec (Mikolov et al., 2013), and initialize SO/SW/TL word embeddings with FastText (Grave et al., 2018). For training we use a SparseAdam (Kingma and Ba, 2015) optimizer with learning rate 0.001. The hyperparameter λ 2 in Section 3.3 is set to be 3 so that L rel and λ 2 L rat are approximately on the same scale during training. More details on experiments are included in Appendix C.

Baselines
Cross-Lingual Word Embeddings. We compare our model with three other cross-lingual embedding methods, Bivec (Luong et al., 2015), MUSE (Lample et al., 2018), and SID-SGNS (Levy et al., 2017). Bivec and SID-SGNS are trained using the same parallel sentence corpus as the dataset generation algorithm used to train SECLR; thus, Bivec and SID-SGNS are trained on parallel sentences while SECLR is trained on query-sentence pairs derived from that corpus. We train MUSE with the bilingual dictionary from Wiktionary that is used in previous work. The SO-EN, SW-EN and TL-EN dictionaries have 7633, 5301, and 7088 words respectively. Given embeddings W from any of these methods, we compute sentence-level relevance scores similarly to our model but use cosine similarity in place of the dot product, since these models are optimized for this comparison function (Luong et al., 2015; Lample et al., 2018; Levy et al., 2017). Document aggregation scoring is handled identically to our SECLR models (see Section 4.2).

MT+IR. We also compare to pipelines of NMT (Niu et al., 2018) and SMT (Koehn et al., 2007) with monolingual IR, where both MT systems are trained on the same parallel sentence data as our SECLR models. The 1-best output from each MT system is then scored with Indri (Strohman et al., 2005) to obtain relevance scores. Details of NMT and SMT systems are included in Appendix C.2.

PSQ. To implement the PSQ model of Darwish and Oard (2003), we use the same alignment matrix as in rationale training (see Section 3.3) except that here we normalize the matrix such that $\forall s \in V_S, \sum_{q \in V_Q} A_{q,s} = 1$. Additionally, we embed the PSQ scores into a two-state hidden Markov model which smooths the raw PSQ scores with a background unigram language model (Xu and Weischedel, 2000). The PSQ model scores each sentence and then aggregates the scores to document level as in Section 4.2.
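The difference between the two normalizations of the alignment matrix can be made concrete with a small Python sketch. It assumes, purely for illustration, that the matrix starts as raw alignment counts stored in a nested dictionary `counts[q][s]`; the function names are hypothetical, and the sketch does not cover PSQ scoring or the HMM smoothing.

```python
def normalize_rows(counts):
    """Rationale-training view: for each query word q, sum over s of A[q][s] is 1."""
    out = {}
    for q, row in counts.items():
        total = sum(row.values())
        out[q] = {s: c / total for s, c in row.items()} if total else {}
    return out

def normalize_columns(counts):
    """PSQ view: for each sentence-language word s, sum over q of A[q][s] is 1."""
    col_totals = {}
    for row in counts.values():
        for s, c in row.items():
            col_totals[s] = col_totals.get(s, 0.0) + c
    return {
        q: {s: c / col_totals[s] for s, c in row.items() if col_totals[s]}
        for q, row in counts.items()
    }
```

Row normalization answers "how is q translated?", whereas column normalization answers "which query words could have produced s?", which is the direction needed when weighting query term counts observed in the sentence collection.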
Multilingual XLM-RoBERTa. We compare our model to the cross-lingual model XLM-RoBERTa (Conneau et al., 2020), which in previous research has been shown to have better performance on low-resource languages than multilingual BERT (Devlin et al., 2019). We use the Hugging Face implementation (Wolf et al., 2019) of XLM-RoBERTa (Base). We fine-tuned the model on the same augmented dataset of labeled query-sentence pairs as the SECLR models, but we apply the XLM-RoBERTa tokenizer before feeding examples to the model. We fine-tuned the model for four epochs using an AdamW optimizer (Loshchilov and Hutter, 2019) with learning rate $2 \times 10^{-5}$. Since XLM-RoBERTa is pretrained on Somali and Swahili but not Tagalog, we only compare our models to XLM-RoBERTa on Somali and Swahili.

Results and Discussion
We report Mean Average Precision (MAP) of our main experiment in Table 4 (SO & SW) and Table 5 (TL). Overall, we see that SECLR-RT consistently outperforms the other baselines in 15 out of 16 settings, and in the one case where it is not the best (SW Dev text), SECLR is the best. SECLR-RT is statistically significantly better than the best baseline on all Eval partitions. Since Analysis/Dev are relatively small, only three out of 12 Analysis/Dev settings are significant. The differences between SECLR and SECLR-RT can be quite large (e.g., as large as 70.4% relative improvement on SO Eval text), suggesting that the rationale training provides a crucial learning signal to the model. Bivec and MUSE under-perform both of our model variants across all test conditions, suggesting that for the sentence selection task the relevance classification objective is more important than learning monolingual distributional signals. Curiously, SID-SGNS is quite competitive with SECLR, beating it on SO and SW Eval (both modalities) and TL Dev speech (five out of 16 test conditions) and is competitive with the other baselines. Again, the rationale training proves more effective as SID-SGNS never surpasses SECLR-RT.
While MT+IR is a competitive baseline, it is consistently outperformed by PSQ across all test conditions, suggesting that in low-resource settings it is not necessary to perform full translation to achieve good sentence selection performance. SMT, PSQ, and SECLR-RT all make use of the same word-alignment information but only SMT generates translations, adding additional evidence to this claim. PSQ and SECLR are close in performance on Analysis and Dev sets with SECLR eking out a slight advantage on seven of 12 Analysis/Dev set conditions. On the larger Eval partitions, it becomes clearer that PSQ is superior to SECLR, suggesting that the relevance classification objective is not as informative as word alignment information. The relevance classification and trained rationale objectives appear to capture slightly different information; SECLR-RT, which uses both, out-performs PSQ across all 16 test conditions.
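For readers unfamiliar with the metric, the following Python sketch shows one standard way to compute MAP from ranked document lists. It is a generic illustration, not the MATERIAL scoring tool: the function names and the dictionary-based input format are hypothetical, and returning 0 for a query with no relevant documents is an assumption.

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments):
    """Average of per-query AP; rankings and judgments are keyed by query id."""
    aps = [average_precision(rankings[q], judgments.get(q, set())) for q in rankings]
    return sum(aps) / len(aps) if aps else 0.0
```

Here `rankings[q]` would hold the documents sorted by their aggregated relevance score for query `q`, and `judgments[q]` the set of documents annotated as relevant.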

Training Data Ablation Study
In Section 5, we have shown that SECLR-RT consistently out-performs all baselines across all languages. Since this work targets cross-language sentence selection in a low-resource setting, we perform a training data ablation study to understand how training data size affects effectiveness.
We performed the ablation study for our two models SECLR and SECLR-RT, and the two strongest baseline methods PSQ and SID-SGNS. To simulate further the scenario of data scarcity, we sub-sampled our parallel corpus uniformly at random for 5%, 10%, 25%, 50% of the sentence pairs of the original corpus. Each sentence pair in the parallel corpus is sampled with equal probability regardless of sentence length. For consistency, for each sample size, the same sampled parallel corpus is used across all models. The word alignment probability matrix used by PSQ and SECLR-RT is generated from the same sampled corpus. Since we tune the vocabulary size on the Dev set, for fair comparison we only report MAP scores on the Analysis and Eval sets.
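The sub-sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus is a Python list of sentence pairs, the function name and the fixed seed are hypothetical, and the sample size is obtained by rounding. Fixing the seed is one simple way to ensure that every model sees the same sampled corpus at a given sample size.

```python
import random

def subsample_corpus(corpus, fraction, seed=0):
    """Draw a uniform random subset of sentence pairs, independent of length."""
    rng = random.Random(seed)
    k = round(len(corpus) * fraction)
    # Sample indices without replacement and keep the original corpus order.
    indices = sorted(rng.sample(range(len(corpus)), k))
    return [corpus[i] for i in indices]

# One shared sample per size, reused for every model and for the alignment matrix.
# samples = {f: subsample_corpus(corpus, f) for f in (0.05, 0.10, 0.25, 0.50)}
```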
We plot MAP scores of the four models as a function of percentage of data sampled in Figure 1. Overall, we see that SECLR-RT consistently outperforms other baselines across all sample sizes in 9 out of 10 settings, and in the one case where it does not yield consistent improvement (Tagalog Analysis speech), SECLR-RT achieves comparable performance to PSQ.
In the low-resource setting when the sample size is 5% or 10%, SECLR consistently underperforms other models, confirming our observation that SECLR is sensitive to noise and vulnerable to learning co-occurrences of word pairs that are in fact irrelevant. When the sample size is 5% or 10%, PSQ consistently achieves better performance than SID-SGNS and SECLR (although still under-performing SECLR-RT), indicating that alignment-based methods are more robust to noise and especially useful when data is extremely scarce. The fact that SECLR-RT consistently out-performs SECLR by a wide margin for small sample sizes indicates the necessity and effectiveness of incorporating alignment-based information into SECLR to improve the robustness of the model and learn more precise alignments.

Alleviating the Hubness Problem
In this section, we show that by incorporating alignment information through rationale training, SECLR-RT significantly alleviates the hubness problem present in the trained cross-lingual embedding space produced by SECLR. Previous research on cross-lingual word embeddings has observed that a high-dimensional representation space with a similarity-based metric often induces a hub structure (Dinu and Baroni, 2015). Specifically, in a high-dimensional space (e.g., a cross-lingual word embedding space) defined with a pairwise similarity metric (e.g., cosine similarity), there exist a few vectors that are the nearest neighbors of many other vectors. Such vectors are referred to as "hubs." The hub structure is problematic in IR since the hub vectors are often wrongly predicted as relevant and similar in meaning to queries that are in fact irrelevant (Radovanović et al., 2010). Let $V_Q$ and $V_S$ be the embedding spaces for the query and sentence collection languages respectively. We define the size of the neighborhood of embeddings around $y \in V_S$ as $N_k(y) = |\{x \in V_Q \mid r_x(y) \le k\}|$ where $r_x(y)$ is the rank of y if we order $V_S$ by similarity to x from highest to lowest, and k is a positive integer. A large value of $N_k(y)$ indicates that y is similar to many $x \in V_Q$, and suggests that y is a likely hub in embedding space.
Following Radovanović et al. (2010), we use $S_{N_{10}} = \mathbb{E}_{y \in V_S}[(N_{10}(y) - \mu)^3 / \sigma^3]$ to measure the skewness of the distribution of $N_{10}$, where $\mu$ and $\sigma$ refer to the mean and standard deviation of $N_{10}(y)$ respectively. Since cosine similarity is more frequently used as the similarity metric in hubness analysis, we re-train SECLR and SECLR-RT by replacing the dot product similarity metric with cosine similarity and still get performance comparable to Table 4 and Table 5.
We report $S_{N_{10}}$ scores for SECLR and SECLR-RT respectively in Table 6. We see that SECLR-RT consistently has a lower $S_{N_{10}}$ value compared to SECLR on all three languages, indicating that the extra alignment information incorporated with rationale training is helpful in reducing hubness.
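The hubness statistic can be computed with a short Python sketch. It is illustrative only: embeddings are plain lists of floats, the function names are hypothetical, the population standard deviation is assumed, and the brute-force ranking is quadratic in vocabulary size, so a real vocabulary would call for a vectorized implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighborhood_sizes(query_vecs, sent_vecs, k=10):
    """N_k(y): number of query vectors x that rank y among their k nearest."""
    counts = [0] * len(sent_vecs)
    for x in query_vecs:
        order = sorted(range(len(sent_vecs)),
                       key=lambda j: cosine(x, sent_vecs[j]), reverse=True)
        for j in order[:k]:
            counts[j] += 1
    return counts

def skewness(values):
    """Third standardized moment, E[(v - mu)^3] / sigma^3."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0.0:
        return 0.0
    return sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)
```

A strongly positive `skewness(neighborhood_sizes(...))` indicates that a few sentence-language words absorb most nearest-neighbor slots, which is the hub structure described above.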

Conclusion
In this work, we presented a supervised cross-lingual embedding-based query relevance model, SECLR, for cross-language sentence selection and also applied a rationale training objective to further increase model performance. The resulting SECLR-RT model outperforms a range of baseline methods on a cross-language sentence selection task. Our data ablation and hubness studies further indicate our model's efficacy in handling low-resource settings and reducing hub structures. In future work, we hope to apply our sentence-level query relevance approach to downstream NLP tasks such as query-focused summarization and open-domain question answering.

A Extra Training Dataset Details
When we train SECLR and SECLR-RT via data augmentation, we randomly split the parallel corpus into train set (96%), validation set (3%) and test set (1%). We then use the dataset augmentation technique introduced in Section 3.1 to generate positive and negative samples for each set.
Augmenting the dataset upon the split corpus allows us to achieve more independence between train/validation/test set compared to splitting the dataset augmented on the entire parallel corpus. Note that we only use the validation set for early stopping but we do not tune hyperparameters with the validation set. We preprocess the parallel corpus, the query collection and the sentence collection with the Moses toolkit (Koehn et al., 2007). The same preprocessing steps are used for all four languages (English, Somali, Swahili, Tagalog). First, we use the Moses punctuation normalizer to normalize the raw text. Second, we use the Moses tokenizer to tokenize the normalized text. Finally, we remove the diacritics in the tokenized text as a cleaning step.
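The split and the final cleaning step might look roughly like the Python sketch below. The Moses normalization and tokenization steps are not reproduced here. The function names, the seed, and the use of Unicode decomposition to drop combining marks are assumptions, since the text does not say how diacritics were removed.

```python
import random
import unicodedata

def split_corpus(corpus, seed=0, train=0.96, valid=0.03):
    """Randomly split sentence pairs into train/validation/test portions."""
    rng = random.Random(seed)
    shuffled = list(corpus)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # the remainder (about 1%) is the test set

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (assumed method)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Data augmentation would then be run separately on each of the three returned portions, so that no augmented sample in one set is derived from a sentence pair in another.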

B Examples of Evaluation Data
In this section we demonstrate some examples from the MATERIAL dataset used for evaluation. Example queries include: "evidence", "human rights", "chlorine", "academy", "ratify", "constitution", "carnage" and "Kenya". On average only 0.13% of the documents in the Eval collection are relevant to each query, which makes the task hard.
Here are two examples from Somali Analysis text. Because the documents are long, here we only include the relevant segment of a long relevant document. In the first example, the English query is "contravention" and the relevant segment of a long relevant document (translated from Somali to English by human) is "the security forces captured military equipment coming into the country illegally." This segment is relevant to the query because of the word "illegally".
Here is another example where the English query is "integrity". The relevant segment of a long relevant document (translated from Somali to English by human) is "Hargeisa (Dawan) -Ahmed Mohamed Diriye (Nana) the member of parliament who is part of the Somaliland house of representatives has accused the opposition parties (Waddani and UCID) of engaging in acts of national destruction, that undermines the existence and sovereignty of the country of Somaliland." This segment is relevant to the query because of the word "sovereignty".
Since there are multiple ways to translate a word and since MT performance is relatively poor in low-resource settings, the task is far more challenging than a simple lexical match between queries and translated documents.

C Extra Experimental Details
In this section we include extra implementation and experiment details that are not included in the main paper. Information already included in the main paper is not repeated here for conciseness.

C.1 Model and Training Details
We train our SECLR and SECLR-RT models on Tesla V100 GPUs. Each model is trained on a single GPU. We report training time of SECLR and SECLR-RT on Somali, Swahili and Tagalog in  As is discussed in Section 3.2, the only trainable model parameters of SECLR and SECLR-RT are the word embedding matrices. Thus, SECLR and SECLR-RT have the same number of model parameters. We report the number of trainable parameters of both models on Somali, Swahili and Tagalog in Table 8.
Table 8: Number of trainable parameters of SECLR and SECLR-RT.
             Somali    Swahili   Tagalog
# Params.    14.03M    22.31M    21.35M

512, feed-forward network size of 2048, 8 attention heads, and residual connections. We adopt layer normalization and label smoothing. We tie the output weight matrix with the source and target embeddings. We use Adam optimizer with a batch size of 2048 words. We checkpoint models every 1000 updates. Training stops after 20 checkpoints without improvement. During inference, the beam size is set to 5.
Our SMT system uses the following feature functions: phrase translation model, distance-based reordering model, lexicalized reordering model, 5-gram language model on the target side, word penalty, distortion, unknown word penalty and phrase penalty.
We used backtranslation in earlier versions of our MT systems. Following previous work (Niu et al., 2018), we trained a bidirectional NMT model that backtranslates source or target monolingual data without an auxiliary model. This backtranslation-based model was the state-of-the-art MT model on Somali and Swahili when the above paper was published.
Later, we discovered that decoder pretraining with monolingual data achieves better performance compared to backtranslation. The decoder pretraining scheme we use now is most similar to that of Ramachandran et al. (2017), where the authors show state-of-the-art results on the WMT English to German translation task with decoder pretraining.
There is no WMT benchmark for Somali, Swahili or Tagalog, but we use state-of-the-art techniques in our MT systems. We have also experimented with the bilingual data selection method (Junczys-Dowmunt, 2018). However, this technique does not work well, mostly because low-resource MT systems are not good enough to do scoring.

D Extra Experimental Results
In this section we include extra experimental results that are not included in the main text due to limited space.

D.1 SECLR Architecture Exploration
When we are designing the SECLR model, we experiment with adding LSTMs and using the dot product between LSTM hidden states to compute pairwise similarity between the query and the sentence. We report MAP scores of SECLR with LSTM in Table 9. Experimental results show that adding LSTMs reduces model performance consistently across all three languages. We conjecture that in low-resource settings, contextualized models create spurious correlations (Section 3.3). In fact, the XLM-RoBERTa baseline, which captures context effectively via self-attention, also underperforms our SECLR model consistently.

D.2 Word Embeddings Initialization
In our SECLR and SECLR-RT models, we initialize word embeddings with monolingual word embeddings in English, Somali, Swahili and Tagalog (Mikolov et al., 2013; Grave et al., 2018). One natural question is whether we can achieve performance improvement if we directly initialize with cross-lingual word embeddings. Because SID-SGNS out-performs both Bivec and MUSE consistently by a wide margin (Table 4 and Table 5), in this experiment we initialize SECLR-RT with the cross-lingual embeddings produced by SID-SGNS. The results of monolingual and cross-lingual embedding initialization (SID-SGNS) are shown in Table 10. We see that overall monolingual initialization slightly out-performs cross-lingual initialization. Monolingual initialization yields better performance in eight out of 12 Analysis/Dev set conditions and a MAP improvement of 1.7 points when we take the average across Analysis/Dev and all three languages.