CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval

We present the Charles University system for the MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval. The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages. Our solutions to both subtasks rely on the translate-test approach. We first translate the unlabeled examples into English using a multilingual machine translation model. Then, we run inference on the translated data using a strong task-specific model. Finally, we project the labeled data back into the original language. To keep the inferred tags at the correct positions in the original language, we propose a method based on scoring the candidate positions using a label-sensitive translation model. In both settings, we experiment with finetuning the classification models on the translated data. However, due to a domain mismatch between the development data and the shared task validation and test sets, the finetuned models could not outperform our baselines.


Introduction
Pre-trained language models reach state-of-the-art results in most current natural language processing (NLP) tasks. Whereas in high-resource languages such as English, we observe in-context learning capabilities and emergent abilities (Wei et al., 2022), in less-resourced languages, the results are more modest (Lai et al., 2023a), mostly due to the lack of data needed to train really large models. Moreover, there is usually not enough task-specific data available in these languages. This leads to attempts to reuse the (high-resource) language model capabilities in other (low-resource) languages.
Most of the proposed methods are either based on transfer learning (Lauscher et al., 2020; Yu and Joty, 2021; Zheng et al., 2021; Schmidt et al., 2022) or machine translation (MT), both during training and at test time (e.g., mentioned as a baseline by Conneau et al., 2018, 2020).
* The author order was determined by a coin toss.
The MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval aims to explore these methods further, applied to many low-resource languages. The participants were tasked to build models for two subtasks: named entity recognition (NER) and question answering (QA).
The shared task setup is inspired by the XTREME-UP dataset (Ruder et al., 2023), which focuses on the most needed tasks for under-resourced languages: gathering data in a digital form (speech recognition, optical character recognition, transliteration) and making information in these languages accessible (NER, QA, retrieval for QA). This dataset contains a relatively small amount of data for multiple tasks on low-resource languages, featuring 88 languages in total, including QA datasets for 4 languages and NER datasets for 20 languages.
The shared task evaluation campaign focused on Igbo, Indonesian (QA only), Alsatian (mistakenly labeled as Swiss German on the task website), Turkish, Uzbek (QA only), and Yoruba. Out of these languages, only Indonesian is among the XTREME-UP QA datasets, and only Igbo and Yoruba have available NER task data in the benchmark. Upon releasing the validation data close to the end of the campaign, Azerbaijani was added as a surprise language for evaluation (with no data for QA or NER in XTREME-UP).
This setting left the participants with a choice to either collect external training data for languages not present in the benchmark (which was implicitly discouraged by the inclusion of the surprise language) or to develop language-agnostic systems.
Even though a lot of research effort is invested in developing systems that are inherently multilingual, typically based on pre-trained massively multilingual models (Artetxe and Schwenk, 2019; Lauscher et al., 2020; Pfeiffer et al., 2020; Xue et al., 2021, inter alia), our submission is based on the translate-test approach, which was recently shown to perform better than the community previously thought (Artetxe et al., 2023). We rely on the translation quality of a multi-lingual machine translation (MT) system, combined with the strong performance of pre-trained LLMs in English. The main ideas common to our approaches to both subtasks are described in Section 2. The particularities of our models that are specific to the NER and QA subtasks are presented in Sections 3 and 4, respectively, including our results on those tasks.
arXiv:2310.16528v1 [cs.CL] 25 Oct 2023
Overall, we find that the translate-test approach can be useful in a multilingual setting. Our results do not outperform supervised, language-specific models, but are considerably better than zero-shot approaches.
To maximize reproducibility, we built our systems using an automated end-to-end development pipeline implemented in Snakemake (Köster and Rahmann, 2012); we release the code online.

Main Ideas
In both tasks, we employ the translate-test approach, which can be summarized in the following three steps: First, we translate unlabeled examples from the task language into English using a multilingual MT model. Second, we use a pre-trained LLM to perform the task, i.e., to assign labels to the translated example. Third, we use a label-aware translation model to project the inferred labels back to the original language.
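The three steps can be sketched as a simple composition of functions. This is only an illustration of the control flow, not the released pipeline: the function name `translate_test` and the three callables passed to it are hypothetical placeholders for the MT model, the English task model, and the label projection step.

```python
from typing import Any, Callable


def translate_test(
    example: str,
    translate_to_english: Callable[[str], str],
    predict_english: Callable[[str], Any],
    project_labels: Callable[[str, Any, str], Any],
) -> Any:
    """Run the translate-test recipe on one unlabeled example.

    All three callables are hypothetical stand-ins for real models.
    """
    # Step 1: translate the example from the task language into English.
    english = translate_to_english(example)
    # Step 2: label the English text with a task-specific English model.
    english_labels = predict_english(english)
    # Step 3: project the labels back onto the original-language text.
    return project_labels(english, english_labels, example)
```

Keeping the three stages behind separate callables makes it easy to swap the task model (NER or QA) while reusing the translation and projection steps.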
Translation into English. In the first step, we translate the unlabeled data into English. In both subtasks, we use the NLLB-3.3B multilingual MT model (Costa-jussà et al., 2022). We discuss the task-specific data processing details further in Sections 3 and 4.
Task-specific models. In each subtask, we apply a RoBERTa-large model (Liu et al., 2019) that has been finetuned on the task. The model predicts labels for the English data. For NER, these are BIO-encoded labels, marking the span and type of each named entity in the example. Specifically, the output is a sequence of labels of the same length as the input sentence. For QA, the labels mark a span in the context representing the answer. This span is encoded using two numbers, which denote character offsets in the detokenized version of the context paragraph.
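To make the two label formats concrete, the following minimal sketch converts between BIO label sequences and entity spans and shows how a QA answer is read off from character offsets. The function names are illustrative, and two details are assumptions rather than facts from the paper: a stray `I-` label is treated as the start of a new entity, and the QA end offset is taken to be exclusive.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Convert a BIO label sequence into a list of entity spans."""
    spans: List[Span] = []
    start, ent_type = None, None
    for i, label in enumerate(labels):
        # The open entity continues only on an I- label of the same type.
        continues = label.startswith("I-") and label[2:] == ent_type
        if start is not None and not continues:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        # Open a new entity on B-, or on a stray I- (assumption).
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, ent_type = i, label[2:]
    if start is not None:
        spans.append((start, len(labels), ent_type))
    return spans


def spans_to_bio(spans: List[Span], length: int) -> List[str]:
    """Rebuild a BIO label sequence of the given length from spans."""
    labels = ["O"] * length
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"
    return labels


# QA: the answer is a pair of character offsets into the context.
context = "Prague is the capital of the Czech Republic."
answer_start, answer_end = 0, 6  # end offset assumed exclusive
answer = context[answer_start:answer_end]
```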
Translation into the target language. The translate-test approach is less challenging when the labels are language-independent, which is the case for both subtasks. However, span labeling tasks (such as NER and QA) require careful handling of the projection of the spans, i.e., we need to find the corresponding spans in the original language.
Our systems adopt the label projection method for cross-lingual transfer, originally meant for the translate-train approach (Chen et al., 2023). The authors of that paper finetune the NLLB model to translate texts containing inserted tags so that the tags generated in the translation mark the parts equivalent to those tagged in the source sentence. In contrast to the original use case of generating the whole target sentence with tags, we already know the target sentence in the shared task scenario. Therefore, we are only interested in the placement of the tags.
To find the best possible placement of the tags, we propose to use the aforementioned finetuned model as a scorer. We place the tags at all possible positions (subject to minimum/maximum span length constraints) and select the highest-scoring candidate. We then either reconstruct the label sequence (in the case of NER) or extract the appropriate passage from the context (for QA).
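The candidate search can be sketched as follows. This is a simplified illustration under several assumptions: candidates are enumerated over whitespace tokens, tags are written as separate `[` and `]` tokens, and the scorer is passed in as a function. In the actual system, the score would come from the tag-preserving translation model; the `toy_score` function below is a purely hypothetical stand-in so that the example runs without any model.

```python
from typing import Callable, Iterator, List, Tuple


def candidate_spans(n_tokens: int, min_len: int = 1, max_len: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield all (start, end_exclusive) spans within the length limits."""
    for start in range(n_tokens):
        for end in range(start + min_len, min(start + max_len, n_tokens) + 1):
            yield start, end


def insert_tags(tokens: List[str], start: int, end: int) -> str:
    """Return the sentence with bracket tags around tokens[start:end]."""
    return " ".join(tokens[:start] + ["["] + tokens[start:end] + ["]"] + tokens[end:])


def project_span(
    tagged_english: str,
    target_tokens: List[str],
    score: Callable[[str, str], float],
    min_len: int = 1,
    max_len: int = 5,
) -> Tuple[int, int]:
    """Pick the tag placement in the target sentence with the highest score."""
    return max(
        candidate_spans(len(target_tokens), min_len, max_len),
        key=lambda span: score(tagged_english, insert_tags(target_tokens, *span)),
    )


def toy_score(tagged_source: str, tagged_candidate: str) -> float:
    """Hypothetical scorer: rewards tagged tokens shared with the source span."""
    def inside(text: str) -> List[str]:
        tokens = text.split()
        return tokens[tokens.index("[") + 1 : tokens.index("]")]

    source_tokens = set(inside(tagged_source))
    candidate = inside(tagged_candidate)
    hits = sum(token in source_tokens for token in candidate)
    misses = len(candidate) - hits
    return hits - 0.5 * misses


target = "Chinua Achebe wurde in Ogidi geboren".split()
best = project_span("[ Chinua Achebe ] was born in Ogidi", target, toy_score)
```

The number of candidates grows with the sentence length times the maximum span length, so the length constraints directly bound the number of scorer calls per span.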

Named Entity Recognition
The goal of the NER subtask was to classify words and phrases into one of four categories: person (PER), organization (ORG), location (LOC), and date (DAT). Since most state-of-the-art NER classifiers (including the one we used) use a richer set of labels, we apply rule-based mapping to reduce the label set to the four categories: geopolitical entities (GPE) and facilities (FAC) are replaced with LOC, time with DAT.
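A rule-based mapping of this kind could look like the sketch below. Only the GPE, FAC, and time rules are taken from the text above; the remaining entries (for example `PERSON` to `PER` and `DATE` to `DAT`), the exact tag names, and the choice to map every other type to `O` are assumptions made for illustration.

```python
from typing import Dict, List

LABEL_MAP: Dict[str, str] = {
    # Rules stated in the text.
    "GPE": "LOC",
    "FAC": "LOC",
    "TIME": "DAT",
    # Assumed rules for the remaining shared task categories.
    "PERSON": "PER",
    "PER": "PER",
    "ORG": "ORG",
    "LOC": "LOC",
    "DATE": "DAT",
}


def map_label(bio_label: str) -> str:
    """Map one BIO label from the rich label set to the four-category set."""
    if bio_label == "O":
        return "O"
    prefix, _, entity_type = bio_label.partition("-")
    target = LABEL_MAP.get(entity_type)
    # Entity types without a counterpart are dropped (assumption).
    return f"{prefix}-{target}" if target else "O"


def map_labels(bio_labels: List[str]) -> List[str]:
    """Apply the mapping to a whole label sequence."""
    return [map_label(label) for label in bio_labels]
```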
The scheme of the translate-test pipeline for this task is shown in Figure 1.
Finetuning. To overcome the domain mismatch, we finetuned the tNER models using the MasakhaNER data.
We translated the MasakhaNER training data into English and performed the span projection the same way as at inference time.The finetuning step serves not only as domain adaptation to news stories from the non-English speaking world but also as an adaptation to texts which have been automatically translated from low-resource languages.
Results. Table 1 presents the results on the MasakhaNER 1 dataset. Our translate-test approach significantly outperforms zero-shot transfer from English using the XLM-R (Conneau et al., 2020) and XLM-V (Liang et al., 2023) models; however, there is still a large performance gap between the translate-test approach and supervised in-language training.
The MasakhaNER 2 results are shown in Table 2. Similarly to MasakhaNER 1, our results are strictly worse than supervised training. The second line of the table shows the results of a model trained on MasakhaNER 1 but tested on MasakhaNER 2, which contains ten more languages than the first dataset. The results on these additional languages (shown in boxes) represent zero-shot transfer between African languages. Our translate-test approach via English is better than zero-shot transfer between African languages for 5 of the 10 languages.
When compared to related work, our results without finetuning (average score 61.3%) outperform transfer from English using mDeBERTav3 (He et al., 2023) (average score 55.5%). However, they are worse than the translate-train results reported by Chen et al. (2023) (average score 63.4%), who used additional parallel data with projected labels for training.
On both MasakhaNER benchmarks, the Ontonotes5 model is slightly better than CoNLL 2003. Finetuning (which involves training data of the respective datasets) leads to consistent improvements. On MasakhaNER 2, the finetuned model outperforms Chen et al. (2023); however, the training data setups are not easily comparable.
The results on the shared task validation data are in Table 3. Because of the domain mismatch (the shared task validation data are not local news but rather Wikipedia articles), the original Ontonotes5 model performs better. Based on this observation, we decided to use the pipeline with the original Ontonotes5 model without finetuning. We omit Yoruba from calculating the average score because most entities are left without annotation in the data.

Question Answering
The goal of this task is to find an answer to a given question within a given context. In the generative version of this task, the answer may not be taken from the context directly. Figure 2 shows the question-answering processing pipeline we use in our experiments.
Data Preprocessing. The XTREME-UP datasets for QA consist of three fields: the question, the context, and the answer. Since the context might be several sentences long, we apply sentence splitting using wtpsplit (Minixhofer et al., 2023). Since the toolkit does not support Alsatian or Swahili, we use the English variant instead. After sentence splitting, we translate everything into English using the NLLB model. Since NLLB does not support Alsatian, we set the source language to German.
Answering Questions. For the extractive question answering task, we use a RoBERTa-based model finetuned for question answering to mark the answer spans in the English context. Once the spans are found, we insert tags into the English sentence. To find the right spans in the original language, we use the tag-preserving NLLB model as a scorer and select the highest-scoring span according to the model.
No Answer Classification. Since there are examples with no answer in XTREME-UP, we train a classifier to detect such cases. We again use the QA-tuned RoBERTa, which we finetune for 3 epochs on the translated XTREME-UP data. We set the learning rate to $10^{-5}$, weight decay to 0.01, and keep the default values of the rest of the hyper-parameters.
The classifier achieves 93% accuracy on the development set. However, because the shared task validation set contains only very few examples with no answer, we decided not to use this classifier in our submissions.
In-domain Finetuning. We also implemented in-domain finetuning of the QA RoBERTa model on the XTREME-UP dataset translated into English. Because the answers are represented as spans within the context, we use the same technique to project the spans onto the English translation of the context as we use for projecting spans to the original language. Performing grid search and measuring model performance on the development set, we found that a learning rate of $5 \times 10^{-6}$, gradient norm of 1, warmup ratio of 0.5, and weight decay of 0.1 are the most suitable hyper-parameters.
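For reference, the two reported hyper-parameter settings can be collected in a small configuration object. This is a hedged sketch, not the authors' training code: the class and field names are hypothetical, the gradient norm of 1 is interpreted here as a gradient clipping threshold, and values not reported in the text are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyper-parameters reported in the text; None means not reported."""
    learning_rate: float
    weight_decay: float
    num_epochs: Optional[int] = None
    max_grad_norm: Optional[float] = None  # assumed to mean gradient clipping
    warmup_ratio: Optional[float] = None

    def warmup_steps(self, total_steps: int) -> int:
        """Number of warmup steps implied by the warmup ratio."""
        return int((self.warmup_ratio or 0.0) * total_steps)


# No-answer classifier: 3 epochs, learning rate 1e-5, weight decay 0.01.
NO_ANSWER_CLASSIFIER = FinetuneConfig(learning_rate=1e-5, weight_decay=0.01, num_epochs=3)

# In-domain QA finetuning: values selected by grid search on the development set.
QA_IN_DOMAIN = FinetuneConfig(
    learning_rate=5e-6,
    weight_decay=0.1,
    max_grad_norm=1.0,
    warmup_ratio=0.5,
)
```

With a warmup ratio of 0.5, half of all optimizer steps are spent in warmup; for example, `QA_IN_DOMAIN.warmup_steps(1000)` gives 500.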
Using Generative Models. We noticed that the shared task validation data did not actually contain examples of extractive question answering. Instead, the answers were likely written by a human annotator. Therefore, we decided also to submit a contrastive experiment using a generative model, namely Llama 2 (Touvron et al., 2023). For the generation, we use the prompt "Context: {context} Question: {question} Short answer:". We apply rule-based postprocessing to remove potential continuations generated after the answer. Details can be found in the corresponding Snakefile in the code repository.
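The prompt construction and a possible postprocessing step can be sketched as below. The prompt template is the one quoted above; the postprocessing rules (cutting at the first line break or at a repeated "Context:" or "Question:" marker) are an assumption for illustration, and the actual rules are those in the Snakefile in the code repository.

```python
PROMPT_TEMPLATE = "Context: {context} Question: {question} Short answer:"

# Markers assumed to signal that the model continued past the answer.
STOP_MARKERS = ("\n", "Context:", "Question:")


def build_prompt(context: str, question: str) -> str:
    """Fill the prompt template with one QA example."""
    return PROMPT_TEMPLATE.format(context=context.strip(), question=question.strip())


def postprocess(generated: str, prompt: str) -> str:
    """Keep only the short answer from the raw model output."""
    # Some generation APIs echo the prompt; drop it if present.
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    text = text.lstrip()
    # Cut at the earliest marker of a continuation after the answer.
    cut = len(text)
    for marker in STOP_MARKERS:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


prompt = build_prompt("Lagos is a city in Nigeria.", "Where is Lagos?")
raw_output = prompt + " Nigeria\nContext: Abuja is the capital."
short_answer = postprocess(raw_output, prompt)
```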
Results. Table 4 shows the results on the shared task validation set. Since there is a considerable domain mismatch between the XTREME-UP dataset and the shared task validation and test sets, in-domain finetuning does not improve the performance; we therefore use the baseline systems as our primary submission. Using the generative model, however, achieves a substantial improvement. Because the task was originally aimed at extractive QA, we decided to submit the generative model as a contrastive experiment.

Conclusions and Discussion
The research community long overlooked the translate-test approach until recently, when Artetxe et al. (2023) showed that it might outperform both translate-train and cross-lingual transfer with sufficiently strong machine translation systems.
With the increasing number of attempts to use large generative language models in cross-lingual setups, we speculate that the translate-test approach will become an important baseline that might not be easy to beat. Methods that work well with multilingual encoders enforce alignment of the intermediate representation (Wu and Dredze, 2020; Hämmerl et al., 2022; Pfeiffer et al., 2022, inter alia). However, in generative setups, this would lead to undesirable language mixing (Li and Murray, 2023). Generative models are also known not to be consistent across languages (Lai et al., 2023b; Wang et al., 2023). Translate-test does not suffer from either of these disadvantages.
We successfully tested the translate-test method in the shared task setup involving span-labeling tasks. We translated the input into English, performed the task using state-of-the-art English models, and projected the results back to the original language. The main technical challenge is that after labeling the spans in English, we need to find the corresponding span in the original text. For that purpose, we used an MT model specifically finetuned to preserve tags encoded as brackets. Furthermore, we finetuned the task-specific models on XTREME-UP data automatically translated into English.
Although the shared task claimed to be based on the XTREME-UP benchmark, the actual shared task data have many different characteristics. Instead of local news outlets, the NER data used Wikipedia text, often on generic topics rather than local ones. The QA validation and test data were abstractive, not extractive. Because of that, our finetuned models performed worse than the original ones. Also, generative QA using Llama 2 outperformed our original extractive system.
The final results show that building a translate-test pipeline is a viable approach to both cross-lingual NER and QA.

Limitations
Both validation and test datasets from the shared task are quite small, especially for QA, where they contain only around 100 examples per language. This might lead to an unreliable comparison between the submitted systems.
The paper does not contain experimental results that would sufficiently back stronger claims about translate-test approaches. We made decisions that appeared to lead to a good performance in the context of the shared task. However, the paper lacks ablations that would reliably show that the span projection method is the best. More importantly, this paper does not compare our results with a strong system based on cross-lingual transfer. None of the system authors speak the languages in the shared task, and none of them is particularly familiar with the culture of the respective language communities. The authors did not check the system outputs for harmful or otherwise inappropriate content.

Table 2: F1 scores on the MasakhaNER 2 dataset. The numbers in boxes denote zero-shot transfer between African languages (i.e., languages that are in MasakhaNER 2 but not in MasakhaNER 1). Bold numbers are results where our approach is better than the zero-shot transfer between African languages.

Table 3: Results on the shared task validation data. The average does not include Yoruba.

Table 4: Question answering results on the shared task validation data (chrF).