Transfer-Free Data-Efficient Multilingual Slot Labeling



Introduction and Motivation
Slot labeling (SL) is a crucial natural language understanding (NLU) component of task-oriented dialogue (TOD) systems (Tur and De Mori, 2011). It aims to identify slot values in a user utterance and fill the slots with the identified values. For instance, given the user utterance "Tickets from Chicago to Milan for tomorrow", the airline booking system should match the values "Chicago", "Milan", and "tomorrow" with the slots departure_city, arrival_city, and date, respectively.
Building TOD systems which support new domains, tasks, and also languages is challenging, expensive, and time-consuming: it requires large annotated datasets for model training and development, and such data are scarce for many domains, tasks, and, most importantly, languages (Razumovskaia et al., 2022a). The current approach to mitigating the issue is standard cross-lingual transfer. The main 'transfer' assumption is that a suitable large English annotated dataset is always available for a particular task and domain: (i) the systems are trained on the English data and then directly deployed to the target language (i.e., zero-shot transfer), or (ii) further adapted to the target language relying on a small set of target-language examples (Xu et al., 2020; Razumovskaia et al., 2022b) combined with the large English dataset (i.e., few-shot transfer). However, this assumption is often unrealistic in the context of TOD due to the large number of potential tasks and domains that TOD systems should support (Casanueva et al., 2022). Furthermore, the standard assumption implicitly grounds any progress on TOD in other languages in English, hindering system construction initiatives focused directly on the target languages (Ruder et al., 2022).
Therefore, in this work we depart from this often unrealistic assumption and focus on transfer-free scenarios for SL instead. Here, the system should learn the task in a particular domain directly from limited resources in the target language, assuming that English data cannot be guaranteed. This setup naturally calls for a versatile, multilingual, data-efficient method that leverages scarce annotated data as effectively as possible, and it should thus be especially applicable to low-resource languages (Joshi et al., 2020).
Putting this challenging setup into focus, we thus propose a novel two-stage slot-labeling approach, dubbed TWOSL. TWOSL recasts the SL task into a span classification task across its two respective stages. In Stage 1, a multilingual general-purpose sentence encoder is fine-tuned via contrastive learning (CL), tailoring the CL objective towards SL-based span classification; the main assumption is that phrases with the same slot type should obtain similar representations in the specialised encoder space. CL allows for a more efficient use of scarce training resources (Fang et al., 2020; Su et al., 2021; Rethmeier and Augenstein, 2021). Foreshadowing, it manages to separate the now-specialised SL-based encoder space into slot-type-specialised subspaces, as illustrated later in Figure 2. These SL-aware encodings are more interpretable and allow for easier classification into slot types in Stage 2, using simple MLP classifiers.
We evaluate TWOSL in transfer-free scenarios on two standard multilingual SL benchmarks: MultiATIS++ (Xu et al., 2020) and xSID (van der Goot et al., 2021), which in combination cover 13 typologically diverse target languages. Our results indicate that TWOSL yields large and consistent improvements 1) across different languages, 2) with different training set sizes, and 3) with different input multilingual encoders. The gains are especially large in extremely low-resource setups. For instance, on MultiATIS++ with only 200 training examples in the target languages, we observe an improvement in average F1 score from 49.1 without TWOSL to 66.8 with TWOSL, relying on the same multilingual sentence encoder. Similar gains are observed on xSID, and also with other training set sizes. We also report large gains over fine-tuning XLM-R for SL framed as the standard token classification task (e.g., from 50.6 to 66.8 on MultiATIS++ and from 43.0 to 52.6 on xSID with 200 examples), validating our decision to recast the task in TWOSL as span classification.
In summary, the results suggest the benefits of TWOSL for transfer-free multilingual slot labeling, especially in low-resource setups where only a few dozen examples are available in the target language: this holds promise to quicken SL development cycles in future work. The results also demonstrate that multilingual sentence encoders can be transformed into effective span encoders through contrastive learning with a handful of examples. The CL procedure in TWOSL exposes their phrase-level semantic 'knowledge' (Liu et al., 2021; Vulić et al., 2022). In general, we hope that this work will inspire and pave the way for further research on the challenging transfer-free few-shot setups for multilingual SL as well as for other NLP tasks. The code for TWOSL will be available online.

Related Work
Multilingual Slot Labeling. Recently, the SL task in multilingual contexts has largely benefited from the development of multilingual pretrained language models (PLMs) such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020). These models are typically used for zero-shot or few-shot multilingual transfer (Xu et al., 2020; Krone et al., 2020; Cattan et al., 2021). The representational power of large multilingual PLMs for cross-lingual transfer has been further refined through adversarial training with latent variables (Liu et al., 2019) and multitask training (van der Goot et al., 2021).
Other effective methods for cross-lingual transfer are translation-based, where either the training data in the source language is translated into the target language, or the evaluation data is translated into the source language (translate-train and translate-test, respectively; Schuster et al. (2019); Razumovskaia et al. (2022a)). The issues with these methods for SL are twofold. First, the translations might be of lower quality for low-resource languages, or for any language pair where large parallel datasets are lacking. Second, they involve the crucial label-projection step, which aligns the words in the translated utterances with the words in the source language. Therefore, (i) applying translation-based methods to sequence labeling tasks such as SL is not straightforward (Ponti et al., 2021), (ii) it increases the number of potential accumulated errors (Fei et al., 2020), and (iii) it requires powerful word alignment tools (Dou and Neubig, 2021).
Several methods have been proposed to mitigate the issues arising from the label-projection step. Xu et al. (2020) propose to jointly train slot tagging and alignment algorithms. Gritta and Iacobacci (2021) and Gritta et al. (2022) fine-tune the models for post-alignment, i.e., explicitly aligning the source and translated data for better cross-lingual dialogue NLU. These approaches still rely on the availability of parallel corpora, which is not guaranteed for low-resource languages. Thus, alternative approaches using code-switching (Qin et al., 2020; Krishnan et al., 2021) were proposed. All of the above methods assume the availability of an 'aid' for cross-lingual transfer such as a translation model or a bilingual lexicon; more importantly, they assume the existence of readily available task-annotated data in the source language.
Data-Efficient Methods for Slot Labeling. One approach to improving few-shot generalisation in TOD systems is to pretrain the models in a way that is specifically tailored to conversational tasks. For instance, ConVEx (Henderson and Vulić, 2021) fine-tunes only a subset of decoding layers on conversational data. QANLU (Namazifar et al., 2021) and QASL (Fuisz et al., 2022) use question answering for data-efficient slot labeling in monolingual English-only setups, answering questions based on reduced training data.

Methodology
Preliminaries. We assume a set of N_s slot types S = {SL_1, ..., SL_Ns} associated with an SL task. Each word token in the input sentence/sequence s = w_1, w_2, ..., w_n should be assigned a slot label y_i, where we assume a standard BIO tagging scheme for sequence labeling (i.e., the labels are O, B-SL_1, I-SL_1, ..., I-SL_Ns). We also assume that M SL-annotated sentences are available in the target language as the only supervision signal.
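To make the BIO scheme concrete, the following minimal sketch (our own illustration, not the paper's code; the helper name and the (start, end-exclusive) span encoding are assumptions) converts the slot annotations of the running airline example into BIO tags:

```python
# Illustrative helper: convert gold slot spans into a BIO tag sequence.
# spans maps slot_type -> (start, end), with end exclusive.
def bio_tags(tokens, spans):
    tags = ["O"] * len(tokens)
    for slot, (start, end) in spans.items():
        tags[start] = f"B-{slot}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{slot}"          # continuation tokens
    return tags

tokens = ["Tickets", "from", "Chicago", "to", "Milan", "for", "tomorrow"]
spans = {"departure_city": (2, 3), "arrival_city": (4, 5), "date": (6, 7)}
print(bio_tags(tokens, spans))
# → ['O', 'O', 'B-departure_city', 'O', 'B-arrival_city', 'O', 'B-date']
```

Multi-word values such as "New York" yield a B- tag followed by I- tags.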
The full two-stage TWOSL framework is illustrated in Figure 1, and we describe its two stages in what follows.

Stage 1: Contrastive Learning for Span Classification
Stage 1 is inspired by contrastive learning regimes which have proven especially effective in few-shot setups for cross-domain (Su et al., 2022; Meng et al., 2022; Ujiie et al., 2021) and cross-lingual transfer (Wang et al., 2021; Chen et al., 2022), as well as for task specialisation of general-purpose sentence encoders and PLMs for intent detection (Mehri and Eric, 2021; Vulić et al., 2021).
To the best of our knowledge, CL has not been coupled with the TOD SL task before.
Input Data Format for CL. First, we reformat the input sentences into a format suitable for CL. Given M annotated sentences, we transform each of them into triples of the following format: (s_mask, sp, L). Here, (i) s_mask is the original sentence s, but with the word tokens comprising a particular slot value masked out; (ii) sp is the slot value span masked from the original sentence; (iii) L is the actual slot type associated with the span sp. Note that L can be one of the N_s slot types from the slot set S, or a special None value denoting that sp does not capture a proper slot value. One example of such a triple is (s_mask=Ich benötige einen Flug von [MASK] [MASK] nach Chicago, sp=New York, L=departure_city). Another example is (s_mask=[MASK] mir die Preise von Boston nach Denver, sp=Zeige, L=None). Note that sp can span one or more words, as in the examples above, which effectively means masking one or more words from the original sentence. We limit the length of sp to a maximum of max_sp consecutive words.

Positive and Negative Pairs for CL. The main idea behind CL in Stage 1 is to adapt the input (multilingual) sentence encoder to the span classification task by 'teaching' it to encode sentences carrying the same slot types closer together in its CL-refined semantic space. The pair p=(s_mask, sp) is extracted from the corresponding triple, and the encoding of the pair is a concatenation of the encodings of s_mask and sp, encoded separately by the sentence encoder. CL proceeds in a standard fashion, relying on sets of positive and negative CL pairs. A positive pair (strictly speaking, a pair of pairs) is one where two pairs p_i and p_j carry the same label L in their corresponding triples, but only if L ≠ None. A negative pair is one where two pairs p_i and p_j carry different labels L_i and L_j, where at least one of the labels is not None.
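As a concrete sketch of this reformatting step, the snippet below enumerates all subspans of an annotated sentence up to max_sp words and emits one (s_mask, sp, L) triple per span; the function and its data structures are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the Stage 1 data reformatting: each annotated sentence
# yields (s_mask, sp, L) triples, one per candidate span. Spans that
# cover a gold slot value get its slot type; all other subspans up to
# MAX_SP words get the special label None.
MAX_SP = 5  # maximum span length in words (see the hyperparameters)

def make_triples(tokens, gold):
    """gold maps (start, end) -> slot_type, with end exclusive."""
    triples = []
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, min(start + MAX_SP, n) + 1):
            sp = " ".join(tokens[start:end])
            masked = tokens[:start] + ["[MASK]"] * (end - start) + tokens[end:]
            label = gold.get((start, end))  # None unless a gold slot span
            triples.append((" ".join(masked), sp, label))
    return triples

tokens = ["Zeige", "mir", "die", "Preise", "von", "Boston", "nach", "Denver"]
gold = {(5, 6): "departure_city", (7, 8): "arrival_city"}
triples = make_triples(tokens, gold)
# contains e.g. ('Zeige mir die Preise von [MASK] nach Denver',
#                'Boston', 'departure_city')
```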
Following prior CL work (Vulić et al., 2021), each positive pair (p_i, p_j) is associated with 2K negative pairs, where we randomly sample K negatives for p_i and K negatives for p_j. For the special, most efficient CL setup where the ratio of positive to negative pairs is 1:1, we first randomly sample one item from the positive pair (p_i, p_j), and then randomly sample a single negative for the sampled p_i or p_j.
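The pair construction can be sketched as follows; the sampler below is our own assumption about one reasonable realisation, not the paper's exact code:

```python
# Sketch of contrastive pair construction: every two items sharing a
# non-None slot type form a positive pair; for each positive pair we
# sample K negatives for each of its two members (2K in total). Since
# the anchor's label is never None for a positive pair, any partner
# with a different label satisfies "at least one label is not None".
import random
from itertools import combinations

def build_cl_pairs(items, K=1, seed=0):
    """items: list of ((s_mask, sp), label) tuples."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for (a, la), (b, lb) in combinations(items, 2):
        if la == lb and la is not None:
            positives.append((a, b))
            for anchor, anchor_label in ((a, la), (b, lb)):
                pool = [x for x, lx in items if lx != anchor_label]
                for _ in range(K):
                    negatives.append((anchor, rng.choice(pool)))
    return positives, negatives

items = [(("s1", "Chicago"), "departure_city"),
         (("s2", "Boston"), "departure_city"),
         (("s3", "Milan"), "arrival_city"),
         (("s4", "Zeige"), None)]
pos, neg = build_cl_pairs(items, K=1)
# one positive pair, and 2K = 2 negatives associated with it
```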
Online Contrastive Loss. Fine-tuning the input sentence encoder with the positive and negative pairs proceeds via a standard online contrastive loss. Like the original contrastive loss (Chopra et al., 2005), it aims at 1) reducing the semantic distance, formulated as cosine distance, between representations of examples forming positive pairs, and 2) increasing the distance between representations of examples forming negative pairs. The online version of the loss, which typically outperforms its standard variant (Reimers and Gurevych, 2019), focuses only on hard positive and hard negative examples: positive examples whose distance is higher than the margin m, and negative examples whose distance is below m.
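A minimal numeric sketch of this loss, with plain Python lists standing in for encoder outputs: only "hard" examples contribute, as described above. This toy formulation follows the paper's description and the contrastive loss of Chopra et al. (2005); it is an assumption, not the exact sentence-transformers implementation:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def online_contrastive_loss(pos_pairs, neg_pairs, m=0.5):
    # hard positives: too far apart (distance above the margin m)
    hard_pos = [cosine_distance(u, v) for u, v in pos_pairs
                if cosine_distance(u, v) > m]
    # hard negatives: too close together (distance below m)
    hard_neg = [cosine_distance(u, v) for u, v in neg_pairs
                if cosine_distance(u, v) < m]
    loss = sum(d ** 2 for d in hard_pos)                 # pull positives in
    loss += sum(max(0.0, m - d) ** 2 for d in hard_neg)  # push negatives out
    return loss

# A hard negative pair of identical vectors incurs m^2 = 0.25 loss:
print(online_contrastive_loss([([1.0, 0.0], [1.0, 0.0])],
                              [([1.0, 0.0], [1.0, 0.0])]))
# → 0.25
```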

Stage 2: Span Identification and Classification
The aim of Stage 2 is to identify and label the slot spans, relying on the embeddings produced by the encoders fine-tuned in Stage 1. To identify the slot spans, we must consider every possible subspan of the input sentence, which might slow down inference. Therefore, to boost inference speed, we divide Stage 2 into two steps.
In Step 1, we perform simple binary classification, aiming to detect whether a certain span is a slot value for any slot type from S. Effectively, for the input pair (s_mask, sp) the binary classifier returns 1 (i.e., 'sp is some slot value') or 0. The 0-examples for training are all subspans of the sentences which are not associated with any slot type from S.
Step 2 is a multi-class span classification task, where we aim to predict the actual slot type from S for the input pair (s_mask, sp). The binary filtering in Step 1 allows us to remove all input pairs for which the Step 1 prediction is 0; we thus assign slot types only to the 1-predictions from Step 1. Put simply, Step 1 predicts whether a span covers any proper slot value, while Step 2 maps the slot value to the actual slot type. We can proceed directly with Step 2 without Step 1, but the training data then also has to contain all the examples with spans where L=None; see Figure 1 again.
The classifiers in both steps are implemented as simple multi-layer perceptrons (MLP), and the input representation in both steps is the concatenation of the respective encodings for s mask and sp.
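The two-step inference over candidate spans can be sketched as follows; the two classifier callables are stand-ins (assumptions) for the trained MLPs, and the span enumeration mirrors the data format from Stage 1:

```python
# Sketch of Stage 2 inference: enumerate subspans up to MAX_SP words,
# keep only those the cheap binary Step 1 classifier accepts, then
# assign a slot type with the multi-class Step 2 classifier.
MAX_SP = 5

def label_spans(tokens, is_slot_value, slot_type):
    """is_slot_value(s_mask, sp) -> bool; slot_type(s_mask, sp) -> str."""
    predictions = []
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, min(start + MAX_SP, n) + 1):
            sp = " ".join(tokens[start:end])
            s_mask = " ".join(tokens[:start]
                              + ["[MASK]"] * (end - start)
                              + tokens[end:])
            if is_slot_value(s_mask, sp):              # Step 1: binary filter
                predictions.append((start, end, slot_type(s_mask, sp)))  # Step 2
    return predictions

# Toy classifiers standing in for the trained MLPs:
tokens = ["Tickets", "from", "Chicago", "to", "Milan"]
preds = label_spans(tokens,
                    is_slot_value=lambda s, sp: sp == "Chicago",
                    slot_type=lambda s, sp: "departure_city")
# → [(2, 3, 'departure_city')]
```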

Experimental Setup
Training Setup and Data. The standard few-shot setup in multilingual contexts (Razumovskaia et al., 2022a; Xu et al., 2020) assumes the availability of a large annotated task-specific dataset in English, and a handful of labeled examples in the target language. However, as discussed in §1, this assumption might not always hold: English data might not be available for many target-language-specific domains, especially since annotation for the SL task is considered more complex than for intent detection (van der Goot et al., 2021; Xu et al., 2020; FitzGerald et al., 2022). We thus focus on training and evaluation in these challenging transfer-free setups.
We run experiments on two standard multilingual SL datasets, simulating the transfer-free setups: MultiATIS++ (Xu et al., 2020) and xSID (van der Goot et al., 2021). Their data statistics are provided in Table 1, with language codes in Appendix A. For the low-resource scenarios, we randomly sample M annotated sentences from the full training data. Since xSID was originally intended only for testing zero-shot cross-lingual transfer, we use its limited dev set for (sampling) training instances. A current limitation of TWOSL is that it leans on whitespace-based word token boundaries in the sentences; therefore, in this work we focus on a subset of languages with that property, leaving further adaptation to other languages for future work.
Input Sentence Encoders. We experiment with both multilingual sentence encoders and general multilingual PLMs in order to (i) demonstrate the effectiveness of TWOSL irrespective of the underlying encoder, and (ii) study the effect of the pretraining task on the final performance. 1) XLM-R (Conneau et al., 2020) is a multilingual PLM, pretrained on a large multilingual dataset covering 100 languages via masked language modeling. 2) Multilingual mpnet (Song et al., 2020) is pretrained for paraphrase identification in over 50 languages; the model was specifically pretrained in a contrastive fashion to effectively encode sentences. 3) We also run a subset of experiments with another state-of-the-art multilingual sentence encoder, LaBSE (Feng et al., 2022), to further verify that TWOSL can be disentangled from the actual encoder. All models are used in their 'base' variants, with 12 hidden layers, encoding sequences into 768-dimensional vectors. This means that the actual encodings of (s_mask, sp) pairs, which are fed to the MLPs in Stage 2, are 1,536-dimensional; see §3.
Hyperparameters and Optimisation. We rely on the sentence-transformers (SBERT) library (Reimers and Gurevych, 2019, 2020) for model checkpoints and contrastive learning in Stage 1. The models are fine-tuned for 10 epochs with a batch size of 32 using the default hyperparameters in SBERT: e.g., the margin in the contrastive loss is fixed to m = 0.5. max_sp is fixed to 5, as even the longest slot values very rarely exceed that span length. Unless stated otherwise, K = 1, that is, the ratio of positive to negative examples is 1:2; see §3.1.
In Stage 2, we train binary and multi-class MLPs with hidden layers of sizes [2500, 1500] and [3600, 2400, 800], respectively, with ReLU as the non-linear activation. The Step 1 binary classifier is trained for 30 epochs, while the Step 2 MLP is trained for 100 epochs. The goal in Step 1 is to ensure high recall (i.e., to avoid overly aggressive filtering), which is why we opt for the earlier stopping. As a baseline, we fine-tune XLM-R for token classification, the standard SL task format (Xu et al., 2020; Razumovskaia et al., 2022b). Detailed training hyperparameters are provided in Appendix B.
The results are averages across 5 random seeds.
Evaluation Metric. For direct comparability with standard token classification approaches, we rely on token-level micro-F1 as the evaluation metric. For TWOSL, this necessitates the reconstruction of the BIO-labeled sequence Y from the predictions for the (s_mask, sp, L_pred) tuples. For every sentence s, we first identify all tuples (s_mask, sp, L_pred) associated with s such that the predicted slot type L_pred ≠ None. In Y, the position where sp starts is filled with B-L_pred, complemented with the corresponding number of I-L_pred labels if the length of sp > 1. The remaining positions are set to the O label.

The aim of CL in Stage 1 is exactly to make the representations associated with a particular slot type cluster into coherent groups, and to offer a clearer separation of encodings across slot types. As shown previously for intent detection (Vulić et al., 2021), such well-separated groups in the encoding space can facilitate learning classifiers on top of the fine-tuned encoders. The t-SNE plots (van der Maaten and Hinton, 2012) in Figure 2, which show the mpnet-based encodings before and after Stage 1, reveal exactly this effect: the non-tuned mpnet encoder already provides some separation of encodings into slot-type-based clusters (Figure 2a), but the groupings become considerably more coherent after CL-tuning (Figure 2b). In sum, this qualitative analysis already suggests the effectiveness of CL for creating customised span-classification-oriented encodings that support the SL task. We note that the same observations hold for all other languages as well as for all other (data-leaner) training setups.
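The reconstruction step can be sketched as follows; span offsets are passed in directly for simplicity (in practice they are recovered from the mask positions in s_mask), and the token-level micro-F1 shown is one plausible reading of the metric, an assumption rather than the authors' evaluation script:

```python
# Sketch of BIO reconstruction for evaluation: predicted spans with a
# non-None slot type are written back into a tag sequence Y, and
# token-level micro-F1 is computed over the non-O tags.
def reconstruct_bio(n_tokens, predictions):
    """predictions: list of (start, end, slot_type), end exclusive,
    already filtered so that slot_type is never None."""
    tags = ["O"] * n_tokens
    for start, end, slot in predictions:
        tags[start] = f"B-{slot}"
        for i in range(start + 1, end):
            tags[i] = f"I-{slot}"
    return tags

def micro_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

tags = reconstruct_bio(5, [(2, 3, "departure_city")])
# → ['O', 'O', 'B-departure_city', 'O', 'O']
```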

Results and Discussion
Main Results. The results on xSID and MultiATIS++ are summarised in Table 2 and Figure 3, respectively. The scores underline three important trends. First, TWOSL is much more powerful than the standard PLM-based (i.e., XLM-R-based) token classification approach in very low-data setups, when only a handful (e.g., 50-200) of annotated examples in the target language are available as the only supervision. Second, running TWOSL on top of a general-purpose multilingual encoder such as mpnet yields large and consistent gains, clearly visible across different target languages in both datasets and across different data setups. Third, while the token classification approach closes some of the performance gap as more annotated data become available (e.g., cf. Figure 5a with 800 examples), TWOSL remains the peak-performing approach in general.
A finer-grained inspection of the scores further reveals that in low-data setups, even when exactly the same model is used as the underlying encoder (i.e., XLM-R), TWOSL offers large benefits over token classification with full XLM-R fine-tuning; see Table 2. The scores also suggest that the gap between TWOSL and the baselines increases as the amount of annotated data decreases. The largest absolute and relative gains occur in the 50-example setup, followed by the 100-example setup, and so on: e.g., on xSID, the average gain is +9.5 F1 points with 200 training examples, and reaches up to +35.3 F1 points with 50 examples. This finding corroborates the power of CL especially for such low-data setups. Finally, the results in Table 2 also hint that TWOSL works well with different encoders: it improves both mpnet and XLM-R as the underlying multilingual sentence encoders.

Ablations and Further Analyses
TWOSL in Standard Few-Shot Setups. TWOSL has been designed with a primary focus on transfer-free, extremely low-data setups. However, another natural question concerns its applicability and effectiveness in the standard few-shot transfer setups, where we assume that a large annotated dataset for the same task and domain is available in the source language: English. To this end, we run several experiments on MultiATIS++, with German and French as target languages, where we first fine-tune the model on the full English training data before running another fine-tuning step on the target-language examples (Lauscher et al.). Overall, the results in Table 3 demonstrate that TWOSL maintains competitive performance, although the token classification approach with XLM-R is the stronger method overall in this setup. TWOSL is more competitive for French as the target language. The importance of CL in Stage 1 for TWOSL is pronounced also in this more data-abundant setup. We leave further exploration and adaptation of TWOSL to transfer setups for future work.

Impact of Binary Filtering in Stage 2. To understand the benefit of Step 1 (i.e., binary filtering) in Stage 2, we compare the performance and inference time with and without that step. We focus on the xSID dataset in the 200-example setup. The scores, summarised in Table 9 in Appendix E, demonstrate largely on-par performance between the two variants. The main benefit of Step 1 is thus its reduction of inference time, as reported in Figure 4, where inference was carried out on a single NVIDIA Titan XP 12GB GPU. The filtering step, which relies on a more compact and thus quicker classifier, greatly reduces the number of examples that have to undergo the final, more expensive slot type prediction (without filtering, all subspans of the user utterance must be processed), without harming the final performance.
Different Multilingual Encoders. The results in Table 2 have already validated that TWOSL offers gains regardless of the chosen multilingual encoder (e.g., XLM-R versus mpnet). However, the effectiveness of TWOSL in terms of absolute scores naturally depends on the underlying multilingual capabilities of the original encoder. We thus further analyse how performance changes in the same setups with different encoders. We compare XLM-R-Sent (i.e., XLM-R used as a sentence encoder, mean-pooling all subword embeddings), mpnet, and LaBSE on a representative set of 7 target languages from xSID. In the majority of the experimental runs, LaBSE with TWOSL yields the highest absolute scores. This comes as no surprise, as LaBSE was specifically customised to improve sentence encodings for low-resource languages and in low-resource setups (Feng et al., 2022). Interestingly, XLM-R performs best in the 'lowest-data' 50-example setup: we speculate this might be due to its smaller model size, which makes it harder to overfit in extremely low-resource setups. Finally, the scores again verify the benefit of TWOSL when applied to any underlying encoder.
Number of Negative Pairs. The ratio of positive to negative examples, controlled by the hyperparameter K, has a small impact on overall performance, as shown in Figure 5. We observe slight performance gains when moving from 1 negative example to 2 (cf. the 50-example setup for German in MultiATIS++ or Arabic in xSID). In such cases, the increase in the number of negative pairs can act as data augmentation for the extreme low-resource scenarios. This hyperparameter also affects the trade-off between training time and the stability of results. With fewer negative examples, training is quicker, but the performance is less stable: e.g., in the 50-example setup for German in MultiATIS++, the standard deviation is σ = 7.45, σ = 2.36, and σ = 3.21 with 1, 2, and 4 negatives per positive, respectively. Therefore, as stated in §4, we rely on the setup with 2 negatives per positive in our experiments, as a good trade-off between efficiency and stability.

Conclusion and Future Work
We proposed TWOSL, a two-stage slot labeling approach which turns multilingual sentence encoders into slot labelers for task-oriented dialogue (TOD), and which proves especially effective for slot labeling in low-resource setups and languages. TWOSL was developed with a focus on transfer-free few-shot multilingual setups, where sufficient English-language annotated data are not readily available to enable standard cross-lingual transfer approaches.
In other words, the method has been created for bootstrapping a slot labeling system in a new language and/or domain when only a small set of annotated examples is available. TWOSL first converts multilingual sentence encoders into task-specific span encoders via contrastive learning. It then casts slot labeling as a span classification task supported by the fine-tuned encoders from the previous stage. The method was evaluated on two standard multilingual TOD datasets, where we validated its strong performance across diverse languages and different training data setups. Due to its multi-component nature, a spectrum of extensions focused on its constituent components is possible in future work, including other formulations of contrastive learning, tuning the models multilingually, mining (non-random) negative pairs, and cross-domain transfer learning for TOD (Majewska et al., 2022). In the long run, we plan to use the method for large-scale fine-tuning of sentence encoders to turn them into universal span encoders which can then be used for sequence labeling tasks across languages and domains. Experiments with TWOSL can also be run on other, non-TOD sequence labeling tasks (e.g., NER) for which evaluation data exist for truly low-resource languages, e.g., African languages (Adelani et al., 2021).

Limitations
TWOSL relies on whitespace-based word boundaries. Thus, it is only applicable to languages which use spaces as word boundaries. We plan to extend and adapt the method to other languages, without this property, in subsequent work. Additionally, the approach has only been tested on languages which the large multilingual PLMs have seen during pretraining. We plan to test the same approach on unseen languages in future work.
As mentioned in §6, we opted for representative multilingual sentence encoders and for components of contrastive learning that were proven to work well for other tasks in prior work (Reimers and Gurevych, 2020; Vulić et al., 2021), e.g., the choice of the contrastive loss and the adopted hyperparameters. A wider exploration of different setups and regimes in TWOSL's Stage 1 and Stage 2 might further improve performance and offer additional low-level insights.
The scope of our multilingual evaluation is also constrained by the current availability of multilingual evaluation resources for TOD NLU tasks.
Finally, in order to unify the experimental protocol across different languages, and for more comprehensive coverage and cross-language comparability, we relied on multilingual encoders throughout the work. However, we stress that for the transfer-free scenarios, TWOSL is equally applicable to monolingual encoders for the respective target languages, when such models exist.

Figure 1: Illustration of the proposed TWOSL framework, which turns general-purpose multilingual sentence encoders into efficient slot labelers via two stages. Stage 1: contrastive learning tailored towards encoding sub-sentence spans. Stage 2: slot classification in two steps, binary slot-span identification/filtering (Step 1, aiming to answer the question 'Is this span a value for any of the slot types?') and multi-class span-type classification (Step 2, aiming to answer the question 'What class is this span associated with?'). Ablation variants include: a) using off-the-shelf multilingual sentence encoders in Stage 2 without their CL-based fine-tuning in Stage 1; b) directly classifying slot spans without the binary filtering step (i.e., without Step 1 in Stage 2).
These approaches allow for efficient use of available annotated data. Further, Qin et al. (2022) and Liang et al. (2022) use code-switched examples on the slot-value, sentence, and word levels to improve performance on intent classification and slot labeling in a zero-shot setting. Unlike prior work, TWOSL focuses on adapting multilingual sentence encoders to the SL task in transfer-free low-data setups.

Figure 2: t-SNE plots (van der Maaten and Hinton, 2012) for annotated German examples from MultiATIS++'s test set. We show the examples for a subset of 8 slot types, demonstrating the effect of CL on the final encodings. The encodings were created using (a) the original mpnet encoder before Stage 1 CL, and (b) mpnet after CL-tuning in Stage 1. 800 annotated training examples were used for CL, with K = 1.

Figure 3: Slot F1 scores on MultiATIS++ across different languages and data setups. The exact numerical scores are available in Appendix C.

Figure 4: Inference time per language on xSID, with and without the binary filtering Step 1 in Stage 2.

Figure 5: Impact of the number of negative examples per positive example for the 50-example and 100-example setups. For clarity, the results are shown for a subset of languages in MultiATIS++ and xSID; similar trends are observed for the other languages.

Table 1: Data statistics of the MultiATIS++ (Xu et al., 2020) and xSID (van der Goot et al., 2021) datasets used in the experiments.

Table 2: Results on xSID's test set using a subsample of 50, 100, and 200 examples from its validation portion for training. Micro-F1 scores are reported. XLM-R refers to using the XLM-R PLM for standard token classification fine-tuning for SL. XLM-R-Sent denotes using XLM-R directly as a sentence encoder, in the same fashion as mpnet. We provide standard deviations for results with mpnet as the sentence encoder in Appendix D, demonstrating the statistical significance of the improvements provided by contrastive learning.

Table 4: Results on xSID for a sample of languages with different sentence encoders. XLM-R-Sent denotes using XLM-R as a standard sentence encoder.