Multilingual Relation Classification via Efficient and Effective Prompting

Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R_EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.


Introduction
Relation classification (RC) is a crucial task in information extraction (IE), aiming to identify the relation between entities in a text (Alt et al., 2019). Extending RC to multilingual settings has recently received increased interest (Zou et al., 2018; Kolluru et al., 2022), but the majority of prior work still focuses on English (Baldini Soares et al., 2019; Lyu and Chen, 2021). A main bottleneck for multilingual RC is the lack of supervised resources comparable in size to large English datasets (Riedel et al., 2010; Zhang et al., 2017). The SMiLER dataset (Seganti et al., 2021) provides a starting point to test fully supervised and more efficient approaches, due to the different resource availability across languages.
Previous studies have shown the promising performance of prompting PLMs compared to data-hungry fine-tuning, especially in low-resource scenarios (Gao et al., 2021; Le Scao and Rush, 2021; Lu et al., 2022). Multilingual pre-trained language models (Conneau et al., 2020; Xue et al., 2021) further enable multiple languages to be represented in a shared semantic space, thus making prompting in multilingual scenarios feasible. However, the study of prompting for multilingual tasks so far remains limited to a small range of tasks such as text classification (Winata et al., 2021) and natural language inference (Lin et al., 2022). To our knowledge, the effectiveness of prompt-based methods for multilingual RC is still unexplored.
To analyse this gap, we pose two research questions for multilingual RC with prompts: RQ1. What is the most effective way to prompt? We investigate whether prompting should be done in English or the target language, and whether to use soft prompt tokens. RQ2. How well do prompts perform in different data regimes and languages? We investigate the effectiveness of our prompting approach in three scenarios: fully supervised, few-shot and zero-shot. We explore to what extent the results are related to the available language resources.
We present an efficient and effective prompt method for multilingual RC (see Figure 1) that derives prompts from relation triples (see Section 3.1). The derived prompts include the original sentence and entities, and are to be filled with the relation label. We evaluate the prompts with three variants, two of which require no translation, and one of which requires minimal translation, i.e., of the relation labels only. We find that our method outperforms fine-tuning and a strong task-agnostic prompt baseline in fully supervised and few-shot scenarios, especially for relatively low-resource languages. Our method also improves over the random baseline in zero-shot settings, and achieves promising cross-lingual performance. The main contributions of this work hence are: • We propose a simple but efficient prompt method for multilingual RC, which is, to the best of our knowledge, the first work to apply prompt-based methods to multilingual RC (Section 3).
• We evaluate our method on the largest multilingual RC dataset, SMiLER (Seganti et al., 2021), and compare our method with strong baselines in all three scenarios. We also investigate the effects of different prompt variants, including the insertion of soft tokens, the prompt language, and the word order of prompting (Sections 4 & 5).

Preliminaries
We first give a formal definition of the relation classification task, and then introduce fine-tuning and prompting paradigms to perform RC.

Relation Classification Task Definition
Relation classification is the task of classifying the relationship, such as date_of_birth, founded_by or parents, between pairs of entities in a given context. Formally, given a relation set R and a text x = [x_1, x_2, ..., x_n] (where x_1, ..., x_n are tokens) with two disjoint spans e_h and e_t denoting the head and tail entity, RC aims to predict the relation r ∈ R between e_h and e_t, or give a no_relation prediction if no relation in R holds. RC is a multilingual task if the token sequences come from different languages.
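To make the task definition concrete, here is a minimal sketch of how one RC instance could be represented; the field names and the example label are illustrative, not SMiLER's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RCInstance:
    """One relation classification example (illustrative layout)."""
    tokens: list    # the text x = [x_1, ..., x_n]
    head: tuple     # (start, end) token span of the head entity e_h
    tail: tuple     # (start, end) token span of the tail entity e_t
    relation: str   # gold label r in R, or "no_relation"
    lang: str       # language code of the token sequence, e.g. "de"

    def head_text(self) -> str:
        return " ".join(self.tokens[self.head[0]:self.head[1]])

    def tail_text(self) -> str:
        return " ".join(self.tokens[self.tail[0]:self.tail[1]])

ex = RCInstance(
    tokens="Angela Merkel 's husband is Joachim Sauer .".split(),
    head=(0, 2), tail=(5, 7), relation="has-spouse", lang="en",
)
```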

Fine-tuning for Relation Classification
In fine-tuning, a task-specific linear classifier is added on top of the PLM. Fine-tuning hence introduces a different scenario from pre-training, since language model (LM) pre-training is usually formalized as a cloze-style task to predict target tokens at [MASK] (Devlin et al., 2019; Liu et al., 2019) or a corrupted span (Raffel et al., 2020; Lewis et al., 2020). For the RC task, the classifier aims to predict the target class r at [CLS] or at the entity spans denoted by MARKER (Baldini Soares et al., 2019).

Prompting for Relation Classification
Prompting is proposed to bridge the gap between pre-training and fine-tuning (Liu et al., 2022; Gu et al., 2022). The essence of prompting is to reformulate the downstream task as an LM pre-training task, such as masked language modeling (MLM), by appending extra text to the original text according to a task-specific template T(·), and to apply the same training objective during task-specific training. For the RC task, to identify the relation between "Angela Merkel" and "Joachim Sauer" in the text "Angela Merkel's current husband is quantum chemist Joachim Sauer," an intuitive template for prompting can be "The relation between Angela Merkel and Joachim Sauer is [MASK]," and the LM is supposed to assign a higher likelihood to the term couple than to, e.g., friends or colleagues at [MASK]. This "fill-in-the-blank" paradigm is well aligned with the pre-training scenario, and enables prompting to better elicit pre-trained knowledge from PLMs (Petroni et al., 2019).

Methods
We now present our method, as shown in Figure 1. We introduce its template and verbalizer, and propose several variants of the prompt. Lastly, we explain the training and inference process.

Template
For prompting (Liu et al., 2022), a prompt often consists of a template T(·) and a verbalizer V. Given a plain text x, the template T adds task-related instruction to x to yield the prompt input

x_prompt = T(x). (1)

Following Chen et al. (2022) and Han et al. (2021), we treat relations as predicates and use the cloze "e_h {relation} e_t" for the LM to fill in. Our template is formulated as

T(x) := "x. e_h ____ e_t". (2)

In the template T(x), x is the original text, and the two entities e_h and e_t come from x. Therefore, our template does not introduce extra tokens and thus involves no translation at all.
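A minimal sketch of the template in Eq. (2), assuming plain strings for the sentence and entities; note that only the original text and its two entities are reused, so nothing needs translating:

```python
def template(text: str, head: str, tail: str, blank: str = "____") -> str:
    """T(x) = "x. e_h ____ e_t": append the cloze to the original text.
    rstrip('.') avoids a doubled period when the text already ends in one."""
    return f"{text.rstrip('.')}. {head} {blank} {tail}"

prompt = template(
    "Angela Merkel's current husband is quantum chemist Joachim Sauer",
    "Angela Merkel", "Joachim Sauer",
)
```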

Verbalizer
After being prompted with x_prompt, the PLM M predicts the masked text y at the blank. To complete an NLP classification task, a verbalizer φ is required to bridge the set of labels Y and the set of predicted texts (verbalizations V). For the simplicity of our prompt, we use the one-to-one verbalizer

φ : R → V, r ↦ φ(r),

where r is a relation and φ(r) is its simple verbalization. φ(·) normally only involves splitting r by "-" or "_" and replacing abbreviations such as org with organization. E.g., the relation org-has-member corresponds to the verbalization "organization has member". The prediction is then formalized as

p(r | x) = p(φ(r) | x_prompt; θ_M) / Σ_{r' ∈ R} p(φ(r') | x_prompt; θ_M),

where θ_M denotes the parameters of model M, i.e., p(r|x) is normalized by the likelihood sum over all relations.
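The verbalizer and the normalization over relations can be sketched as follows; the abbreviation table is an illustrative subset, not the full mapping used in the paper:

```python
ABBREV = {"org": "organization", "loc": "location"}  # illustrative subset

def verbalize(relation: str) -> str:
    """phi(r): split on "-"/"_" and expand abbreviations,
    e.g. "org-has-member" -> "organization has member"."""
    parts = relation.replace("_", "-").split("-")
    return " ".join(ABBREV.get(p, p) for p in parts)

def predict(likelihoods: dict) -> dict:
    """Normalize per-relation likelihoods p(phi(r) | x_prompt)
    by the likelihood sum over all relations r in R."""
    total = sum(likelihoods.values())
    return {r: v / total for r, v in likelihoods.items()}
```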

Variants
To find the optimal way to prompt, we investigate three variants as follows.
Hard prompt vs soft prompt (SP) Hard prompts (a.k.a. discrete prompts) (Liu et al., 2022) are entirely formulated in natural language. Soft prompts (a.k.a. continuous prompts) consist of learnable tokens (Lester et al., 2021) that are not contained in the PLM vocabulary. Following Han et al. (2021), we insert soft tokens before the entities and the blank, as shown for SP in Table 1.
Code-switch (CS) vs in-language (IL) Relation labels are in English across almost all RC datasets. Given a text with a blank from a non-English input language L, the recovered text is code-mixed after being completed with an English verbalization; we refer to this as code-switch prompting. It is arguably more natural for the PLM to fill in the blank in language L. Inspired by Lin et al. (2022) and Zhao and Schütze (2021), we therefore also machine-translate the English verbalizations into the other languages, yielding in-language (IL) prompting. For English, CS- and IL-prompting are equivalent, since L is English itself.
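The difference between the two target sequences can be sketched as below; the translation table is hypothetical and stands in for the machine-translated verbalizations:

```python
# Hypothetical translation table: the paper machine-translates the English
# label verbalizations; "hat Ehepartner" is an illustrative German rendering.
IL_VERBALIZATIONS = {("has-spouse", "de"): "hat Ehepartner"}

def target_sequence(relation: str, lang: str, mode: str) -> str:
    """CS: English verbalization, regardless of the input language.
    IL: verbalization translated into the input language L.
    For English, the two modes coincide."""
    english = relation.replace("-", " ").replace("_", " ")
    if mode == "CS" or lang == "en":
        return english
    return IL_VERBALIZATIONS.get((relation, lang), english)
```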
Word order of prompting For the RC task, head-relation-tail triples involve three elements. Deriving natural language prompts from them therefore requires deciding where to put the predicate (relation). In the case of SOV languages, filling in a relation that occurs between e_h and e_t seems less intuitive. To investigate whether the word order of prompting affects prediction accuracy, we swap the entities and the blank in the SVO-template "x. e_h ____ e_t" and obtain "x. e_h e_t ____" as the SOV-template.
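The word-order swap amounts to a one-line change of the template; a sketch:

```python
def template_with_order(text: str, head: str, tail: str,
                        order: str = "SVO") -> str:
    """SVO: "x. e_h ____ e_t" (blank between the entities).
    SOV: "x. e_h e_t ____" (blank after both entities)."""
    if order == "SOV":
        return f"{text}. {head} {tail} ____"
    return f"{text}. {head} ____ {tail}"
```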

Training and Inference
The training and inference setups depend on the employed model. Prompting autoencoding language models requires the verbalizations to be of fixed length, since the number of masks, which equals the verbalization length, is unknown during inference. Encoder-decoder models can by nature handle verbalizations of varying length (Han et al., 2022; Du et al., 2022). Han et al. (2021) adjust all the verbalizations in TACRED to a length of 3 to enable prompting with RoBERTa for RC. We argue that for multilingual RC, this fix is largely infeasible, because: (1) in the case of in-language prompting on SMiLER, the variance of the verbalization lengths increases from 0.68 to 1.44 after translation (see Table 2), surpassing most of the listed monolingual RC datasets (SemEval, NYT and SCIERC) and making it harder to unify the length; (2) manually adjusting the translated prompts requires manual effort per target language, making it much more expensive than adjusting only English verbalizations. Therefore, we suggest using an encoder-decoder PLM for prompting (Song et al., 2022).
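The variance argument can be illustrated with toy label sets; a whitespace tokenizer stands in for the mT5 tokenizer, and the labels are illustrative rather than SMiLER's inventory:

```python
from statistics import pvariance

def length_variance(verbalizations, tokenize=str.split):
    """Population variance of verbalization lengths; str.split is a
    whitespace stand-in for the subword tokenizer used in the paper."""
    return pvariance([len(tokenize(v)) for v in verbalizations])

# Toy label sets (not SMiLER's): translation tends to spread the
# lengths out, which is the argument against fixed-length masking.
english_labels = ["has spouse", "birth date", "organization has member"]
german_labels = ["hat Ehepartner", "Geburtsdatum", "Organisation hat Mitglied"]
```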
Training objective For an encoder-decoder PLM M, given the prompt input T(x) and the target sequence φ(r) (i.e., the label verbalization), we denote the output sequence as y. The probability of an exact-match decoding is calculated as

p(y = φ(r) | T(x); θ) = ∏_{t=1}^{|φ(r)|} p(y_t = φ_t(r) | y_{<t}, T(x); θ),

where y_t and φ_t(r) denote the t-th token of y and φ(r), respectively, and y_{<t} denotes the decoded sequence on the left. θ represents the set of all learnable parameters, including those of the PLM, θ_M, and those of the soft tokens, θ_sp, in the case of the "soft prompt" variant. Hence, the final objective over the training set X is to minimize the negative log-likelihood

L(θ) = − Σ_{(x,r) ∈ X} log p(y = φ(r) | T(x); θ).

Inference We collect the output logits of the decoder, L ∈ R^{|V|×L}, where |V| is the vocabulary size of M and L is the maximum decoding length. For each relation r ∈ R, its score is given by (Han et al., 2022)

score(r) = (1/|φ(r)|) Σ_{t=1}^{|φ(r)|} P_t[φ_t(r)],

where P_t = softmax(L_{:,t}) is obtained by looking up the t-th column of L and applying softmax at each time step t. We aggregate the probabilities by addition to also reward partial matches, instead of enforcing exact matches, and normalize the score by the verbalization length to avoid predictions favoring longer relations. Finally, we select the relation with the highest score as the prediction.
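A pure-Python sketch of this inference-time scoring, with a whitespace tokenizer and a toy vocabulary standing in for the PLM's:

```python
import math

def softmax(xs):
    # Numerically stable softmax over one decoding step's logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def score_relations(logits, verbalizations, vocab):
    """Length-normalized score per relation: at each step t, look up the
    probability of the t-th target token, sum over steps (rewarding
    partial matches), and divide by the verbalization length."""
    scores = {}
    for rel, verb in verbalizations.items():
        tokens = verb.split()  # whitespace stand-in for the PLM tokenizer
        prob_sum = 0.0
        for t, tok in enumerate(tokens):
            step_probs = softmax(logits[t])
            prob_sum += step_probs[vocab[tok]]
        scores[rel] = prob_sum / len(tokens)
    best = max(scores, key=scores.get)
    return best, scores

# Toy example: step 0 favors "has", step 1 favors "spouse".
vocab = {"has": 0, "spouse": 1, "child": 2}
logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
best, scores = score_relations(
    logits, {"spouse": "has spouse", "child": "has child"}, vocab)
```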

Experiments
We implement our experiments using the Hugging Face Transformers library (Wolf et al., 2020), Hydra (Yadan, 2019) and PyTorch (Paszke et al., 2019). We use micro-F1 as the evaluation metric, as the SMiLER paper (Seganti et al., 2021) suggests. To measure the overall performance over multiple languages, we report the macro average across languages, following Zhao and Schütze (2021) and Lin et al. (2022). We also group the languages by their available resources in both pre-training and fine-tuning datasets for additional aggregate results. Details of the dataset, the models, and the experimental setups are as follows. Further experimental details are listed in Appendix A.
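The two aggregation levels can be sketched as follows; in this simple single-label form, micro-F1 reduces to accuracy (the sketch does not exclude no_relation, which some RC evaluations do):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over all classes. With exactly one gold label
    and one prediction per instance, micro-P = micro-R = accuracy."""
    tp = sum(g == p for g, p in zip(gold, pred))
    precision = recall = tp / len(gold)
    if tp == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_average(per_language_f1: dict) -> float:
    """Unweighted mean across languages, as used for aggregate scores."""
    return sum(per_language_f1.values()) / len(per_language_f1)
```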

Dataset
We conduct an experimental evaluation of our multilingual prompt methods on the SMiLER dataset (Seganti et al., 2021). Our code is publicly available at https://github.com/DFKI-NLP/meffi-prompt for better reproducibility. Table 3 lists the main statistics of the different languages in the SMiLER dataset. Note that languages have varying numbers of relations, mostly related to how many samples are present. We do not evaluate on other datasets because the only prior multilingual RC dataset that fits our task, RELX (Köksal and Özgür, 2020), contains only 502 parallel examples in 5 languages.
Grouping of the languages We visualize the languages in Figure 2 based on the sizes of RC training data, but include the pre-training data as well, to give a more comprehensive overview of the availability of resources for each language.We divide the 14 languages into 4 groups, according to the detectable clusters in Figure 2 and language origins.

Model
For prompting, we use mT5_BASE (Xue et al., 2021), an encoder-decoder PLM with 220M parameters that supports 101 languages, including all languages in SMiLER.

XLM-R_EM To provide a fine-tuning baseline, we re-implement BERT_EM (Baldini Soares et al., 2019) with the ENTITY START variant. In this method, the top-layer representations at the starts of the two entities are concatenated for linear classification. To adapt BERT_EM to multilingual tasks, we change the PLM from BERT to a multilingual autoencoder, XLM-R_BASE (Conneau et al., 2020), and refer to this model as XLM-R_EM. XLM-R_BASE has 125M parameters.
Null prompts (Logan IV et al., 2022) To better verify the effectiveness of our method, we implement null prompts as a strong task-agnostic prompt baseline. Null prompts involve minimal prompt engineering by directly asking the LM about the relation, without giving any task instruction (see Table 1). Logan IV et al. (2022) show that null prompts surprisingly achieve on-par performance with handcrafted prompts on many tasks. For best comparability, we use the same PLM, mT5_BASE.

Fully Supervised Setup
We evaluate the performance of XLM-R EM , null prompts, and our method on each of the 14 languages, after training on the full train split from that language.The prompt input and target of null prompts and our prompts are listed in Table 1.
We employ the randomly generated seed 319 for all the evaluated methods. For XLM-R_EM, we follow Baldini Soares et al. (2019) and set the batch size to 64, the optimizer to Adam with learning rate 3 × 10^-5, and the number of epochs to 5. For null prompts and ours, we use AdamW as the optimizer with learning rate 3 × 10^-5, as Zhang et al. (2022) suggest for most sequence-to-sequence tasks, and set the number of epochs to 5 and the batch size to 16. The maximum sequence length is 256 for all methods.

Few-shot Setup
Few-shot learning is normally cast as a K-shot problem, where K labelled examples per class are available.We follow Chen et al. (2022) and Han et al. (2021), and evaluate on 8, 16 and 32 shots.
The few-shot training set D train is generated by randomly sampling K instances per relation from the training split.The test set D test is the original test split from that language.We follow Gao et al. (2021) and sample another K-shot set from the English train split as validation set D val .We tune hyperparameters on D val for the English task, and apply these to all languages.
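The K-shot sampling can be sketched as follows (the (text, relation) pair format is illustrative):

```python
import random
from collections import defaultdict

def sample_k_shot(dataset, k, seed):
    """Draw K examples per relation from a training split given as
    (text, relation) pairs; sampling is reproducible via the seed."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for example in dataset:
        by_relation[example[1]].append(example)
    few_shot = []
    for relation in sorted(by_relation):       # deterministic class order
        pool = by_relation[relation]
        few_shot.extend(rng.sample(pool, min(k, len(pool))))
    return few_shot

# Toy split: 5 examples for each of two illustrative relations.
data = [(f"text-{r}-{i}", r)
        for r in ("birth-date", "founded-by") for i in range(5)]
shots = sample_k_shot(data, 2, seed=319)
```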
We evaluate the same methods as in the fully supervised scenario, but repeat 5 runs as suggested by Gao et al. (2021), and report the mean and standard deviation of micro-F1. We use a fixed set of random seeds {13, 36, 121, 223, 319} for data generation and training across the 5 runs. For XLM-R_EM, we use the same hyperparameters as Baldini Soares et al. (2019), a batch size of 256, and a learning rate of 1 × 10^-4. For null prompts and our prompts, we set the learning rate to 3 × 10^-4, the batch size to 16, and the number of epochs to 20.

Zero-shot Setup
We consider two scenarios for zero-shot multilingual relation classification.
Zero-shot in-context learning Following Kojima et al. (2022), we investigate whether PLMs are also decent zero-shot reasoners for RC. This scenario does not require any samples or training. We test the out-of-the-box performance of the PLM by directly prompting it with x_prompt. Zero-shot in-context learning does not specify further hyperparameters, since it is training-free.
Zero-shot cross-lingual transfer In this scenario, following Krishnan et al. (2021), we fine-tune the model with in-language prompting on the English train split, and then conduct zero-shot in-context tests with this fine-tuned model on the other languages using code-switch prompting. Through this setting, we want to verify whether task-specific training in a high-resource language such as English helps in other languages. In zero-shot cross-lingual transfer, we use the same hyperparameters and random seed to fine-tune on the English task.

Results and Discussion
We first present the results in fully supervised, few-shot and zero-shot scenarios, and then discuss the main findings for answering the research questions in Section 1.

Table 4: Fully supervised results in micro-F1 (%) on the SMiLER dataset (Seganti et al., 2021) for XLM-R_EM, null prompts, and ours. EN, H, M, L: macro average across the languages within the respective group. X: macro average across all 14 languages. Our variants outperform all baselines on all group averages; XLM-R_EM has good results for many high-resource languages. Overall, in-language prompting performs best, especially for lower-resource languages.

Fully Supervised Results
All three variants of our method beat the fine-tuning baseline XLM-R_EM and the prompting baseline null prompts, according to the macro-averaged performance across the 14 languages. In-language prompting delivers the most promising result, achieving an average F1 of 85.0, which is higher than XLM-R_EM (68.2) and null prompts (66.2). The other two variants, code-switch prompting with and without soft tokens, achieve F1 scores of 84.1 and 82.7, respectively, only 0.9 and 2.3 points lower than in-language prompting. All three prompt variants are hence effective in fully supervised scenarios. On a per-group basis, we find that the lower-resourced a language is, the greater the advantage prompting enjoys over fine-tuning. In particular, in-language prompting shows better robustness than XLM-R_EM on low-resource languages. Both yield 95.9-96.0 F1 for English, but XLM-R_EM decreases to 54.3 and 3.7 F1 in Group-M and Group-L, while in-language prompting still delivers 83.5 and 65.2 F1.

Few-shot Results
Table 5 presents the per-group results of the few-shot experiments. All methods benefit from a larger K. In-language prompting again turns out to be the best contender, performing 1st in the 8- and 32-shot settings and 2nd in the 16-shot setting. We see that in-language prompting outperforms XLM-R_EM for all K, while code-switch prompting achieves comparable or even lower F1 than XLM-R_EM for K = 8, suggesting that the choice of prompt affects few-shot performance greatly and thus needs careful consideration.
On a per-group basis, we find that in-language prompting outperforms the other methods for middle- and low-resource languages. Similar observations can also be drawn from the fully supervised results. We conclude that, with sufficient supervision, in-language is the optimal variant to prompt, rather than code-switch. We hypothesize this is due to the pre-training scenario, where the PLM rarely sees code-mixed text (Santy et al., 2021).

Table 5: Few-shot results by group in micro-F1 (%) on the SMiLER dataset (Seganti et al., 2021), averaged over five runs. We macro-average results for each language group (see Figure 2) and over all languages (X). In-language prompting performs best in most settings and language groups. Our variants are especially strong for medium- and lower-resource language groups. See Table 7 in Appendix C for detailed results with mean and std. for each language.

Zero-shot Results
Table 6 presents the per-language results in zero-shot scenarios. We consider the random baseline for comparison (Zhao and Schütze, 2021; Winata et al., 2021). We notice that the performance of the random baseline varies a lot across languages, since the languages have different numbers of classes in the dataset (cf. Table 3). Code-switch prompting outperforms the random baseline by a large margin, in both word orders, while in-language prompting performs worse than the random baseline in 6 languages. Code-switch prompting outperforms in-language prompting across all 13 non-English languages when using the SVO-template. We assume that, without in-language training, the PLM understands the task best when prompted in English. The impressive performance of code-switch prompting shows that the PLM is able to transfer its pre-trained English knowledge to other languages. We also find that performance is strongly influenced by the number of classes, with the worst F1 scores achieved for EN, KO and PT (36, 28 and 22 classes) and the best scores for AR, RU and UK (9, 8 and 7 classes). In addition, we observe that word order does not play a significant role for most languages, except for FA, an SOV language, which gains 54.5 F1 from in-language prompting with an SOV-template.
For zero-shot cross-lingual transfer, we see that non-English tasks benefit from English in-domain prompt-based fine-tuning, and the F1 gain grows with the English data size. For 5 languages (ES, FA, NL, SV, and UK), zero-shot transfer after training on 268k English examples delivers even better results than in-language fully supervised training (cf. Table 4). Sanh et al. (2022) show that including RC-specific prompt input in English during pre-training can help in other languages.

Discussion
Based on the results above, we answer the research questions from Section 1.
RQ1. What is the most effective way to prompt? In the fully supervised and few-shot scenarios, in-language prompting displays the best results. This appears to stem from its solid performance across all languages in both settings. Its worst result is 31.8 F1, for Polish 8-shot (see Table 7 in Appendix C); all other methods drop below 15.0 F1 for some language. This indicates that, with little supervision, mT5 is able to perform the task when prompted in the language of the original text. However, the zero-shot results strongly favor code-switch prompting. It could follow that, without fine-tuning, the model's understanding of this task is much better in English.
RQ2. How well does our method perform in different data regimes and languages? Averaged over all languages, all our variants outperform the baselines, except in the 8-shot setting. For some high-resource languages, XLM-R_EM is able to outperform our method. For low-resource languages, on the other hand, null prompts are the stronger baseline, which we consistently outperform. This could indicate that prompting the underlying mT5 model is better suited for multilingual RC on SMiLER. Overall, the results suggest that minimal translation can be very helpful for multilingual relation classification.

Related Work
Multilingual relation classification Previous work in multilingual RC has primarily focused on traditional methods rather than prompting PLMs. Faruqui and Kumar (2015) machine-translate non-English text to English to deal with multilinguality. Akbik et al. (2016) employ a shared semantic role labeler to obtain language-agnostic abstractions and apply rule-based methods to classify them. Lin et al. (2017) employ convolutional networks to extract relation embeddings from texts, and propose cross-lingual attention between relation embeddings to model cross-lingual information consistency. Sanh et al. (2019) leverage the embeddings of a BiLSTM trained on a set of selected semantic tasks to help (multilingual) relation extraction. Köksal and Özgür (2020) fine-tune (multilingual) BERT, classifying the embedding at [CLS]. To also take entity-related embeddings into consideration, Nag et al. (2021) add an extra summarization layer on top of a multilingual BERT to collect and pool the embeddings at both [CLS] and the entity starts.

Multilingual prompting Multilingual prompting is a new yet fast-growing topic. Winata et al. (2021) reduce handcrafting efforts by reformulating general classification tasks into binary classification with answers restricted to true or false for all languages. Huang et al. (2022) propose a unified multilingual prompt by introducing a so-called "two-tower" encoder, with the template tower producing a language-agnostic prompt representation and the context tower encoding text information. Fu et al. (2022) manually translate prompts and suggest multilingual multi-task training to boost performance on a target downstream task.

Conclusion
In this paper, we present a first, simple yet efficient and effective prompt method for multilingual relation classification, which translates only the relation labels. Our prompting outperforms fine-tuning and null prompts in fully supervised and few-shot experiments. With supervised data, in-language prompting enjoys the best performance, while in zero-shot scenarios prompting in English is preferable. We attribute the good performance of our method to its well-suitedness for RC, with the derivation of entity1-relation-entity2 prompts from relation triples. We would like to see our method extended to similar tasks, such as semantic role labeling, with a structure between concepts that can be described in natural language.

Limitations
We acknowledge the main limitation of this work is that we only experiment on one dataset with 14 languages.Multilingual RC datasets prior to SMiLER are limited in the coverage of languages or in the size of unique training examples.It would be interesting to see how our method performs on other multilingual RC datasets, especially for underrepresented languages (Winata et al., 2022).
We restrict the target language to be supported by the underlying PLM. The popular multilingual PLMs mT5 and mBART include 101 and 25 languages during pre-training, respectively. We rely on these PLMs and thus cannot study true low-resource languages that are not represented in them (Aji et al., 2022).
It is noticeable that in the fully supervised scenario, for 7 out of the 14 languages, at least one method achieves a micro-F1 score over 0.95. We hypothesize that this is due to high homogeneity within and between the train and test splits. If so, the dataset itself might not be challenging, which could indicate that the results mostly measure how well the model can (quickly) fit a few indicators.
Like most other prompt methods, ours requires the label names to be natural language which are indicative of the class.Therefore, our method would suffer from labels being non-descriptive.
Fully supervised It takes 5 hours for 1 run with mT5_BASE and a prompt method (CS, SP and IL) on either English, or all other languages in total. With XLM-R_EM the running time is 3 hours.
Few-shot It takes 20 (8-shot), 26 (16-shot), and 36 minutes (32-shot) for 1 run with mT5 BASE and a prompt method over all languages.With XLM-R EM the running time is 8 minutes.
Zero-shot For zero-shot in-context experiments, it takes 6 minutes with mT5 BASE and a prompt method over all languages.For zero-shot cross-lingual transfer, the running time equals English training time (5 hours) plus inference-only time (6 minutes).

Figure 1: Overview of our approach. Given a plain text x containing head entity e_h and tail entity e_t from language L, we first apply the template T(x) = "x. e_h ____ e_t" and yield the prompt input with a blank. The PLM then aims to fill in the relation at the blank. In code-switch prompting, the target sequence is the English relation verbalization. In in-language prompting, the target is the relation name translated into L.

Figure 2: Pre-training and fine-tuning dataset size by language. Four language groups are distinguishable: English (green) has by far the largest dataset; many other European languages (orange) have large datasets for both pre-training and fine-tuning; the three non-European languages (blue) have either less pre-training or less fine-tuning data; and the lowest-resource languages (yellow) are Swedish and Ukrainian.
Baselines

EN(B) (Seganti et al., 2021) EN(B) is the baseline proposed together with the SMiLER dataset. They fine-tune BERT_BASE on the English training split and report the micro-F1 on the English test split. BERT_BASE has 110M parameters.

Table 1: Example prompt inputs, e.g., "Titanic is directed by Cameron. Titanic _______ Cameron." and "Goethe schrieb die Tragödie Faust. Faust _______ Goethe." (German: "Goethe wrote the tragedy Faust.").

Table 2: Statistics of the lengths of the verbalizations over several classification tasks. The lengths for non-RC tasks depend on the tokenizers from the respective PLMs in the cited work; the lengths for RC tasks are based on the mT5_BASE tokenizer. Mean and std. show that the label space of the RC task is more complex than that of most few-class classification tasks. The verbalizations of the RC datasets are listed in Appendix B. For SemEval, the two possible directions of a relation are combined. For NYT, we use the version from Zeng et al. (2018). For SMiLER, "EN" is the English split; "ALL" contains all data from the 14 languages.

Table 3: Statistics of the 14 languages in the SMiLER dataset, including the number of classes, the number of training examples (in thousands), and the maximum text length over the train and test splits. Appended to the table are the sizes (in billions of tokens) of the pre-training corpora of the respective languages for mT5 and XLM-R.
Table 4 presents the experimental results in the fully supervised scenario, for different methods, languages, and language groups.

Table 6: Zero-shot results in micro-F1 (%) on the SMiLER dataset. "SVO" and "SOV": word order of prompting. Overall, code-switch prompting performs best in the zero-shot in-context scenario. In the cross-lingual transfer experiments, English-task training greatly improves the performance on all the other 13 languages.