MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, both due to the MT systems and linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.


Introduction
Relation extraction (RE), defined as the task of identifying and classifying semantic relationships between entities from text (cf. Figure 1), is a fundamental task in information extraction (Doddington et al., 2004). Extending RE to multilingual settings has recently received increased interest (Zou et al., 2018; Nag et al., 2021; Chen et al., 2022c), both to address the urgent need for more inclusive NLP systems that cover more languages than just English (Ruder et al., 2019; Hu et al., 2020), as well as to investigate language-specific phenomena and challenges relevant to this task. The main bottleneck for multilingual RE is the lack of supervised resources comparable in size to large English datasets (Riedel et al., 2010; Zhang et al., 2017), as annotation for new languages is very costly. Most of the few existing multilingual RE datasets are distantly supervised (Köksal and Özgür, 2020; Seganti et al., 2021; Bhartiya et al., 2022), and hence suffer from noisy labels that may reduce the prediction quality of models (Riedel et al., 2010; Xie et al., 2021). Available fully-supervised datasets are small, and cover either very few domain-specific relation types (Arviv et al., 2021; Khaldi et al., 2022), or only a small set of languages (Nag et al., 2021).
To address this gap, and to incentivize research on supervised multilingual RE, we introduce a multilingual version of one of the most prominent supervised RE datasets, TACRED (Zhang et al., 2017). MultiTACRED is created by machine-translating TACRED instances and automatically projecting their entity annotations. Machine translation is a popular approach for generating data in cross-lingual learning (Hu et al., 2020; Nag et al., 2021). Although the quality of machine-translated data may be lower due to translation and alignment errors (Yarmohammadi et al., 2021), it has been shown to be beneficial for classification and structured prediction tasks (Hu et al., 2020; Ozaki et al., 2021; Yarmohammadi et al., 2021).
The MultiTACRED dataset we present in this work covers 12 languages from 9 language families. We select typologically diverse languages which span a large set of linguistic phenomena such as compounding, inflection and pronoun-drop, and for which a monolingual pretrained language model is available. We automatically and manually analyze translation and annotation projection quality in all target languages, both in general terms and with respect to the RE task, and identify typical error categories for alignment and translation that may affect model performance. We find that overall translation quality is judged to be quite good with respect to the RE task, but that e.g. pronoun-dropping, coordination and compounding may cause alignment and semantic errors that result in erroneous instances. In addition, we experimentally evaluate fine-tuned pretrained mono- and multilingual language models (PLM) in common training scenarios, using source language (English), target language, or a mixture of both as training data. We also evaluate a model fine-tuned on English data on back-translated test instances to estimate the effect of noise introduced by the MT system on model performance. Our results show that in-language training works well, given a suitable PLM. Cross-lingual zero-shot transfer is acceptable for languages well-represented in the multilingual PLM, and combining English and target language data for training considerably improves performance across the board.
To summarize, our work aims to answer the following research questions: Can we reaffirm the usefulness of MT and cross-lingual annotation projection, in our study for creating large-scale, high-quality multilingual datasets for RE? How do pretrained mono- and multilingual encoders compare to each other, in within-language as well as cross-lingual evaluation scenarios? Answers to these questions can provide insights for understanding language-specific challenges in RE, and further research in cross-lingual representation and transfer learning. The contributions of this paper are:
• We introduce MultiTACRED, a translation of the widely used, large-scale TACRED dataset into 12 typologically diverse target languages: Arabic, German, Spanish, French, Finnish, Hindi, Hungarian, Japanese, Polish, Russian, Turkish, and Chinese.
• We present an evaluation of monolingual, cross-lingual, and multilingual models to evaluate target language performance for all 12 languages.
• We present insights into the quality of machine translation for RE, analyzing alignment as well as language-specific errors.

Translating TACRED
We first briefly introduce the original TACRED dataset, and then describe the language selection and automatic translation process. We wrap up with a description of the analyses we conduct to verify the translation quality.

The TACRED dataset
TACRED (Zhang et al., 2017) is a large-scale, crowd-annotated English RE dataset of 106,264 instances covering 41 relation types plus a no_relation class, built over newswire and web text from the TAC KBP corpus. Alt et al. (2020) and Stoica et al. (2021) improved upon the label quality of the crowd annotations by re-annotating large parts of the dataset.

Automatic Translation
We translate the complete train, dev and test splits of TACRED into the target languages, and in addition back-translate the test split into English to generate machine-translated English test data. Each instance in the original TACRED dataset is a list of tokens, with the head and tail entity arguments of the potential relation specified via token offsets. For translation, we concatenate tokens with whitespace and convert head and tail entity offsets into XML-style markers to denote the arguments' boundaries, as shown in Figure 1. We use the commercial services of DeepL and Google, since both offer the functionality to preserve XML tag markup. Since API costs are similar, we use DeepL for most languages, and only switch to Google for languages not supported by DeepL (at the time we were running the MT). We validate the translated text by checking the syntactic correctness of the XML tag markup, and discard translations with invalid tag structure, e.g. missing or invalid head or tail tag pairs. After translation, we tokenize the translated text using language-specific tokenizers (see Appendix A for details). Finally, we store the translated instances in the same JSON format as the original TACRED English dataset, with fields for tokens, entity types and offsets, label and instance id. We can then easily apply the label corrections provided by e.g. Alt et al. (2020) or Stoica et al. (2021) to any target language dataset by applying the respective patch files.
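For illustration, the marker insertion, markup validation, and annotation projection steps could look roughly as follows. This is a minimal sketch, not the released MultiTACRED code; the marker names, helper functions, and the language-specific tokenize callable are placeholders.

```python
import re

# Placeholder marker names; the tags actually sent to the MT API may differ.
H_OPEN, H_CLOSE, T_OPEN, T_CLOSE = "<h>", "</h>", "<t>", "</t>"

def mark_entities(tokens, head, tail):
    """Join tokens with whitespace and wrap the head/tail spans
    (given as inclusive [start, end] token offsets) in XML-style markers."""
    out = []
    for i, tok in enumerate(tokens):
        if i == head[0]:
            out.append(H_OPEN)
        if i == tail[0]:
            out.append(T_OPEN)
        out.append(tok)
        if i == head[1]:
            out.append(H_CLOSE)
        if i == tail[1]:
            out.append(T_CLOSE)
    return " ".join(out)

def is_valid_markup(translated):
    """A translation is kept only if each marker occurs exactly once and
    every opening tag precedes its closing tag."""
    for open_tag, close_tag in [(H_OPEN, H_CLOSE), (T_OPEN, T_CLOSE)]:
        if translated.count(open_tag) != 1 or translated.count(close_tag) != 1:
            return False
        if translated.index(open_tag) > translated.index(close_tag):
            return False
    return True

def project_annotations(translated, tokenize):
    """Split a valid translation at the markers, tokenize each segment with a
    language-specific `tokenize` callable, and derive target token offsets."""
    parts = re.split(r"(</?[ht]>)", translated)
    tokens, offsets, open_at = [], {}, {}
    for part in parts:
        if part in (H_OPEN, T_OPEN):
            open_at["head" if part == H_OPEN else "tail"] = len(tokens)
        elif part in (H_CLOSE, T_CLOSE):
            key = "head" if part == H_CLOSE else "tail"
            offsets[key] = (open_at[key], len(tokens) - 1)  # inclusive span
        elif part.strip():
            tokens.extend(tokenize(part.strip()))
    return tokens, offsets
```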
We select target languages to cover a wide set of interesting linguistic phenomena, such as compounding (e.g., German), inflection/derivation (e.g., German, Turkish, Russian), pronoun-dropping (e.g., Spanish, Finnish, Polish), and varying degrees of synthesis (e.g., Turkish, Hungarian vs. Chinese). We also try to ensure that there is a monolingual pretrained language model available for each language, which is the case for all languages except Hungarian. The final set of languages in MultiTACRED is: German, Finnish, Hungarian, French, Spanish, Arabic, Hindi, Japanese, Chinese, Polish, Russian, and Turkish. Table 6 in Appendix A lists key statistics per language.

Translation Quality Analysis
To verify the overall quality of the machine-translated data, we also manually inspect translations. For each language, we randomly sample 100 instances from the train split. For each sampled instance, we display the source (English) text with entity markup (see Figure 1 for the format), the target language text with entity markup, and the relation label.
We then ask native speakers to judge the translations by answering two questions: (Q1) Does the translated text meaningfully preserve the semantic relation of the English original, regardless of minor translation errors? (Q2) Is the overall translation linguistically acceptable for a native speaker? Human judges are instructed to read both the English source and the translation carefully, and then to answer the two questions with either yes or no. They may also add free-text comments, e.g. to explain their judgements or to describe translation errors. The samples of each language are judged by a single native speaker. Appendix B gives additional details.
In addition, we conduct a manual analysis of the automatically discarded translations, using a similar-sized random sample from the German, Russian and Turkish train splits, to identify possible reasons and error categories. These analyses are performed by a single trained linguist per language, who is also a native speaker of that language, with joint discussions to synthesize observations. Results of both analyses are presented in Section 4.1.

Experiments
In this section, we describe the experiments we conduct to answer the research questions "How does the performance of language-specific models compare to the English original?" and "How does the performance of language-specific models compare to multilingual models such as mBERT trained on the English source data, and how does it change when target-language data is included for training?". We first introduce the training scenarios, and then give details on the choice of models and hyperparameters, as well as the training process.

Training scenarios
We evaluate the usefulness of the translated datasets by following the most prevalent approach of framing RE as a sentence-level supervised multi-class classification task. Formally, given a relation set R and a text x = [x_1, x_2, ..., x_n] (where x_1, ..., x_n are tokens) with two disjoint spans e_h = [x_i, ..., x_j] and e_t = [x_k, ..., x_l] denoting the head and tail entity mentions, RE aims to predict the relation r ∈ R between e_h and e_t, or assign the no_relation class if no relation in R holds. Similar to prior work (e.g., Nag et al. (2021)), we evaluate relation extraction models in several different transfer learning setups, which are described next.
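Concretely, a classification instance can be represented as follows. This is a minimal sketch; the field names mirror the JSON format described in Section 2.2 but are otherwise illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class REInstance:
    id: str                 # instance id, carried over from TACRED
    tokens: List[str]       # x = [x_1, ..., x_n]
    head: Tuple[int, int]   # inclusive token offsets of e_h
    tail: Tuple[int, int]   # inclusive token offsets of e_t
    head_type: str          # entity type of the head argument, e.g. PERSON
    tail_type: str          # entity type of the tail argument, e.g. ORGANIZATION
    label: str              # relation r in R, or "no_relation"
```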
Monolingual We evaluate the performance of language-specific PLMs for each of the 12 target languages, plus English, where the PLM is fine-tuned on the train split of the respective language.

Cross-lingual We evaluate the performance of a multilingual mBERT model on the test split of each of the 12 target languages, plus English, after training on the English train split.

Mixed / Multilingual We evaluate the performance of a multilingual mBERT model on the test split of each of the 12 target languages, after training on the complete English train split and a variable portion of the train split of the target language, as suggested e.g. by Nag et al. (2021). We vary the amount of target language data in {5%, 10%, 20%, 30%, 40%, 50%, 100%} of the available training data; a sketch of this mixing step is given below. When using 100%, we effectively double the size of the training set, since every English training instance then also appears in its translated form.

Back-translation Finally, we also evaluate the performance of a BERT model fine-tuned on the original (untranslated) English train split on the test sets obtained by back-translating from each target language.
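The mixed training sets can be constructed along the following lines. This is a minimal sketch; whether the target-language portion is drawn randomly or as a fixed subset is an implementation detail, so the random sampling and the fixed seed below are assumptions.

```python
import random

def build_mixed_train(english_instances, target_instances, portion, seed=42):
    """Combine the full English train split with a `portion`
    (0.05, 0.1, ..., 1.0) of the translated target-language train split."""
    rng = random.Random(seed)
    k = int(len(target_instances) * portion)
    sampled = rng.sample(target_instances, k)
    mixed = list(english_instances) + sampled
    rng.shuffle(mixed)
    return mixed
```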

Training Details and Hyperparameters
We implement our experiments using the Hugging Face (HF) Transformers library (Wolf et al., 2020), Hydra (Yadan, 2019) and PyTorch (Paszke et al., 2019). Due to the availability of pretrained models for many languages and to keep things simple, we use BERT as the base PLM (Devlin et al., 2019).
We follow Baldini Soares et al. (2019) and enclose the subject and object entity mentions with special token pairs, modifying the input to become "[HEAD_START] subject [HEAD_END] ... [TAIL_START] object [TAIL_END]". In addition, we append the entity types of subject and object to the input text as special tokens, after a separator token: "... [SEP] [HEAD=type] [SEP] [TAIL=type]", where type is the entity type of the respective argument. We use the final hidden state representation of the [CLS] token as the fixed-length representation of the input sequence that is fed into the classification layer.
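A possible implementation of this input construction with the HF Transformers API is sketched below. The special token strings follow the description above, while the model identifier, the way type tokens are registered, and the use of BertForSequenceClassification (whose classification head builds on the [CLS] representation) are assumptions rather than the exact experimental code.

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

def build_input(tokens, head, tail, head_type, tail_type):
    """Insert argument markers around the head/tail spans (inclusive token
    offsets) and append the entity types after separator tokens, as above."""
    out = []
    for i, tok in enumerate(tokens):
        if i == head[0]:
            out.append("[HEAD_START]")
        if i == tail[0]:
            out.append("[TAIL_START]")
        out.append(tok)
        if i == head[1]:
            out.append("[HEAD_END]")
        if i == tail[1]:
            out.append("[TAIL_END]")
    out += ["[SEP]", f"[HEAD={head_type}]", "[SEP]", f"[TAIL={tail_type}]"]
    return " ".join(out)

# The markers (and, analogously, the entity-type tokens) are registered as
# additional special tokens so the tokenizer keeps them as single units.
# "bert-base-cased" stands in for the per-language model identifiers (Appendix C).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
tokenizer.add_special_tokens({"additional_special_tokens":
    ["[HEAD_START]", "[HEAD_END]", "[TAIL_START]", "[TAIL_END]"]})

# 42 classes = 41 TACRED relation types + no_relation.
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=42)
model.resize_token_embeddings(len(tokenizer))  # account for the added tokens
```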
We train with a batch size of 8 for 5 epochs, and optimize for cross-entropy. The maximum sequence length is 128 for all models. We use AdamW with a scenario-specific learning rate, no warmup, β1 = 0.9, β2 = 0.999, ε = 1e-8, and linear decay of the learning rate. Other hyperparameter values, as well as scenario-specific learning rates and HF model identifiers for the pretrained BERT models, are listed in Appendix C.
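The optimizer and schedule setup described above might look as follows. This is a sketch; the function and variable names are illustrative, and the learning rate is the scenario-specific value from Appendix C.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def make_optimizer_and_scheduler(model, num_train_instances, learning_rate,
                                 epochs=5, batch_size=8):
    """AdamW with the betas/epsilon stated above, no warmup, and linear decay."""
    optimizer = AdamW(model.parameters(), lr=learning_rate,
                      betas=(0.9, 0.999), eps=1e-8)
    total_steps = epochs * (num_train_instances // batch_size)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps)
    return optimizer, scheduler, torch.nn.CrossEntropyLoss()
```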
We use micro-F1 as the evaluation metric, and report the median result of 5 runs with different, fixed random seeds. For all experiments, we use the revised version of TACRED presented by Alt et al. (2020), which fixes a large portion of the dev and test labels. We report scores on the test set in the respective target language, denoted as test_L. Due to the automatic translation and validation, training and test sets differ slightly across languages, and absolute scores are thus not directly comparable across languages. We therefore also report scores on the intersection test set of instances available in all languages (test_∩). This test set contains 11,874 instances, i.e. 76.6% of the original test set (see also Table 6).
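The evaluation can be sketched as below. Excluding no_relation from the micro average, as is common practice for TACRED scoring, is an assumption here, as is the "id" field name used to build the intersection test set.

```python
from statistics import median
from sklearn.metrics import f1_score

def micro_f1(y_true, y_pred):
    """Micro-F1 over the positive relation classes (no_relation excluded)."""
    labels = sorted((set(y_true) | set(y_pred)) - {"no_relation"})
    return f1_score(y_true, y_pred, labels=labels, average="micro")

def intersection_ids(test_splits_by_language):
    """IDs of test instances whose translation survived validation in all
    languages; the argument maps language codes to lists of instance dicts."""
    return set.intersection(
        *({inst["id"] for inst in split} for split in test_splits_by_language.values()))

def median_score(scores_per_seed):
    """Median micro-F1 over the 5 runs with different, fixed random seeds."""
    return median(scores_per_seed)
```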

Translation Quality
Automatic validation As described in Section 2.2, we validate the target language translation by checking whether the entity mention tag markup was correctly transferred. On average, 2.3% of the instances were considered invalid after translation. By far the largest number of such errors occurred when translating to Japanese (9.6% of translated instances), followed by Chinese (4.5%) and Spanish (3.8%). Table 6 in Appendix A gives more details, and shows the number of valid translations for each language, per split and also for the back-translation of the test split. Back-translation incurred only half as many additional errors as compared to the initial translation of the test split into the target language, presumably because 'hard' examples had already been filtered out during the first translation step.
The validation basically detects two types of alignment errors: missing and additional alignments. An alignment may be missing in the case of pro-drop languages, where the argument is not realized in the translation (e.g. Spanish, Chinese), or in compound noun constructions in translations (e.g. in German). In other cases, the aligner produces multiple, disjoint spans for one of the arguments, e.g. in the case of coordinated conjunctions or compound constructions with different word order in the target language (e.g. in Spanish, French, Russian). Table 8 in Appendix D lists more examples for the most frequent error categories we observed.

Manual Validation Table 1 shows the results of the manual analysis of translations. With regard to Q1, on average 87.5% of the translations are considered to meaningfully express the relation, i.e. as in the original text. Overall translation quality is judged to be good for 83.7% of the sampled instances on average across languages. The most frequent error types noted by the annotators are again alignment errors, such as aligning a random (neighboring) token from the sentence with an English pronoun argument in pronoun-dropping languages (e.g. Polish, Chinese), and non-matching spans (inclusion/exclusion of tokens in the aligned span). Similar errors have also been observed in a recent study by Chen et al. (2022b). In highly inflecting languages such as Finnish or Turkish, the aligned entity often changes morphologically (e.g. possessive/case suffixes); inflection and compounding could ideally be addressed by introducing alignment/argument span boundaries at the morpheme level, but this in turn may raise issues with e.g. PLM tokenization and entity masking. Other typical errors are uncommon or wrong word choices (e.g. due to missing or wrongly interpreted sentence context), and the omission of parts of the original sentence. Less frequent errors include atypical input which was not translated correctly (e.g. sentences consisting of a list of sports results), and non-English source text (approx. 1% of the data, see also Stoica et al. (2021)). Table 8 also lists examples for these error categories.

Model Performance
Monolingual The results in Table 2 show that language-specific models perform reasonably well for many of the evaluated languages, although, as various researchers have pointed out, model performance may be over-estimated since the models may be affected by "translationese" (Riley et al., 2020; Graham et al., 2020). Their lower performance compared to English may be due to several reasons: translation errors, smaller train and test splits because of the automatic validation step, the quality of the pre-trained BERT model, as well as language-specific model errors.
Results on the intersection test set test_∩ are slightly higher on average, as compared to test_L. Relative differences to English, and the overall 'ranking' of language-specific results, remain approximately the same. This reaffirms the performance differences between languages observed on test_L. It also suggests that the intersection test set contains fewer challenging instances. For Hindi, these results, in combination with the low manual evaluation score of 67% correct translations, suggest that the translation quality is the main reason for the performance loss (see Appendix C for an additional discussion of Hindi performance issues).
We conclude that for the monolingual scenario, machine translation is a viable strategy to generate supervised data for relation extraction for most of the evaluated languages. Fine-tuning a language-specific PLM on the translated data yields reasonable results that are not much lower than those of the English model for many tested languages.

Cross-lingual In the cross-lingual setting, micro-F1 scores are lower than in the monolingual setting for many languages (see Table 3). The micro-F1 scores for languages well-represented in mBERT's pretraining data (e.g., English, German, Chinese) are close to their monolingual counterparts, whereas for languages like Arabic, Hungarian, Japanese, or Turkish, we observe a loss of 4.7 to 9.7 F1 points. This is mainly due to much lower recall; for example, the median recall for Japanese is only 51.3. The micro-F1 scores are highly correlated with the pretraining data size of each language in mBERT: the Spearman rank correlation coefficient of the micro-F1 scores on test_L with the WikiSize reported in Wu and Dredze (2020) is r_s = 0.82, and the Pearson correlation coefficient is r_p = 0.78. Hence, languages which are less well represented in mBERT's pretraining data exhibit worse relation extraction performance, as they do not benefit as much from the pretraining.
Precision, Recall and F1 on the intersection test set test_∩ are again slightly better on average than the scores on test_L. For Hindi, our results reaffirm the observations made by Nag et al. (2021) for cross-lingual training using only English training data. Our results for RE also confirm prior work on the effectiveness of cross-lingual transfer learning for other tasks (e.g., Conneau et al. (2020); Hu et al. (2020)). While results are lower than in the monolingual setting, they are still very reasonable for well-resourced languages such as German or Spanish, with the benefit of incurring no translation cost at all for training. However, for languages that are less well-represented in mBERT, using a language-specific PLM in combination with in-language training data produces far better results.

Table 3 (caption excerpt): Languages with less pretraining data in mBERT suffer a larger performance loss.

Mixed/Multilingual Table 4 shows the results obtained when training on both English and varying amounts of target language data. We observe a considerable increase in mBERT's performance for languages that are not well represented in mBERT's pretraining data, such as Hungarian. These languages benefit especially from adding in-language training data, in some cases even surpassing the performance of their respective monolingual model. For example, mBERT trained on the union of the English and the complete Japanese train splits achieves a micro-F1 score of 73.3, 11.2 points better than the cross-lingual score of 62.1 and 1.5 points better than the 71.8 obtained by the monolingual model on the same test data. Languages like German, Spanish, and French do not benefit much from adding small amounts of in-language training data in our evaluation, but show some improvements when adding 100% of the target language training data (last row), i.e. when essentially doubling the size of the training data. Other languages, like Finnish or Turkish, show improvements over the cross-lingual baseline, but do not reach the performance of their monolingual counterparts.

Table 4: Micro-F1 scores on the TACREV dataset for the mixed/multilingual setting. The table shows the median micro-F1 score across 5 runs, on the translated test split of the target language, when training mBERT on the full English train split and various portions, from 5% to 100%, of the translated target language train split. The last column shows the mean improvement across languages, compared to the cross-lingual baseline. Micro-F1 scores improve when adding in-language training data for languages not well represented in mBERT, while other languages mainly benefit when using all of the English and in-language data, i.e. essentially doubling the amount of training data (last row).

Our results confirm observations made by Nag et al. (2021), who also find improvements when training on a mixture of gold source language data and projected silver target language data. For the related task of event extraction, Yarmohammadi et al. (2021) also observe that the combination of data projection via machine translation and multilingual PLMs can lead to better performance than any one cross-lingual strategy on its own.

Back-translation Finally, Table 5 shows the performance of the English model evaluated on the back-translated test splits of all target languages. Micro-F1 scores range from 69.6 to 76.1, and are somewhat lower than the score of 77.1 achieved by the same model on the original test set. For languages like German, Spanish, and French, scores are very close to the original, while for Arabic and Hungarian, we observe a loss of approximately 7 percentage points. These differences may be due to the varying quality of the MT systems per language pair, but may also indicate that the model cannot always handle the linguistic variance introduced by the back-translation.

Conclusion
We introduced a multilingual version of the large-scale TACRED relation extraction dataset, obtained via machine translation and automatic annotation projection. Baseline experiments with in-language as well as cross-lingual transfer learning models showed that MT is a viable strategy to transfer sentence-level RE instances and span-level entity annotations to typologically diverse target languages, with target language RE performance comparable to the English original for many languages. However, we observe that a variety of errors may affect the translations and annotation alignments, both due to the MT system and the linguistic features of the target languages (e.g., compounding, high level of synthesis). MultiTACRED can thus serve as a starting point for deeper analyses of annotation projection and RE challenges in these languages. For example, we would like to improve our understanding of RE annotation projection for highly inflectional/synthetic languages, where token-level annotations are an inadequate solution. In addition, constructing original-language test sets to measure the effects of translationese remains an open challenge.
We plan to publish the translated dataset for the research community, depending on LDC requirements for the original TACRED and the underlying TAC corpus. We will also make publicly available the code for the automatic translation, annotation projection, and our experiments.

Limitations
A key limitation of this work is the dependence on a machine translation system to obtain high-quality translations and annotation projections of the dataset. Depending on the availability of language resources and the MT model quality for a given language pair, the translations we use for training and evaluation may be inaccurate, or be affected by translationese, possibly leading to overly optimistic estimates of model performance. In addition, since the annotation projection for relation arguments is completely automatic, any alignment errors of the MT system will yield inaccurate instances. Alignment is at the token level, rendering it inadequate for e.g. compounding or highly inflectional languages. Due to the significant resource requirements of constructing adequately-sized test sets, another limitation is the lack of evaluation on original-language test instances. While we manually validate and analyze sample translations in each target language (Section 4.1) for an initial exploration of MT effects, these efforts should be extended to larger samples or the complete test sets. Finally, we limited this work to a single dataset, which was constructed with a specific set of target relations (person- and organization-related), from news and web text sources. These text types and the corresponding relation expressions may be well reflected in the training data of current MT systems, and thus easier to translate than relation extraction datasets from other domains (e.g., biomedical), or other text types (e.g., social media). The translated examples also reflect the source language's view of the world, not how the relations would necessarily be formulated in the target language (e.g., use of metaphors, or ignorance of cultural differences).

Ethics Statement
We use the data of the original TACRED dataset "as is". Our translations thus reflect any biases of the original dataset and its construction process, as well as biases of the MT models (e.g., rendering gender-neutral English nouns to gendered nouns in a given target language). The authors of the original TACRED dataset (Zhang et al., 2017) have not stated measures that prevent collecting sensitive text. Therefore, we do not rule out the possible risk of sensitive content in the data. Furthermore, we utilize various BERT-based PLMs in our experiments, which were pretrained on a wide variety of source data. Our models may have inherited biases from these pretraining corpora.
Training jobs were run on a machine with a single NVIDIA RTX6000 GPU with 24 GB RAM. Running time per training/evaluation is approximately 1.5 hours for the monolingual and cross-lingual models, and up to 2 hours for the mixed/multilingual models that are trained on English and target language data.

A Translation Details
We use the following parameter settings for DeepL API calls: split_sentences:1, tag_handling:xml, outline_detection:0. For Google, we use format_:html, model:nmt. Table 6 shows the number of syntactically valid and invalid translations for each language and split, as well as for the back-translation of the test split.
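For illustration, such a DeepL request could look as follows. This is a sketch against the publicly documented DeepL v2 REST API; the endpoint, authentication handling, and error handling are simplified and are not the exact scripts used for MultiTACRED.

```python
import requests

def translate_deepl(text, target_lang, auth_key):
    """Translate one sentence (with XML entity markers) with the parameters above."""
    resp = requests.post(
        "https://api.deepl.com/v2/translate",
        data={
            "auth_key": auth_key,          # account credential (placeholder)
            "text": text,                  # sentence with XML entity markers
            "target_lang": target_lang,    # e.g. "DE", "ES", "ZH"
            "split_sentences": "1",
            "tag_handling": "xml",
            "outline_detection": "0",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["translations"][0]["text"]
```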
The translation costs per language amount to approximately 460 Euro, for a total character count of 22.9 million characters to be translated (source sentences including entity markup tags), at a price of 20 Euro per 1 million characters at the time of writing. Compared to an estimated annotation cost of approximately 10K USD, translation costs amount to less than 5% of the cost of fully annotating a similar-sized dataset in a new language.

B Human Translation Analysis
For the manual analysis of translated TACRED instances, we recruited a single native speaker for each language among the members of our lab and associated partners. Annotators were not paid for the task, but performed it as part of their work at the lab. All annotators are either Master's degree or PhD students, with a background in Linguistics, Computer Science, or a related field. The full instructions given to annotators, after a personal introduction to the task, are shown in Figure 2.

C Additional Training Details
All pre-trained models evaluated in this study are used as they are available from HuggingFace's model hub, without any modifications. Our implementation uses HF's BertForSequenceClassification implementation with default settings for dropout, positional embeddings, etc. Licenses for the pretrained BERT models are listed in Table 7. The HF Transformers library is released under the Apache 2.0 license, Hydra under the MIT license, and PyTorch uses a modified BSD license.
For Hungarian, we use bert-base-multilingual-cased, since there is no pretrained Hungarian BERT model available on the hub. For Hindi, we tried several models by l3cube-pune, neuralspace-reverie, google and ai4bharat, but all of these produced far worse results than the ones reported here for l3cube-pune/hindi-bert-scratch. Interestingly, using bert-base-multilingual-cased instead of l3cube-pune/hindi-bert-scratch as the base PLM produced far better results for Hindi in the monolingual setting, at 71.1 micro-F1.