REDFM: a Filtered and Multilingual Relation Extraction Dataset

Relation Extraction (RE) is the task of identifying relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SREDFM, an automatically annotated dataset covering 18 languages, 400 relation types, and 13 entity types, totaling more than 40 million triplet instances. Second, we propose REDFM, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, which extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at [https://www.github.com/babelscape/rebel](https://www.github.com/babelscape/rebel).


Introduction
The vast majority of online and offline content consists of raw, natural language text containing factual information. Current Large Language Models (LLMs) are pretrained on such text, allowing reasoning over it through tasks such as Question Answering (Bouziane et al., 2015) or Text Summarization (El-Kassas et al., 2021). On the other hand, structured resources such as Knowledge Graphs enable knowledge-based, explainable, machine-ready reasoning over their content. Both approaches are important and are widely used within Natural Language Processing systems, with recent trends looking at combining them (Yamada et al., 2020; Sun et al., 2021).
Information Extraction tackles the need for systems that extract structured information from raw text. Specifically, end-to-end Relation Extraction extracts the relational information between entities in a given text, providing a structured prediction. However, although some highly capable systems have been released (Wang and Lu, 2020; Paolini et al., 2021; Huguet Cabot and Navigli, 2021), there are few high-quality, contemporary resources. Current RE datasets are outdated, behind paywalls, have design flaws, or only consider English. While multilingual datasets exist, such as ACE05 or SMiLER (Seganti et al., 2021), the former covers only six relation types and requires a paid license for its use. The latter is more recent, bigger, and has a higher coverage of relation types, but it does not contain human-annotated samples that permit reliable evaluation, and is not conducive to training end-to-end Relation Extraction systems. Instead, the availability of large high-quality resources is fundamental in order to allow LLMs to be trained and evaluated on trustworthy multilingual RE benchmarks.
In this paper, we introduce large amounts of high-coverage RE annotated data in a multilingual fashion. Our new resources will enable both the training of multilingual RE systems and their evaluation. In particular, we provide three main contributions:
1. We present REDFM, our human-revised dataset covering 32 relation types and 7 languages.
2. We introduce SREDFM, a silver-standard dataset based on interconnecting Wikipedia and Wikidata, filtered by a Critic system trained on human annotations. It covers 400 relation types, 18 languages, and more than 44M triplet instances. Both datasets are automatically enriched with entity-type information using a novel entity typing approach.
3. We demonstrate the usefulness of these new resources by releasing mREBEL, a multilingual system for Relation Classification and Relation Extraction that extracts entity types.
Related Work

Relation Extraction
In Relation Extraction (RE), the goal is to identify all triplets, each composed of a subject, an object, and a relation between them, within a given text. Early approaches to RE split the task into two sub-tasks: Named Entity Recognition (NER) (Nadeau and Sekine, 2007), which identifies all entities, and Relation Classification (Bassignana and Plank, 2022), which classifies the relationship, or lack thereof, between them. However, errors from the NER system may be propagated to the subsequent module, and the information shared between the two tasks is left unexploited. Recent works have tackled RE in an end-to-end fashion, seeking to overcome these problems by using different abstractions of the task. Miwa and Sasaki (2014) introduced a table representation and reframed RE as a table-filling task. This idea was further explored and extended by Pawar et al. (2017) and Wang and Lu (2020). However, these systems still had some restrictions, such as assuming that only one relation exists between each pair of entities. Instead, by framing the task as a sequence of triplets to be decoded, seq2seq approaches (Paolini et al., 2021; Huguet Cabot and Navigli, 2021) provided more flexibility to the RE task and lifted some of these restrictions. Nevertheless, seq2seq models are notoriously data-hungry, hence vast amounts of data are needed to enable them to learn the task with satisfactory scores.
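The seq2seq abstraction can be illustrated with a minimal linearization scheme: the set of triplets becomes a single target string that the decoder emits and that can be parsed back deterministically. The marker tokens below are illustrative placeholders, not the exact special tokens used by any of the cited systems.

```python
# Linearize (subject, relation, object) triplets into one target string
# for a seq2seq decoder, and parse the string back into triplets.
# The marker tokens are illustrative placeholders.
TRIPLET, SUBJ, OBJ = "<triplet>", "<subj>", "<obj>"

def linearize(triplets):
    """Encode a list of (subject, relation, object) tuples as one string."""
    parts = []
    for subj, rel, obj in triplets:
        parts += [TRIPLET, subj, SUBJ, obj, OBJ, rel]
    return " ".join(parts)

def parse(sequence):
    """Decode the linearized string back into (subject, relation, object) tuples."""
    triplets = []
    for chunk in sequence.split(TRIPLET)[1:]:
        subj, _, rest = chunk.partition(SUBJ)
        obj, _, rel = rest.partition(OBJ)
        triplets.append((subj.strip(), rel.strip(), obj.strip()))
    return triplets
```

Because the decoder emits a flat sequence, any number of triplets per sentence can be produced, which is precisely the flexibility that table-filling formulations lacked.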

Relation Extraction Datasets
Manually annotating RE data is a costly and time-consuming process. As a result, many RE datasets have been created using distant supervision methods, such as NYT (Riedel et al., 2010), T-REx (Elsahar et al., 2018) or DocRED (Yao et al., 2019). Despite their widespread use in the RE community, these datasets have limitations. For instance, automatically generated datasets often contain noisy labels, leading to unfair or misleading evaluations. Additionally, there has been a long-standing focus on monolingual relation extraction systems, particularly in English.
The ACE05 benchmark presented some of the first relation extraction datasets in three languages: Arabic, Chinese, and English. However, the focus on Arabic and Chinese quickly faded away, while resources for English continued to grow. One of the main challenges in developing multilingual relation extraction systems is the lack of annotated data for the task. The SMiLER dataset (Seganti et al., 2021), based on distant supervision, uses Wikipedia and Wikidata to create a multilingual relation extraction dataset. However, besides being automatically annotated, SMiLER limits annotations to one triplet per sentence. With this paper, we overcome the limitations of existing datasets by providing a new multilingual evaluation dataset that includes manual annotations and, despite being based on automatic annotation, enables RE with wide coverage and higher quality.

REDFM
In this section, we present REDFM, our supervised and multilingual dataset for Relation Extraction, and SREDFM, a larger silver-annotated dataset covering more languages and relation types. The creation of the datasets consists of several steps: data collection and processing (Section 3.1), manual annotation (Section 3.2), a triplet filtering system (Section 3.3) and entity typing (Section 3.4). Figure 1 shows an overview of this process.

Data Extraction
We base our dataset on Wikidata and Wikipedia, and expand cRocoDiLe, the data extraction pipeline from Huguet Cabot and Navigli (2021), to obtain a large collection of triplets in multiple languages (see Appendix A for more details). We use the hyperlinks from Wikipedia abstracts, i.e. the content before the Table of Contents, as entity mentions and the relations in Wikidata between them. We run our pipeline in the following 18 languages: Arabic, Catalan, Chinese, Dutch, German, Greek, English, French, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Vietnamese. Then, we collapse inverse relations and keep the 400 most frequent ones. We highlight that some extracted relations are not necessarily entailed by the Wikipedia text; therefore, we apply a multilingual NLI system 2 to filter out those with a low entailment score (i.e. < 0.1).
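The entailment-based filter can be sketched as follows. Here `nli_entailment` stands in for any multilingual NLI model that returns the probability of the entailment class for a premise/hypothesis pair, and verbalizing the triplet as plain surface forms is a simplifying assumption:

```python
# Filter distantly supervised triplets with an NLI entailment score.
# `nli_entailment` is a placeholder for any multilingual NLI scorer.
ENTAILMENT_THRESHOLD = 0.1  # permissive: later stages filter further

def verbalize(triplet):
    """Render a (subject, relation, object) triplet as a surface-form hypothesis."""
    subj, rel, obj = triplet
    return f"{subj} {rel} {obj}"

def filter_triplets(context, triplets, nli_entailment,
                    threshold=ENTAILMENT_THRESHOLD):
    """Keep only triplets whose entailment score, using the context as premise,
    reaches the threshold. `nli_entailment(premise, hypothesis)` is assumed
    to return a probability in [0, 1] for the entailment class."""
    return [t for t in triplets
            if nli_entailment(context, verbalize(t)) >= threshold]
```

The low threshold deliberately favors recall here, since the Critic and the manual annotation described below provide stricter filtering downstream.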
Despite using NLI techniques to filter out false positives, distant RE annotations still present noisy labels. This can result in unfair or misleading evaluations, as demonstrated for TACRED (Zhang et al., 2017), which had 23.9% wrongly annotated triplets that were later revised and corrected by Stoica et al. (2021). Moreover, our triplet corpus extraction pipeline relies on existing triplets in Wikidata, similar to T-REx (Elsahar et al., 2018). The T-REx authors showed how certain relation types, such as "capital", have a lower entailment score and may not be entailed by a given text even though the entities involved share the relation in Wikidata.
Given these challenges in distant RE annotation, manual filtering of a portion of the data is necessary to ensure high-quality, accurate annotations.

Manual Annotation
We manually filter a portion of the data to deal with false positives present in the dataset for a subset of languages (i.e. Arabic, Chinese, German, English, French, Italian, and Spanish) through crowdsourced annotation:
1. We reduce the coverage of the annotated data to the top 32 most frequent relation types. See Appendix B for details on each of these types.
2. We select a portion of our silver-annotated data consisting of i) common Wikipedia pages across those languages and ii) a random sample with less frequent relations to balance the dataset.
3. We ask human annotators to validate each triplet. They are shown the context text with subject and object entities highlighted, and the possible relation between them from the silver extraction. They must answer whether the text conveys the necessary information to infer that the relationship between those two entities is true.
4. We annotate each triplet three times using different annotators, obtaining an average inter-rater reliability (Krippendorff's alpha) across languages of α = 0.73.
5. We keep as true positives those relations with at least two annotators answering true. We consider the rest false positives.
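The majority-vote aggregation in steps 4 and 5 amounts to the following sketch; the data layout is illustrative:

```python
def aggregate_votes(annotations):
    """Split triplets into true and false positives by majority vote.
    `annotations` maps each triplet to its three binary annotator judgments
    (1 = the context entails the triplet). A triplet is kept as a true
    positive when at least two of the three annotators judged it entailed."""
    true_pos, false_pos = [], []
    for triplet, votes in annotations.items():
        assert len(votes) == 3, "each triplet is annotated three times"
        (true_pos if sum(votes) >= 2 else false_pos).append(triplet)
    return true_pos, false_pos
```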
We employed Amazon Mechanical Turk and manually selected annotators who qualified for the task in each language. The annotation scheme can be found in Appendix B. From Table 1 we see that around 8% of annotated triplets, on average, were labeled as non-entailed by the context provided, although the percentage varies across languages. For instance, Spanish had a lower agreement across annotators and a higher number of filtered instances.

Triplet Critic
Our manual annotation procedure (Section 3.2) filtered a portion of the silver data in order to have a higher-quality subset on which to train and evaluate our models. However, by removing the negative triplets we disregard valuable information that can be used to improve the quality of the remaining annotations, i.e. all those not validated by humans. Inspired by West et al. (2021), who trained critics based on human annotations on commonsense triplets, we use our annotated triplets, both true and false positives with their contexts, to train a Triplet Critic. Specifically, given a textual context c and an annotated triplet t that may appear in c, we train a cross-encoder T (c, t) to predict whether c, the premise, entails t, the hypothesis. Once T is trained, we can use it on our silver data to filter out other false positives, i.e., triplets that, albeit present in Wikidata for two entities within the context, are not entailed by that context. We test our approach by training T in English, French, Italian, Spanish and German, namely European languages with shared families (Romance and Germanic), and testing on Arabic and Chinese. This setup will test the zero-shot multilingual capabilities on unseen languages in order to determine whether the Critic can be applied to any language. We base our Triplet Critic on DeBERTaV3 (He et al., 2021) with a classification head on top of the [CLS] token that produces a binary prediction, trained using a Cross-Entropy loss criterion. Furthermore, since the task is similar to and inspired by NLI, we explore a multi-task approach using the XNLI dataset (Conneau et al., 2018), aiming at improving cross-lingual performance. To this end, we add an additional linear layer at the end of the model for NLI that projects the output layer to the three possible predictions (neutral, contradiction, entailment), again using a Cross-Entropy loss. Table 2 shows the Triplet Critic results when trained under different setups. 
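The Critic's architecture can be sketched as below. A toy Transformer encoder stands in for DeBERTaV3 (loading the actual pretrained checkpoint is omitted); the point is the shared encoder with two heads over the [CLS] position, a binary triplet-entailment head and a 3-way NLI head for the multi-task setup, each trained with cross-entropy.

```python
import torch
import torch.nn as nn

class TripletCritic(nn.Module):
    """Shared encoder with two classification heads over the [CLS] position:
    a binary triplet-entailment head and a 3-way NLI head (multi-task setup).
    A toy embedding + Transformer encoder stands in for DeBERTaV3 here."""

    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Embedding(vocab_size, hidden),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True),
                num_layers=1,
            ),
        )
        self.critic_head = nn.Linear(hidden, 2)  # triplet entailed: yes / no
        self.nli_head = nn.Linear(hidden, 3)     # neutral / contradiction / entailment

    def forward(self, input_ids, task="critic"):
        cls = self.encoder(input_ids)[:, 0]      # [CLS] is the first position
        head = self.critic_head if task == "critic" else self.nli_head
        return head(cls)
```

During multi-task training, batches from the annotated triplet data and from XNLI would be routed to the corresponding head, with a cross-entropy loss on each; the shared encoder is what transfers signal across tasks and languages.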
We see how the use of our data dramatically improves upon the XNLI baseline, especially in terms of accuracy. While we are primarily interested in precision so as to guarantee that triplets are valid, a low accuracy would lead to missing annotations, which we also want to avoid. Additionally, when our Triplet Critic is trained simultaneously on our data and XNLI, even if no Arabic or Chinese data is used (i.e. +XNLI − ), performance further improves: the system achieves an average 92.6% precision, on par with using only our data, but sees a point increase in recall, which is remarkable given the high class imbalance. In contrast, when XNLI data from those languages is added (i.e. +XNLI), we observe a small trade-off between precision and recall.
Overall, these zero-shot results legitimize the use of our Triplet Critic to refine our silver data for unseen languages, and suggest even more promising benefits for seen languages. Furthermore, the Triplet Critic serves as feedback on the consistency of the human annotations, since the models successfully learned from them.

Entity Typing
In RE datasets, entity types are commonly included in the triplets (Riedel et al., 2010) and are therefore taken into account under the strict evaluation, in which a triplet is considered correctly extracted only when entity boundaries, entity types, and relation type are all predicted correctly; under the boundaries evaluation, only entity boundaries and relation type are taken into account (Taillé et al., 2020). This section describes the procedure through which we automatically label entities with their types.
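The difference between the two evaluation regimes can be made concrete with a small scoring sketch (the triplet layout is illustrative):

```python
def micro_f1(pred, gold, strict=True):
    """Micro F1 over extracted triplets. Each triplet is a tuple
    (subj_span, subj_type, obj_span, obj_type, relation).
    Under the boundaries evaluation entity types are ignored;
    under the strict evaluation they must also match."""
    def key(t):
        s_span, s_type, o_span, o_type, rel = t
        return (s_span, s_type, o_span, o_type, rel) if strict \
            else (s_span, o_span, rel)
    p, g = {key(t) for t in pred}, {key(t) for t in gold}
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) \
        if precision + recall else 0.0
```

A prediction with correct spans and relation but a wrong entity type thus counts as correct under boundaries evaluation and as an error under strict evaluation.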
We start by mapping entities in Wikipedia to BabelNet (Navigli and Ponzetto, 2012; Navigli et al., 2021) synsets by exploiting the one-to-one linkage between them. Then, we annotate synsets by applying the knowledge-based semantic classifier introduced by Tedeschi et al. (2021b), which exploits the relational information in BabelNet, such as hypernymy and hyponymy relations. This procedure yields ∼7.5M entities labeled with an entity type. However, since the annotations are automatically derived and prone to errors, we devise a new strategy to improve their quality. Specifically, we design a Transformer-based classifier that takes a synset and returns its NER category. More formally, let us define the functions L(s) and D(s) that output the main lemma and the textual description of a synset s, respectively. Then, given a synset s, we build an input string I(s, D, L) from its main lemma L(s) and its description D(s), and provide it to the classifier, which predicts a label e ∈ E. The tagset E is obtained by refining the categorization of named entities introduced by Tedeschi et al. (2021a), based on the ability of automated systems to distinguish NER categories and on the frequency of these categories in Wikipedia articles (Tedeschi and Navigli, 2022). To train the classifier, we construct a dataset by selecting a high-quality subset of the 7.5M automatically-produced annotations, taking only synsets with a maximum distance of 1 from one of the 40k synsets in WordNet (Miller, 1995), the latter being a manually-curated subset of BabelNet. By doing this, we obtain a set of 1.2M high-quality annotations that we split into 80% for training and 20% for validation, and convert to the above-specified I(s, D, L) format.
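The input fed to the typing classifier can be sketched as follows; the separator tokens and the example tagset are illustrative assumptions, not the paper's exact template:

```python
# Illustrative subset of a NER tagset; the actual tagset E refines the
# categorization of Tedeschi et al. (2021a) and is not reproduced here.
TAGSET = {"PERSON", "LOCATION", "ORGANIZATION", "EVENT", "MEDIA"}

def typing_input(lemma, description):
    """Build the classifier input I(s, D, L) from a synset's main lemma L(s)
    and textual description D(s). The separator tokens are hypothetical."""
    return f"<lemma> {lemma} <desc> {description}"
```

The classifier itself is a standard sequence classifier over such strings, predicting one label from the tagset per synset.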
Finally, we use the trained classifier to confirm or replace the previous 7.5M annotations, resulting in 6.2M (82.4%) confirmations and 1.3M (17.6%) changes, and employ it to label new Wikidata instances as well, thus obtaining a final mapping consisting of ∼13M Wikidata entries annotated with their entity types. By manually inspecting a sample of 100 changes, we observed that our NER classifier was right 68% of the time and wrong 17% of the time, while in the remaining 15% of cases both annotations were wrong, providing an improvement of 51% over the 1.3M changes. We highlight that 68% is not the overall accuracy of our classifier, as it is computed on 100 items where there is a disagreement between the original annotations produced by WikiNEuRal (Tedeschi et al., 2021b), the current state of the art in entity typing, and the annotations produced by our model. Indeed, an accuracy of 68% on this subset means that our classifier corrected most of the instances that were previously mistaken by WikiNEuRal. For completeness, we report that when the two systems agree, i.e. 82% of the time, they are correct in 98% of cases, as measured on another subset of 100 instances.

SREDFM
The current datasets for Relation Extraction often lack complete coverage of relations. The SMiLER dataset (Seganti et al., 2021) only annotates one triplet per example, resulting in a limited understanding of the relationships therein. For instance, in the example "Fredrik Hermansson (born 18 July 1976) is a Swedish musician. He was a keyboardist and backing vocalist in the Swedish progressive rock band Pain of Salvation.", the triplet (Fredrik Hermansson, has-genre, progressive rock) is annotated, but other triplets such as (Fredrik Hermansson, has-occupation, musician) and (Fredrik Hermansson, has-nationality, Swedish) are also valid. Another common issue with RE datasets is the high class imbalance, particularly for distantly annotated datasets. This is often due to the skewed distributions that are intrinsic to knowledge bases such as Wikidata, which are often used to construct RE resources. This leads to low coverage for lower-frequency classes, as seen in Figure 2 for certain languages in the SMiLER dataset. In DocRED (Yao et al., 2019), another distantly annotated dataset, location-based relations constitute over 50% of instances.
Fully human-annotated datasets that overcome these limitations are scarce and often not widely accessible. Additionally, they often cover a narrow set of languages and relations (Table 3). To address these issues, we introduce Silver REDFM (SREDFM), a large, multilingual RE dataset that contains more than 45M triplets and covers 400 relation types and 18 languages. It is created using the data extraction procedure described in Section 3.1 and the Triplet Critic introduced in Section 3.3. SREDFM overcomes some of the shortcomings of current datasets by providing a higher coverage of annotation and more evenly distributed classes.
Using the same example sentence, SREDFM annotates multiple triplets, including those discussed above. Additionally, we provide a pipeline that enables the automatic creation of an RE dataset in any language. So, even though we release the SREDFM dataset as described in this paper (i.e. covering 18 different languages), we encourage its expansion to other languages by using our pipeline, which we make publicly available.
In summary, SREDFM is a large, multilingual dataset that addresses the shortcomings of current datasets by providing a higher coverage of annotation and more evenly distributed classes. REDFM, instead, is the result of the manual annotation (Section 3.2), to which we add entity types. We split both into training, validation and test sets, with no overlapping Wikipedia pages across splits. Details can be found in Table 9 in Appendix E.
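The page-disjoint split can be sketched as follows; the split ratios are illustrative, while the actual per-split counts are given in Table 9:

```python
import random

def split_by_page(instances, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Split instances into train/validation/test so that no Wikipedia page
    appears in more than one split. Each instance is a (page_id, item) pair.
    The ratios here are illustrative, not the paper's actual proportions."""
    pages = sorted({page for page, _ in instances})
    random.Random(seed).shuffle(pages)
    n = len(pages)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    assign = {page: ("train" if i < cut1 else
                     "validation" if i < cut2 else "test")
              for i, page in enumerate(pages)}
    splits = {"train": [], "validation": [], "test": []}
    for page, item in instances:
        splits[assign[page]].append((page, item))
    return splits
```

Splitting by page rather than by instance guarantees that sentences from the same Wikipedia article never leak across splits.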

mREBEL
In this section, we present our system, mREBEL (Multilingual Relation Extraction By End-to-end Language generation), a multilingual relation extraction model pre-trained on SREDFM. It is a multilingual extension of the REBEL model introduced by Huguet Cabot and Navigli (2021), which uses a seq2seq architecture to convert relations into text sequences that can be decoded by the model. We convert triplets into text sequences and pre-train our model using mBART-50 (Tang et al., 2021). To support multiple languages, we prepend the input text with a special language token (e.g. en_XX). Additionally, we include relation classification (RC) in the pre-training phase of mREBEL. Specifically, for 5% of the training data, we select a random triplet, mark the subject and object entities in the input text, and use a special token <relation> to indicate to the model that only one triplet needs to be decoded. Finally, to promote cross-lingual transfer, we use the English names of the relation types when decoding the triplets. For the typed variants we also train untyped versions, mREBEL 400 and mREBEL 32.
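The RC input construction can be sketched as follows. The language token and the <relation> token are as described above; the subject/object marker tokens are illustrative placeholders, since their exact surface form is not specified here:

```python
def rc_input(text, subject, obj, lang_token="en_XX"):
    """Build a relation-classification input for mREBEL-style pre-training:
    prepend a language token, mark the subject and object mentions, and
    append <relation> to signal that a single triplet should be decoded.
    The <subj>/<obj> markers are hypothetical placeholders."""
    marked = text.replace(subject, f"<subj> {subject} </subj>", 1)
    marked = marked.replace(obj, f"<obj> {obj} </obj>", 1)
    return f"{lang_token} {marked} <relation>"
```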

Experimental Setup
We evaluate mREBEL and its variants on our own datasets, i.e. REDFM and SREDFM, and on SMiLER (Seganti et al., 2021). Unless stated otherwise, we train on the training sets of all languages simultaneously and apply early stopping based on the Micro-F1 obtained on the overall validation set. We use the Adafactor optimizer and the same Cross-Entropy loss with teacher forcing as Huguet Cabot and Navigli (2021). The full list of hyperparameters is detailed in Appendix D.

Multilingual Relation Extraction
We report the Micro-F1 score per language for both the SMiLER and REDFM test sets. When evaluating on SMiLER, we use mREBEL 400 variants as starting checkpoints and fine-tune them on the SMiLER training sets. For REDFM, instead, we include the mREBEL 32 model in our experiments, as it was trained on the same set of relations. The inclusion of this model lets us analyze the impact of further fine-tuning on REDFM gold data against the quality of our silver annotation process. As an extrinsic evaluation of our Triplet Critic model from Section 3.3, we train a version of mREBEL T 32 without filtering triplets.

Multilingual Relation Classification Even though Seganti et al. (2021) introduced SMiLER as an RE dataset, each sentence contains just one annotated triplet and includes the "no relation" class as part of its annotation scheme. Therefore, it is better approached as an RC task, making it more akin to a dataset like TACRED. For instance, Chen et al. (2022) use it as an RC dataset with an array of prompt-based approaches, and we compare our approach with theirs for RC.

Results
Multilingual Relation Extraction First, in Table 5 we show how our system performs compared to HERBERTa, the system proposed by Seganti et al. (2021) for SMiLER, using their best-performing setup for each language. We consider this dataset better suited for RC; however, as it originally reports on RE, we demonstrate how our system can perform better when pretrained on SREDFM. In particular, mREBEL T 400 provides an improvement of about 15 Micro-F1 points over HERBERTa. Additionally, as SMiLER does not include entity types, we observe that mREBEL 400 performs marginally better than mREBEL T 400. Table 6, instead, shows the results on REDFM, compared against an mBART baseline. Specifically, we analyze model performance when fine-tuning is, or is not, performed on the training set of REDFM. While performance varies across languages, the best overall Micro-F1 (52.0) is obtained when training on SREDFM with mREBEL T 32, without further fine-tuning. This confirms that our silver annotation procedure produces high-quality data, as there is no need for further tuning with REDFM, which achieved 51.6. We also see how filtering by the Triplet Critic was crucial: when removed, performance dropped by more than 3 points.
Training on 400 relation types does lead to lower results, since there is a mismatch between the two stages of training. However, mREBEL T 400 showed decent performance on SREDFM, as shown in Table 7. This provides the first RE system to competitively extract up to 400 relation types in multiple languages. See Appendix C for more results.
Multilingual Relation Classification From the bottom half of Table 5, we can observe how our mREBEL models consistently outperform competitive baselines, i.e. mT5 BASE * and mT5 BASE (en), by a large margin on all tested languages.

Error Analysis
We performed an error analysis with mREBEL T 32 to understand the sources of error when training on REDFM. Our study revealed that 27.8% of errors in the test set can be attributed to specific reasons.
First, there were discrepancies between predicted entity types and annotations (7.2%). These errors may have arisen from the automatic nature of the typing annotation, errors by the system, or ambiguity in some cases, such as fictional characters, which can be considered either PERSON or MEDIA. Additionally, a portion of errors (8.1%) resulted from mismatches between the predicted and annotated spans for each entity, which may also be ambiguous (see the Span overlap example in Table 8). Another 10.6% of errors were caused by either the subject or object entity being completely misaligned with the annotation. We identify some of these as co-reference errors, such as the Subject example in Table 8. Evaluation for RE systems often ignores other mentions of an entity. We believe co-reference resolution has not been properly explored within RE evaluation, and this may open interesting opportunities for future work.
Finally, it is worth noting that only 1.8% of errors were due to the wrong relation type being predicted between entities. We consider this to be a strong indicator of the quality of annotated relations between entities. However, we also observed that 72.2% of the errors were caused by incorrect predictions or missing annotations, highlighting the main shortcoming of our annotation procedure. Our approach is based on annotated hyperlinks in Wikipedia and relations in Wikidata, which can result in recall issues where entities in the text are not identified as hyperlinks or relational facts are not present in Wikidata.

Conclusions
In this paper, we have addressed some of the key issues facing current multilingual relation extraction datasets by presenting two new resources: SREDFM and REDFM. SREDFM is an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, and more than 40 million triplet instances, while REDFM is a smaller, human-revised dataset for seven languages. We improved the quality of the entity type annotations in these datasets by using a Transformer-based NER classifier. We also introduced the Triplet Critic, a cross-encoder that is trained on annotated data to predict whether a given context entails a triplet. We demonstrated the utility of these new resources by training new, capable multilingual relation extraction models and evaluating them using our supervised data. We also presented mREBEL, the first multilingual end-to-end relation extraction system that extracts triplets, including entity types. Our work thus contributes to the development of better multilingual relation extraction systems and provides valuable resources for future research.
Limitations

There are several limitations to the work presented in this paper that need to be acknowledged.
First, the SREDFM and REDFM datasets are based on Wikipedia and Wikidata, which means they may not cover all possible relation types or entities. In addition, the quality of the annotations in these datasets may be influenced by the biases and limitations of these sources.
Second, the Triplet Critic is trained on a small subset of the SREDFM dataset, which may limit its ability to generalize to other relation types or languages. Additionally, the performance of the Triplet Critic may be affected by the quality of the annotations used to train it.
Third, the authors of this work are native speakers of some of the languages tackled in this work, and external native speakers created the annotation guidelines. However, for some of the automatically annotated languages, no native speakers were involved. Additionally, the qualitative error analysis does not include Arabic or Chinese examples, as none of the authors of the paper is proficient in those languages.
Finally, the mREBEL system is based on a Transformer architecture, which may not be optimal for all relation extraction tasks. It is possible that other types of models, such as graph neural networks or rule-based systems, could outperform mREBEL on certain relation types or languages.
Overall, the results presented in this paper should be interpreted in the context of these limitations. Further research is needed to address these limitations and to improve the performance of multilingual relation extraction systems.

Ethics Statement
In this work, we present two new relation extraction datasets, REDFM and SREDFM, which are created using distant supervision techniques together with human annotation to filter out false positives. We believe that our datasets will help advance the field of relation extraction by providing a high-quality multilingual resource for researchers and practitioners.
We take the ethical considerations of our work seriously. The annotation of the REDFM dataset is based on existing triplets in Wikidata, which may not always reflect the true relation between entities in a given text. Moreover, the use of human annotation ensures a higher level of accuracy in our dataset, but it also raises ethical considerations. We recognize that human annotation may contain errors or biases. Therefore, we encourage researchers to use our dataset with caution and to perform thorough evaluations of their methods. Additionally, we are transparent about our annotation costs and payment to human annotators.
In conclusion, we believe that our dataset and the research it enables will contribute positively to the field of relation extraction, but we also acknowledge that there are ethical considerations that need to be taken into account when using it.

A Data Extraction Pipeline

We build upon the cRocoDiLe extraction pipeline (Section 3.1); however, its dates and numbers linker was English-specific. We use regexes to extract dates and values in all the languages this work covers.
In the original work, triplets were filtered using NLI. For each triplet, the text containing both entities from the Wikipedia abstract was given as input, together with the triplet in its surface form, subject + relation + object, separated by the <sep> token. If the score for the entailment class was below 0.75, the triplet was removed, ensuring higher precision. In our work, we set a lower threshold, 0.1, since we further filter triplets using manual annotation or our Critic model.

B Annotation
We employ Amazon Mechanical Turk for annotation purposes. Each annotator was paid $0.10 for every ten instances annotated, constituting one HIT, for an average hourly rate of $10. We restrict annotators to countries where each of the languages is spoken, plus the USA. We manually screen annotators in each language separately by having them annotate a small sample of fewer than 10 HITs, allowing only those who correctly performed the task to annotate the final corpus.
Annotators were presented with descriptions of each relation, which they could check at any time by hovering over the label or opening the instructions. The English descriptions are:
• located in the administrative territorial entity: the item is located on the territory of the following administrative entity
• country: sovereign state of this item (not to be used for human beings)
• instance of: that class of which this subject is a particular example and member
• shares border with: countries or administrative subdivisions, of equal level, that this item borders, either by land or water
• part of: object of which the subject is a part
• capital: seat of government of a country, province, state or other type of administrative territorial entity
• follows: immediately prior item in a series of which the subject is a part
• headquarters location: city where an organization's headquarters is or has been situated
• located in or next to body of water: sea, lake, river or stream
• sport: sport that the subject participates or participated in or is associated with
• participant: person, group of people or organization (object) that actively takes/took part in an event or process (subject)

Figure 3 shows the annotation interface provided to the annotators.

C Results
In this section, we provide further results for our mREBEL model. Specifically, in Figure 4 we provide a heatmap showing the scores attained by mREBEL T 32 (without fine-tuning) on each of the 32 relations covered by REDFM, for each of its 7 languages. Similarly, in Figure 5, we report the scores obtained by the fine-tuned version of mREBEL T 32. By looking at these two heatmaps, it is easy to identify our model's strengths and weaknesses across relations and languages. We can see how relations such as named after or shares border with had low scores, probably due to their lower frequency at evaluation time, where a few errors lead to a low score. On the other hand, domain-specific relations such as cast member, league or author show strong performance in most languages.

D Reproducibility
Experiments were performed on a machine with a single NVIDIA RTX 3090 GPU, 64GB of RAM, and an Intel® Core™ i9-10900KF CPU.
The hyperparameters were manually tuned on the validation sets for each dataset, but mostly left at default values for mBART. The ones used for the final results can be found in Table 10.

E Data
In Table 9, we provide data statistics for our REDFM dataset. Specifically, for each of the 7 languages, we report the number of instances for each relation in the corresponding training, validation and test sets.