Multilingual Entity and Relation Extraction Dataset and Model

We present a novel dataset and model for a multilingual setting to approach the task of Joint Entity and Relation Extraction. The SMiLER dataset consists of 1.1 M annotated sentences, representing 36 relations, and 14 languages. To the best of our knowledge, this is currently both the largest and the most comprehensive dataset of this type. We introduce HERBERTa, a pipeline that combines two independent BERT models: one for sequence classification, and the other for entity tagging. The model achieves micro F1 81.49 for English on this dataset, which is close to the current SOTA on CoNLL, SpERT.


Introduction
The majority of the -constantly growing -amount of openly accessible knowledge is locked in unstructured text, and hence inefficiently utilized by any systems. The NLP tasks related to this problem include Information Extraction, Relation Extraction, Named Entity Retrieval, as well as Joint Entity and Relation Extraction.
Our contribution is twofold. First, we present SMiLER (Samsung MultiLingual Entity and Relation Extraction dataset): an open-domain corpus of annotated sentences, created for the Joint Entity and Relation Extraction task. With 1.1 M sentences, 36 relation types, and 14 languages, SMiLER seems to be both the largest and the most diversified corpus for the task in existence, to the best of our knowledge. The corpus was semi-automatically created from Wikipedia and DBpedia, and partly checked by linguists.
Our second contribution is HERBERTa -Hybrid Entity and Relation extraction BERT for a multi- * Work done while at Samsung R&D Institute Poland. lingual setting that consists of two independently pretrained BERT models (Devlin et al., 2018). The first one classifies the input sequence as belonging to one of our 36 relations (including no_relation). Its output -the relation -is then fed to the second BERT, together with the same input sequence. The second model performs entity tagging and outputs two spans for the two entities selected from the input sequence. Our model is close to SpERT (Eberts and Ulges, 2019), the current state-of-the-art for Joint Entity and Relation Extraction, achieving micro F 1 81.49 for English on the presented dataset.
The SMiLER corpus and the source code is available at https://github.com/samsungnlp/smiler/.

Related Work
Available Datasets The datasets that are commonly used for the task of Joint Entity and Relation Extraction are still insufficient in size and diversification; furthermore, they are all monolingual English corpora. For instance, CONLL04 (Roth and Yih, 2004) has only 1.7k annotated sentences and 5 relations. ADE (Gurulingappa et al., 2012) is larger, with almost 21k sentences, but it distinguishes only 2 relations (plus no_relation). Yet another such dataset, SciERC (Luan et al., 2018), contains 500 annotated abstracts and 7 relations; however, all of the relations belong to the scientific domain.
We believe that presenting a new, large and diversified dataset will be a valuable contribution to the joint task of Entity and Relation Extractionboth for English and for the multilingual setting.
Information Extraction Information extraction systems collect knowledge that is locked in unstructured text, such as in our Wikipedia articles. A nota-ble example is the Never-Ending Language Learner (NELL), which was reading the Web for almost 10 years, 2010-2019 (Mitchell et al., 2018). This semisupervised system was retrained continuously, with the use of the current knowledge, to collect new instances of a pre-defined set of entity types and relations, which also constitutes the final goal of our system. Another well-known knowledge base is Knowledge Vault (Dong et al., 2014). The system extracts information from the Internet (text, tabular data, page structure, human annotations) and combines it with information from Freebase (Bollacker et al., 2008). A probabilistic inference system computes calibrated probabilities of fact correctness.
Instead of utilizing pre-defined entity and relation types, some systems use syntactic analysis: ReVerb (Fader et al., 2011), MinIE (Gashteovski et al., 2017). Another approach is to create a taxonomy with just one relation: isA, as in Probase (Wu et al., 2012).
Relation Extraction Current models (HEBERTa included) for Relation Extraction are based on BERT (Devlin et al., 2018), a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017). Its main contribution is pre-training deep bidirectional representations from unlabeled text by joint conditioning on both left and right context, in all the layers.
The current state-of-the-art in Relation Extraction are REDN (on SemEval-2010 Task 8, NYT, and WebNLG) (Li and Tian, 2020) and Matching the Blanks (on FewRel) (Soares et al., 2019). REDN extracts the token embeddings (BERT) from two different layers to represent the head and tail entities separately, in order to enhance learning reversible relations. Next, it calculates a parameterized asymmetric kernel inner product matrix between all the head and tail embeddings of each token in a sequence. On the other hand, the model constructed by Soares et al. (2019) combines BERT with extensions of Harris' distributional hypothesis to relations, to build task-agnostic relation representations solely from entity-linked text.
Joint Entity and Relation Extraction A more robust approach to Relation Extraction is to combine it with Entity Extraction. Three such joint solutions have achieved the current state-of-theart: SciBERT (on SciERC) (Beltagy et al., 2019), SpERT (for RE on CONLL04) (Eberts and Ulges, 2019), and End2End Joint NER & RE (for NER on CONLL04) (Giorgi et al., 2019).
SciBERT leverages unsupervised pretraining on a multi-domain corpus of scientific publications, as this domain differs significantly from the general domain used by the original BERT. The second model, SpERT, is an attention model for Span-Based Joint Entity and Relation Extraction. It is trained using within-sentence negative samples, which are extracted in a single BERT pass. Finally, End2End Joint NER & RE combines BERT with biaffine attention.
Distant Supervision Distant supervision proposed by Mintz et al. (2009) for Relation Extraction task is a training paradigm, based on the assumption that if there is a relation between entities, then every sentence containing them may also express that relation. In the experiments, multiple sentences containing the entities were used to create their feature vectors. Thus, this approach allows a prediction of the relation between two entities, but does not identify sentences containing useful and correct cues for this relation in the corpus.
Data collection for SMiLER was based on the same assumption, but thanks to the subset of manually validated examples, the dataset could be used as the sole source of supervision during training, where the algorithm was supervised by the database.
Multilingual BERT M-BERT (Pires et al., 2019) is a model that is particularly relevant to our multilingual approach, and one that we tested repeatedly. This is a single language model, pre-trained from monolingual corpora in 104 languages, which performs zero-shot cross-lingual model transfer. Task-specific annotations in one language are used to fine-tune the model for evaluation in another language. Such transfer is possible even to languages in different scripts, such as English and Korean.
M-BERT has become the basis for building other multilingual models, as well as monolingual ones for languages other than English. We have experimented with three such models: German BERT, Italian BERT, and KoBERT for Korean (see section 4 for details).

Data
Our SMiLER corpus consists of 1.1 M annotated Wikipedia sentences, and was created specifically for the task of Entity and Relation Extraction. It  consists of annotated sentences in 14 languages, with each language representing a subset of 36 relations. As far as we know, this is both the biggest and the most comprehensive corpus for these NLP tasks. Table 1 provides the number of sentences and relations for each language, while Table 2 shows a few examples from the corpus. The relations belong to 9 rough domains: Person (e.g. has-child, has-occupation), Organization (e.g. org-has-member, headquarters), Location (e.g. loc-leader, has-tourist-attraction), Animal (e.g. haslifespan, eats), Art (e.g. starring, has-author), Device (e.g. invented-by), Event (event-year), Measurement (e.g. has-length), and no_relation. The no_relation sentences do not contain any of the 35 "positive" relations. Figure 1 shows the number of sentences for each relation, for each language.

Dataset Building
To collect the English dataset, we queried DBpedia for (entity 1 , entity 2 , relation) triples. The articles about entity 1 that would also contain entity 2 were obtained from a Wikipedia dump. They were parsed and automatically selected. The complete English dataset is referred to as EN-full in the following discussion.
During this process it is possible that the same sentence will appear multiple times containing: 1. different entity annotations for the same relation, or 2. different entity annotations for another relation. The first case is unlikely, because sentences are taken from the Wikipedia article of the "main" entity. The second case is possible, though.
Theoretically, it is possible to write grammatical rules in order to do a "cross-match" for augmenting a number of annotated sentences, and it has been done previously (e.g. Wu et al. (2012)). The reason why we did not use grammatical rules is that we wanted to scale the approach to multiple (and diverse) languages and the rules that could help with this task are not universal.
A part of EN-full was manually validated by linguists. The linguistic validation was more in-depth for English than for other languages, as for English the task was to correct the entity annotation whenever possible. Hence, passable annotations were corrected for English, while they were simply assessed as correct for the other languages. Overall, the linguists assessed 58.2% of the sample as fully correct, corrected the annotations in further 18.4%, and assessed the remaining 23.4% as incorrect. The correct 58.2% plus the corrected 18.4% of the sample together form the EN-small dataset. The corpus-building process was adapted for the other languages. First, the entities (as well as the entire no_relation sentences) were translated from English. The translation was carried out by an in-house multilingual system similar to (Przybysz et al., 2017;Williams et al., 2018;Wetesko et al., 2019). After translating the entities, the relevant article was copied from the Wikipedia dump in the other language. Finally, samples from the corpus were checked by linguists. Figure 2 illustrates the process.
The no_relation sentences in English have been selected by linguists from Wikipedia articles. The sentences contained entities that were in a relation in another sentence in the data set but were not in a relation in the selected sentence. Since the process is so strict, the translation to other languages was done automatically. All no_relation sentences needed to be hand-checked, because they could contain some hidden relations, and for this reason preparing original sentences for all the languages would take too much time. This is not the case for the 'positive' relations; e.g. if 'Bill' is the father of 'Chelsea', then a sentence containing both 'Bill' and 'Chelsea' -and coming from a Wikipedia article on Bill Clinton -would probably also express the relation has-child. However, if a sentence contains 'Bill', but not 'Chelsea', it could still express another of the "positive" relations.
We also tried other automated approaches for finding no_relation sentences. For example, given a Wikipedia article with a relation found, we could find other sentences with the same entities and assume all of them are no_relation sentences. Unfortunately this approach caused too much noise and was not used in the end.
The linguists verified a random sample of 50 sentences for 16 largest relations + no_relation for each of the following languages: IT, FR, DE, PT, ES, KO. The results are shown in Table 3. The overall percentage of correctness reached 79-80%, which is very high, considering the level of automation involved in the process. The selected random sample was small, because the verification needed to be cost-and time-effective. Still, the results were thoroughly checked and they were largely consistent. IT   The most common errors found by the linguists belong to three groups: (1) unexpected DBpedia query results, (2) wrong sentence parsing (deleting the second half of the sentence or including HTML), (3) selecting random sentences, which either do not express the relation, or have the entities marked in wrong places. Other typical errors are related to missing translations of some English words and no_relation sentences that contain one of the 35 positive relations. Table 4 shows one example of each error.

Datasets for the Model
For each language in our multilingual dataset, we automatically extracted 2% of the sentences to create a dev set and another 2% for a test set. This corpus split maintained the distribution of the rela-  tions in the original data, whenever possible. For the languages where the label distribution was impossible to preserve, the number of relations in the test set may be smaller than in the train set. 1 For English, we used a single test set for both EN-mid and EN-small. The only difference was in the number of labels: EN-mid (both train and dev) had 36 labels, while EN-small (albo both train and dev) -32 (see Table 1).
For training the models, we created several combinations of languages, in order to examine the effect of adding/removing languages from the training set. We experimented with the following combinations: Pires et al. (2019) showed that the language structure had an impact on the performance of the model. Therefore, we treat Korean, Farsi and Arabic as a special case because they are non-SVO languages (Korean and Farsi are SOV, while Arabic is VSO

Model
Our architecture HERBERTa (Figure 3) uses two pre-trained BERT models. It solves the problem of Entity and Relation Extraction with the use of an unconventional pipeline. In the first step it is trained for Relation Extraction, while the entities are retrieved in the second stage.
The first model is BERT, fine-tuned for the Sequence Classification task. Its input is a tokenized sequence, while its output consists of a sequence output and a pooled output that represents the overall sentence context (from the [CLS] token). The latter is passed to a softmax for classifying the relation.
The second model is our implementation of BERT for Entity Tagging. It is based on BERT for Question Answering, and its inputs are: • the same tokenized sequence as the one used for Relation Extraction, • the relation outputted in the first step -encoded as a BERT [unused] token. At inference time, the model will return the start and end of the two entities separately and this is used to mark the entities. Using our model, it is also possible to select N best predictions (one prediction = one entity pair) for the same relation. This has not been used for the result of the paper because we wanted to find a single best entity pair for each sentence. We used the following types of BERT, depending on the language(s): • bert-base-cased (Devlin et al., 2018), • bert-base-multilingual-cased (Pires et al., 2019), • bert-base-german-cased from the Hugging-Face Transformer library (Wolf et al., 2019), • bert-base-korean-cased -the monologg/kobert model from HuggingFace, • bert-base-italian-cased -the dbmdz/bert-baseitalian-cased model from HuggingFace.
Below we call our model LANG(B) if it is a finetuning of bert-base, e.g. DE(B) is the bert-basegerman model fine-tuned on the German corpus. We call our model just LANG if it is a fine-tuning of bert-base-multilingual, e.g. DE-EN is the English and German model trained using bert-basemultilingual. The English dataset has two versions: EN (EN-mid) and EN_S (EN-small). We did not use EN-full for the training.

Training Procedure
The two models presented in Section 4 are trained independently, each one with a different loss function. The first model uses standard cross-entropy as its loss for relation classification. The loss function for the second model is: Where L j,start is the cross-entropy loss for the prediction of j-th entity first token index and L j,end is the cross-entropy of the predictions of its last token index.
The result of the first model becomes the input to the second one, during inference. We conclude that this works well because the performance of the relation classification model is very high (please refer to Table 5 and the Relation column).
The number of epochs is established using an early stopping mechanism. On average, each model (Relation Classifier and Entity Tagger) is trained for around 6 epochs for models with EN-mid set and 3-4 epochs for others. We fine-tune the whole BERT.

Results
For evaluating the models, we use the micro F 1 score on the entities and relations together, ignoring no_relation. Below, this measure is referred to as Combined. When comparing with the state-ofthe-art, we also show the relation-only micro F 1 , as well as the entity-only micro F 1 (for a single entity and for the pair of entities). We use such metrics because they are typically calculated for these tasks; see for instance Eberts and Ulges (2019).

Comparison with the State-of-the-Art
Given that our dataset is new, there is no state-ofthe-art for it. Nevertheless, we decided to compare the results of HERBERTa with SpERT (Eberts and Ulges, 2019), the currently best model on CoNLL (Roth and Yih, 2004). Table 5 presents the results that we obtained by training SpERT on our Wikipedia EN-mid (denoted as EN(SpERT)) and ENsmall (denoted as EN_S(SpERT)). Additionally, Table 5 also shows the results of our model trained on the same datasets. In terms of combined F 1 , HERBERTa is close, the difference being about 0.2 percentage point with SpERT, the current SOTA on CONLL.

Language and Model Comparison
Single Language As the first step, we trained HERBERTa for each language separately. The results (F 1 measure) are presented in Figure 4 (left). Our models trained with the use of multilingual BERT are named after language, e.g. IT, FR, PT. Our non-multilingual BERT models are called LANG plus (B), e.g. IT(B), DE(B). We obtained the best result for EN(B). BERT trained on EN-small (EN_S(B)) achieved lower results, 81 vs. 58, because EN-small contains over 7 times fewer sentences. If we compare this with the other languages, the relationship between F 1 and the logarithm of the number of sentences holds in general. That is, the higher number of sentences, the higher F 1 (see Figure 5). EN-small (35k sentences, F 1 =58) falls between NL (40k sentences, F 1 =60) and PL (17k sentences, F 1 =50). RU (7k) and UK (1k, the smallest set) achieve F 1 =29 and 10 respectively. Surprisingly, FA (3k) and SV (5k) -despite small sets -yield F 1 =65 and 58 respectively; as a result, they are the outliers in Figure 5. AR (9k) achieves a higher F 1 score than ES (12k); however, the difference is just 2 percentage points. FA, AR (and KO) have different word orders than all the other languages, which might be one of the reasons why they achieve high F 1 despite small datasets. Another reason might be their small number of relations. The high result for SV, which is SVO, could be explained by a smaller number of relations in the test set than in the train set. UK and RU, in spite of their small number of relations, obtain the lowest results, because they have small datasets. Furthermore, contrary to SV, UK and RU have rich inflection which might also be the case.
Comparing the results of our models trained with multilingual BERT versus non-multilingual BERT, we observe that all three possibilities are present in the pairs we checked. For DE we get the same results F 1 =58, for IT the non-multilingual BERT gives lower results 64 vs. 66, while for KO the nonmultilingual BERT achieves significantly higher results, 51 vs. 42.

Multilingual
In the second step, we added models trained with the multilingual BERT for several languages simultaneously, e.g. ALL-EN for all the languages available in the corpus (see Figure 4 (left)). There are four main groups of models: (A) a model with EN-mid and all other languages, (B) models with EN-small and at least 6 languages, (C) models with EN-small and one other language, (D) other models with about 3-4 similar languages.
Comparing the models for single languages with group A and B models, we observe that the multilingual models typically yield similar or better results. The largest increase was observed for UK: F 1 =10 on its own and even 26 in ALL-EN_S. We tried to group similar languages RU, PL, and UK to boost UK. However, we obtained just F 1 =20 for UK, which is still significantly higher compared to the UK single language model, but lower compared to UK-EN and UK-EN_S.
One exception is EN-mid, which works significantly better in the single model (81 vs. 77); for EN-small, the difference is less pronounced (58 vs. 54-57). Another exception is KO, for which the single KO(B) model obtains 51 vs. multilingual 43-47. However, if we compare the multilingual KO, F 1 =42, then the multilanguage models increase the results. Finally, the model that groups IT, FR, PT, and ES achieves similar results to ALL-EN_S.
Relations Figure 4 (right) shows F 1 for relations. The results differ widely between the languages and the relations. For instance, for EN some relations achieve F 1 =100 (e.g. has-lifespan, eats), while one relation gets just F 1 =22 (from-country). However, the same relation achieves F 1 =73 for PL and even 0 for SV. We conclude that the results depend on the number of training examples for each relation.
Entity 1 vs Entity 2 Figure 6 demonstrates a significant difference in results between the two entities in the sentence. We obtain far higher results for entity 1 (left) than for entity 2 (right). This is because entity 1 usually occurs at the beginning of the sentence, while the position of entity 2 is not so deterministic.  It is important to note though that typically sentences begin with their linguistic subject. As a result, for the sentence 'Bill has a child called Chelsea.', the relation has-child is far more plausible than has-parent, even though both are logically correct. For this reason, the fact that entity 1 has a more deterministic behavior is not a problem.

Conclusions
We have described our approach to solving the task of multilingual Joint Entity and Relation Extraction, by training our novel, HERBERTa model on SMiLER -our large, comprehensive dataset. The model combines two independent BERT models: one for Sequence Classification, and the other for Entity Tagging. Regarding F1 measure our model is close to SpERT (the current state-of-the-art for CONLL04), the difference being 0.2 percentage point. What is more, the SMiLER dataset appears to be currently the largest (1.1 M annotated sentences) dataset for this task, as well as the most comprehensive one (14 languages, 36 relation types).
We observe that our multilingual models achieve higher or similar results, compared to the models trained for each language separately. Languages with less data can benefit the most from such multilingual models. The victim here is English, as it seems to be a giver of F1 to other languages, especially when the number of sentences for this language is significantly higher than for other languages. On the one hand, due to the large amount of data for English, a model is well trained for patterns existing in this language and thus other languages with less data can benefit from it (because of some similarities between languages). Therefore, we observe increased results for less resourced languages. On the other hand, a model tries to accommodate the nuances of the less resourced languages. Their features are noticeable for the model, which reduces the dominant role of English.
As we can see from the results, each non-English language follows a slightly different (error) path. This does not seem to be a general rule that can be applied. We would like to point out though that this is exactly the reason why we have created this dataset in the first place. We wanted to observe the performance of models on different languages on the relation extraction task and now, thanks to our dataset, this is possible. The fact that there is no simple explanation for the difference in model performance shows that deeper analysis for each language is necessary.
Another observation is that we obtain significantly higher F 1 for entity 1 than for entity 2 , which suggests that entity 1 is simpler. This seems to be true, because entity 1 typically occurs at the beginning of the sentence, while entity 2 does not have any consistent location.
In the future, we plan to train our model on the EN-full dataset and to predict multiple plausible entity pairs for the same sentence. We would also like to extend the dataset to include entity types and sentences containing multiple relations.
Another promising direction is data augmentation by "cross-matching" entities and relations in the dataset with sentences in the dataset. This crossmatch could search for two cases.
1. Sentences that contain multiple relations between the same 2 entities.
2. Sentences that contain more than 2 entities (e.g. 3), with different relations between them.
In both cases, the sentence could be added to the dataset multiple times.