DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction

Distant supervision (DS) is a well-established technique for creating large-scale datasets for relation extraction (RE) without human annotations. However, research in DS-RE has been mostly limited to the English language. Constraining RE to a single language inhibits the utilization of large amounts of data in other languages that could allow the extraction of more diverse facts. Very recently, a dataset for multilingual DS-RE was released. However, our analysis reveals that it exhibits unrealistic characteristics, namely 1) a lack of sentences that do not express any relation, and 2) all sentences for a given entity pair expressing exactly one relation. We show that these characteristics lead to a gross overestimation of model performance. In response, we propose a new dataset, DiS-ReX, which alleviates these issues. Our dataset has more than 1.5 million sentences spanning 4 languages, with 36 relation classes plus a no-relation (NA) class. We also modify the widely used bag attention models by encoding sentences with mBERT and provide the first benchmark results on multilingual DS-RE. Unlike the competing dataset, ours is challenging and leaves ample room for future research in this field.


Introduction
Relation extraction (RE) is an important subtask of information extraction. The goal is to identify the relation R between a pair of entities (e1, e2) given context C, where C is some text mentioning the two entities. Creating RE datasets using human annotation can be cumbersome, as a result of which most fully supervised datasets are small. Mintz et al. (2009) proposed creating relation extraction datasets using distant supervision (DS-RE). DS-RE is a bag-level classification task: the bag of an entity pair (e1, e2) is defined as the set of all sentences in the dataset that mention both e1 and e2. If e1 and e2 have a relation r according to a knowledge base (KB), then the entire bag of (e1, e2) is associated with the label r.
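As a concrete illustration of the bag construction described above, the grouping can be sketched as follows. The sentence records and the KB entry here are invented for illustration only:

```python
from collections import defaultdict

# Hypothetical sentence records: (head entity, tail entity, sentence text).
sentences = [
    ("Isaac Newton", "England", "Isaac Newton was born in England."),
    ("Isaac Newton", "England", "Isaac Newton died in Kensington, England."),
    ("Paris", "France", "Paris is the capital of France."),
]

# A bag for (e1, e2) is the set of all sentences mentioning both entities.
bags = defaultdict(list)
for head, tail, text in sentences:
    bags[(head, tail)].append(text)

# Distant supervision: if the KB relates (e1, e2) by r, the whole bag gets label r.
kb = {("Isaac Newton", "England"): "birthPlace"}
labels = {pair: kb.get(pair, "NA") for pair in bags}
```

Note that every sentence in a bag inherits the KB label, whether or not it actually expresses the relation; this is the source of label noise that bag-level models must cope with.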
Research in DS-RE has been mostly limited to the English language due to the unavailability of large multilingual datasets. Since the same facts about the world can be expressed in different languages, multilingual training of RE models can have several benefits. Having a single model for multiple languages, rather than one model per language, 1) allows cross-lingual knowledge transfer, which improves performance across all languages (Zoph et al., 2016; Feng et al., 2020), and 2) is a more efficient way of capturing consistent semantics across languages (Lin et al., 2017).

RELX-Distant (Köksal and Özgür, 2020) is the first multilingual dataset for distantly supervised relation extraction, with sentences in 5 languages. Our analysis reveals some critical flaws in this dataset that make it unsuitable as a reliable DS-RE benchmark:

1. It has no negative samples, i.e., sentences without any possible relationship between a given entity pair.

2. Relation classes are semantically far apart. There exists no entity pair that has more than one possible label in the relation set, even under the distant supervision scheme.

3. The dataset is extremely imbalanced. More than 50% of the bags belong to the country relation type.
These unrealistic characteristics grossly overestimate an RE model's performance. Our preliminary analysis with a simple mBERT-based model achieves an AUC of 0.98 and a Micro F1 of 0.95 on the test set after only 5 epochs of training. These issues make the benchmark unsuitable for spurring further research in multilingual DS-RE.

arXiv:2104.08655v1 [cs.CL] 17 Apr 2021
In response, we contribute a more realistic benchmark dataset for the task, called DiS-ReX. Using our baseline model, we achieve an AUC of 0.8 and a Micro F1 of 73% on the test set. This suggests that, unlike the RELX dataset, our dataset is not trivial to optimize on and has the potential to act as a useful and challenging benchmark for the task. We publicly release our dataset and baseline model.

Previous Works
For multilingual supervised relation extraction, the ACE05 dataset (Walker et al., 2006) is a prominent benchmark. Riedel et al. (2010) introduce the New York Times (NYT) dataset, built using distant supervision, which serves as an important benchmark for English DS-RE. RELX-Distant (Köksal and Özgür, 2020) is the first dataset for multilingual DS-RE, but it suffers from the drawbacks discussed in the previous section. Moreover, the authors did not publish any benchmark numbers on RELX-Distant, instead using distant supervision to pre-train a downstream supervised RE system. Among the first deep neural networks for DS-RE are piecewise CNN (PCNN) based methods (Zeng et al., 2015). Lin et al. (2016) combine PCNN with intra-bag attention, in which a trainable relation embedding attends over all sentences in a bag and generates a bag-level representation used for prediction.
Lin et al. (2017) and Wang et al. (2018) proposed extensions of bag-attention models for bilingual datasets. However, the adaptation of these models to multiple languages has been limited by the lack of multilingual DS-RE datasets. Instead of using a separate sentence encoder for each language, we modify the bag attention model by encoding sentences with a single shared mBERT model. This serves as a baseline benchmark for our dataset.

Dataset creation pipeline
We first harvest a large number of sentences in English, French, Spanish and German from Wikipedia. We then hypothesize relations using distant supervision by aligning sentences with the DBpedia KB (Lehmann et al., 2015), a large-scale multilingual KB extracted from Wikipedia. For dataset creation, we build a general pipeline that can be extended to text corpora other than Wikipedia. The specific steps are as follows:

1. Wikimedia dumps for each language are downloaded and split into sentences. Entities in the sentences are detected using a language-specific NER tagger (Honnibal and Montani, 2017).
2. We use a different DBpedia language edition for sentences from each language. This gives us better and increased coverage of entities that are local to different language-speaking parts of the world.
3. We fuse the KBs of the different language editions by finding the Wikidata ID for each entity. Wikidata IDs are consistent across languages and allow us to establish equivalence between entities such as USA and Estados Unidos de América.
4. Entities detected in sentences are aligned with the fused KB by string matching. We only keep entities for which we obtain an exact match in the KB.
5. For each entity pair in a sentence, we search for a relation between them in the knowledge base. If a relation is found, the instance is labelled with it; otherwise, the label is "NA".
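Steps 4 and 5 of the pipeline can be sketched as follows. The toy KB, entity IDs and the `label_instances` helper are illustrative assumptions, not the released pipeline code:

```python
from itertools import permutations

# Toy fused KB keyed by Wikidata-style IDs (illustrative values).
entity_ids = {"USA": "Q30", "Estados Unidos de América": "Q30", "Washington": "Q61"}
kb_relations = {("Q61", "Q30"): "capital"}  # (head ID, tail ID) -> relation

def label_instances(detected_entities):
    """Align NER mentions to the KB by exact string match, then look up
    each ordered entity pair; pairs without a KB relation are labelled 'NA'."""
    matched = [e for e in detected_entities if e in entity_ids]
    instances = []
    for e1, e2 in permutations(matched, 2):
        rel = kb_relations.get((entity_ids[e1], entity_ids[e2]), "NA")
        instances.append((e1, e2, rel))
    return instances
```

Because Wikidata IDs are shared across language editions, "USA" and "Estados Unidos de América" resolve to the same node and therefore yield the same relation lookups.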
We then select the top 50 positive relation classes based on the number of bags from all languages combined. Relation types that do not have more than 50 bags in each of the 4 languages are discarded, leaving us with 36 positive relation classes.
We then add bags of entity pairs that have no relation between them. We filter bags with the "NA" label to achieve a percentage of instances similar to the NYT dataset (around 70%). To make the dataset more balanced, we limit the number of bags for each relation type in each language to a maximum of 10,000. This helps curb the skew due to relation types such as country and birthPlace. During the filtering process, we ensure that bags of entity pairs common to more than one language are not removed, so that we retain an abundant number of cross-lingual bags. Models can take advantage of such bags to establish representation consistency across languages (Wang et al., 2018).
Key statistics of our dataset are shown in Table 1. The positive relation classes in our dataset are: artist, associatedBand, author, bandMember, birthPlace, capital, city, country, deathPlace, department, director, formerBandMember, headquarter, hometown, influenced, influencedBy, isPartOf, largestCity, leaderName, locatedInArea, location, locationCountry, nationality, predecessor, previousWork, producer, province, recordLabel, region, related, riverMouth, starring, state, subsequentWork, successor, team.

For evaluation, we combine sentences from all languages and create two types of splits: an unseen split and a translation split. In the unseen split, bags in the test set and training set are mutually exclusive. In the translation split, mutual exclusion only holds between train and test bags of the same language, but there can be common bags between the train bags of one language and the test bags of another. Our train, validation and test sets are in the ratio 70:10:20.

The translation and unseen splits measure different capabilities of an extractor. The unseen split measures how well an extractor generalizes to new entity pairs, whereas the translation split measures how well an extractor memorizes and recalls facts learnt through one language when tested on a different language.
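A minimal sketch of the unseen split, assuming bags are keyed by entity pair; the function name, seeding and exact shuffling strategy are assumptions that mirror the 70:10:20 description above:

```python
import random

def unseen_split(bag_keys, ratios=(0.7, 0.1, 0.2), seed=0):
    """Unseen split: partition entity pairs so that train, validation and
    test bags are mutually exclusive (no entity pair crosses splits)."""
    keys = sorted(bag_keys)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return keys[:n_train], keys[n_train:n_train + n_val], keys[n_train + n_val:]
```

The translation split would instead enforce this disjointness only within each language, allowing a bag to appear in the train set of one language and the test set of another.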

BERT + Bag-Attention Baseline
We now describe our baseline multilingual DS-RE model. Let B = {β_1, β_2, ..., β_l} denote a bag of sentences in l different languages with the same entity pair (e_1, e_2) and label r. Here β_i = {x_i^1, x_i^2, ..., x_i^{n_i}} is the set of n_i sentences in language i with entity pair (e_1, e_2) and label r. Using our model, we obtain the probability p(r|B, θ), which measures the likelihood of r being a label for bag B.

BERT Encoder
To obtain a distributed representation of a sentence x, we use mBERT. To encode positional information into the model, we use the Entity Markers scheme introduced by Soares et al. (2019). We add special tokens [E1], [/E1] to mark the start and end of the head entity, and [E2], [/E2] to mark the start and end of the tail entity. The modified sentence is fed into a pretrained mBERT model, and the output representations of the head and tail entity markers are concatenated to obtain the final sentence representation x̃_i^j for each sentence x_i^j in the bag.
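The marker insertion can be sketched as follows. The helper and the token-span interface are illustrative assumptions; in practice the markers are registered as special tokens with mBERT's tokenizer:

```python
def add_entity_markers(tokens, head_span, tail_span):
    """Insert [E1]...[/E1] around the head entity and [E2]...[/E2] around
    the tail entity. Spans are (start, end) token indices, end-exclusive,
    and are assumed not to overlap."""
    inserts = [
        (head_span[0], "[E1]"), (head_span[1], "[/E1]"),
        (tail_span[0], "[E2]"), (tail_span[1], "[/E2]"),
    ]
    out = list(tokens)
    # Insert from right to left so earlier indices stay valid.
    for pos, marker in sorted(inserts, reverse=True):
        out.insert(pos, marker)
    return out

tokens = ["Newton", "was", "born", "in", "England", "."]
marked = add_entity_markers(tokens, head_span=(0, 1), tail_span=(4, 5))
```

Here `marked` wraps "Newton" in [E1]...[/E1] and "England" in [E2]...[/E2], making the argument positions explicit to the encoder.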

Intra Bag Attention
To obtain a representation of bag B, we apply selective sentence-level attention (Lin et al., 2016). We obtain a real-valued vector B for the bag as a weighted sum of the sentence representations x̃_i^j:

B = Σ_{i,j} α_i^j x̃_i^j

where α_i^j measures the attention score of x̃_i^j with respect to a specific relation r:

α_i^j = exp(x̃_i^j · q_r) / Σ_{i',j'} exp(x̃_{i'}^{j'} · q_r)

with q_r the query vector of relation r. This reduces the effect of noisy labels on the final bag representation. Finally, we compute o, which represents the scores for all relation types, and obtain the conditional probability p(r|B, θ) = softmax(o).
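The selective attention step can be sketched in NumPy as follows, using a dot-product score between sentence representations and a relation query vector (a simplification of the scoring form used by Lin et al. (2016); dimensions and inputs are invented for illustration):

```python
import numpy as np

def bag_attention(X, q_r):
    """Selective attention over a bag: X has one row per sentence
    representation; q_r is the query vector for a relation r."""
    scores = X @ q_r                      # e_j = x_j . q_r
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # softmax over sentences in the bag
    return alpha @ X                      # weighted sum -> bag representation

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))   # a bag of 3 sentences, 8-dim representations
q_r = rng.normal(size=8)
b = bag_attention(X, q_r)
```

Sentences whose representations align with the relation query receive larger weights, so a single mislabelled sentence in the bag contributes little to the bag vector.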
Here o = RB, where R is the matrix of relation representations. Our objective function is the cross-entropy loss, defined as:

J(θ) = − Σ_{k=1}^{b} log p(r_k | B_k, θ)

where b denotes the number of bags in our training data.

Experiments and Analysis

Comparison of datasets
DS-RE is modelled as a multi-instance multi-label (MI-ML) task. RELX-Distant contradicts the multi-label assumption, as there exists no entity pair with more than one relation label as the ground truth. This is because relation types in RELX-Distant are not fine-grained, and many of them are mutually exclusive. For instance, the person-person relations in RELX-Distant are: mother, spouse, father, sibling, partner. For a given person-person entity pair, there will almost always be exactly one possible relation in the knowledge base. This is in fact the case for all relation classes in the RELX-Distant dataset. Evaluating a classifier on such a dataset is not indicative of how it will perform in a real-world setting, where the relation types are much more fine-grained.
Ideally, a dataset for DS-RE should have a sufficient number of multi-label bags. Further, instances should be evenly distributed between relation classes so that a model cannot simply ignore classes with few examples to increase its accuracy. Our DiS-ReX dataset has the following attributes:

• DiS-ReX has 21,642 (~10%) bags with more than one relation label. An entity pair can have up to 5 possible relations. An example of a bag with 4 relations is ('Isaac Newton', 'England'), labelled with 'http://dbpedia.org/ontology/birthPlace', 'http://dbpedia.org/ontology/country', 'http://dbpedia.org/ontology/deathPlace' and 'http://dbpedia.org/ontology/nationality'.

• Moreover, DiS-ReX has inverse relations (unlike RELX-Distant), which ensures that the model learns that an entity pair is ordered. Examples include: (successor, predecessor), (influencedBy, influenced), (previousWork, subsequentWork), (associatedBand, bandMember).

• In real-world datasets, the model must also learn to predict whether two entities are related at all. Hence, our dataset contains instances of the "NA" class with a percentage similar to the NYT dataset.
To compare the imbalance among non-NA relation classes in DiS-ReX and RELX-Distant, we calculate the normalized entropy (Shannon, 1948), also known as efficiency, over the distribution of relation classes. For k classes, where the i-th class has n_i instances and the total number of instances across all k classes is n:

η = (− Σ_{i=1}^{k} (n_i/n) log(n_i/n)) / log k

Efficiency lies between 0 and 1; higher efficiency means the class distribution is closer to uniform. We report the efficiency and the percentage of instances in the largest non-NA relation class for RELX-Distant and DiS-ReX in Table 2. We find that there is high imbalance in the RELX-Distant dataset, which contributes to how easily a baseline model obtains close to 0.95 micro-F1.
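The efficiency measure can be computed directly from per-class instance counts:

```python
import math

def efficiency(counts):
    """Normalized entropy (efficiency) of a class distribution:
    H(p) / log k with p_i = n_i / n. Returns 1 for a perfectly
    uniform distribution, and approaches 0 as one class dominates."""
    n = sum(counts)
    k = len(counts)
    h = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return h / math.log(k)
```

For example, a heavily skewed distribution such as [100, 1, 1] scores far lower than a near-uniform one such as [34, 33, 33].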

BERT Encoder + Attention Baselines
We run our baseline mBERT + bag attention model on both DiS-ReX and RELX-Distant and report AUC and Micro F1 in Table 3. For training, we use the AdamW optimizer (Kingma and Ba, 2017; Loshchilov and Hutter, 2019) with lr = 0.001, betas = (0.9, 0.999) and eps = 1e-08. Weight decay is 0.01 for all parameters except bias and layer-norm parameters. We follow the training pipeline of Lin et al. (2016) and set the bag size to 2; in one forward pass, our network processes two sentences belonging to the same bag. We train our model for 5 epochs on both splits of both datasets, jointly on all languages. Correct prediction of the NA class is not counted in the calculation of Micro F1 and AUC. As can be seen in Table 3, our baseline model achieves very high micro-F1 scores on the test set of RELX-Distant. On DiS-ReX, in contrast, the numbers are lower (and similar to the state of the art in monolingual DS-RE), suggesting that DiS-ReX is a more realistic and challenging dataset for our task.

Table 1: Key statistics for DiS-ReX

Table 3: AUC and Micro F1 on the test sets of RELX-Distant and DiS-ReX

References

Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Neural relation extraction with multi-lingual attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34-43.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124-2133.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003-1011.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148-163. Springer.

Claude E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423.