mOKB6: A Multilingual Open Knowledge Base Completion Benchmark

Automated completion of open knowledge bases (open KBs), which are constructed from triples of the form (subject phrase, relation phrase, object phrase) obtained via an open information extraction (open IE) system, is useful for discovering novel facts that may not be directly present in the text. However, research in open KB completion (open KBC) has so far been limited to resource-rich languages like English. Using the latest advances in multilingual open IE, we construct the first multilingual open KBC dataset, called mOKB6, containing facts from Wikipedia in six languages (including English). Improving the previous open KB construction pipeline by performing multilingual coreference resolution and keeping only entity-linked triples, we create a dense open KB. We experiment with several models for the task and observe a consistent benefit of combining languages with the help of a shared embedding space as well as translations of facts. We also observe that current multilingual models struggle to remember facts seen in languages of different scripts.


Introduction
Open information extraction (IE) systems such as ReVerb (Fader et al., 2011), OpenIE6 (Kolluru et al., 2020), and GEN2OIE (Kolluru et al., 2022) can extract triples, or facts, of the form (subject phrase, relation phrase, object phrase), denoted (s, r, o), from text (e.g., Wikipedia articles) without using any pre-defined ontology. An open knowledge base (KB) is constructed from these open IE triples: the subject and object phrases are nodes, and the relation phrases are edges connecting the nodes in the graph. Open knowledge base completion (KBC) is the task of discovering new links between nodes using the graph structure of the open KB. Knowledge graph embedding models are typically used for the open KBC task, where they are asked to answer queries of the form (s, r, ?) and (?, r, o).
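As a minimal illustration (with made-up phrases), an open KB can be represented as an adjacency structure keyed by noun-phrase nodes, and an open KBC query (s, r, ?) then asks a model to rank all nodes as candidate objects:

```python
from collections import defaultdict

# Toy open KB: nodes are subject/object phrases, edges are relation phrases.
triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Honolulu", "is located in", "Hawaii"),
]

graph = defaultdict(list)
for s, r, o in triples:
    graph[s].append((r, o))  # directed edge s --r--> o

# The query ("Barack Obama", "was born in", ?) should rank "Honolulu" highest.
```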
Research in open KBC has been restricted to English (e.g., ReVerb20K (Galárraga et al., 2014)) due to the lack of open KBs in other languages. We aim to study multilingual open KBC, with the motivation that the information available in high-resource languages like English may help when inferring links in open KBs that use low-resource languages like Telugu. Moreover, intuitively, if all the information in different languages can be pooled together, it may help the model learn better and allow information to flow across the open KBs in the different languages.
We design the first multilingual open KB construction pipeline (shown in Figure 1) using a multilingual open IE (mOpenIE) system, GEN2OIE (Kolluru et al., 2022). We find that coreference resolution is missing in existing open KB construction (Gashteovski et al., 2019) but is important for increasing the coverage of facts (as illustrated in Figure 3). Since end-to-end multilingual coreference resolution (mCoref) models that work for many languages are absent in the literature, we re-train a recent coref model (Dobrovolskii, 2021) using XLM-R (Conneau et al., 2020) as the underlying multilingual encoder and add it to our pipeline. To construct a high-quality test set, we use 400 manually verified facts in English. To extend to the other languages, we automatically translate the English facts using Google Translate. The dataset thus constructed, called mOKB6, contains 40.5K facts in six languages: English, Hindi, Telugu, Spanish, Portuguese, and Chinese.
We report the first baselines for the multilingual open KBC task using state-of-the-art knowledge graph embedding (KGE) models. We find that they are indeed able to benefit from information in multiple languages, compared to using facts from a single language. However, we also notice that although the multilingual encoders memorize facts in a particular language, they struggle to remember the same fact when queried in another language.

KGE models such as ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and TransE (Bordes et al., 2013) have been used for open KBC (Gupta et al., 2019; Chandrahas and Talukdar, 2021; Broscheit et al., 2020). Given a triple (s, r, o), these models encode the subject, relation, and object from free text, and pass the encodings to a triple-scoring function that is optimized using binary cross entropy loss.
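A minimal sketch of this training setup, using a toy mean-of-embeddings phrase encoder and a TransE-style score (real systems use mBERT or GRU encoders; all names and data here are illustrative):

```python
import torch
import torch.nn as nn

class OpenKBCScorer(nn.Module):
    """Toy open KBC model: encode each phrase as the mean of its token
    embeddings, then score the triple TransE-style (s + r close to o)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, s_tokens, r_tokens, o_tokens):
        s, r, o = (self.embed(t) for t in (s_tokens, r_tokens, o_tokens))
        return -torch.norm(s + r - o, dim=-1)  # higher = more plausible

model = OpenKBCScorer(vocab_size=1000)
loss_fn = nn.BCEWithLogitsLoss()

# A batch of 4 (s, r, o) phrases, each padded to 5 token ids (made-up data).
s = torch.randint(0, 1000, (4, 5))
r = torch.randint(0, 1000, (4, 5))
o = torch.randint(0, 1000, (4, 5))
labels = torch.tensor([1., 0., 1., 0.])  # gold triples vs. negative samples

loss = loss_fn(model(s, r, o), labels)  # binary cross entropy, as in the text
loss.backward()
```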

Dataset Curation
We aim to construct a dense open KB that maximizes the information about a given entity, which may be represented as multiple nodes across languages. Therefore, we consider those Wikipedia articles that are available in all six languages: English, Hindi, Telugu, Spanish, Portuguese, and Chinese. This also helps the model learn from facts in a high-resource language like English and answer queries in a low-resource language like Telugu. We work with 300 titles randomly sampled from those common to all six languages (found using MediaWiki-Langlinks (MediaWiki, 2021)). Thus, we extract facts from 6×300 Wikipedia articles. We discuss the three stages of our pipeline below.
Stage 1 We first process each Wikipedia article through a coreference resolution system. Although language-specific end-to-end neural coref models have been developed (Xia and Van Durme, 2021), multilingual models that work on all our languages of interest are absent in the literature. Therefore, we retrain wl-coref (Dobrovolskii, 2021) with XLM-R (Conneau et al., 2020) on the English training data (available in OntoNotes (Weischedel et al., 2013)) so that it can work zero-shot on the other languages.
Coref models detect and cluster mentions, but do not identify a canonical cluster name, which is needed for standardizing all the mentions in a cluster. We employ a heuristic to find the cluster name and replace each of the coreferent mentions with it. Each mention is scored with a tuple: Score(mention phrase) = (#proper nouns, #nouns, #numerals, #adjectives, #pronouns, #verbs). Tuples are compared index-wise, with higher priority given to lower indices, and the best-scoring mention is chosen as the canonical name (Table 1).
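A minimal sketch of this heuristic, assuming each mention comes with POS tags (e.g., from Stanza); the data layout is hypothetical:

```python
from collections import Counter

# POS categories in priority order, matching the score tuple above:
# (#proper nouns, #nouns, #numerals, #adjectives, #pronouns, #verbs).
POS_PRIORITY = ["PROPN", "NOUN", "NUM", "ADJ", "PRON", "VERB"]

def score(pos_tags):
    """Map a mention's POS tags to its score tuple."""
    counts = Counter(pos_tags)
    return tuple(counts[tag] for tag in POS_PRIORITY)

def canonical_name(cluster):
    """Pick the best-scoring mention; Python compares tuples index-wise,
    so lower indices (e.g., #proper nouns) get higher priority."""
    return max(cluster, key=lambda m: score(m["pos"]))["text"]

cluster = [
    {"text": "Barack Obama", "pos": ["PROPN", "PROPN"]},  # score (2,0,0,0,0,0)
    {"text": "He", "pos": ["PRON"]},                      # score (0,0,0,0,1,0)
]
print(canonical_name(cluster))  # -> Barack Obama
```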

Table 1: Choosing the canonical cluster name by comparing mention scores.

Mentions        Scores           Cluster Name
Barack Obama    (2,0,0,0,0,0)    Barack Obama
He              (0,0,0,0,1,0)

Stage 2 The coreference-resolved articles are passed through GEN2OIE to extract open IE triples.

Stage 3 We discard triples that have empty or very long arguments or have low confidence (less than 0.3, as assigned by GEN2OIE). We further keep only those triples that have the article's title as either the subject or the object, to avoid overly generic or overly specific triples that are valid only in their particular context. Examples of contextual triples are discussed in Appendix D, and further details about the data curation are given in Appendix A.
These automatically extracted triples form the train set of mOKB6. To form a high-quality test set in six languages with limited access to experts in all the languages, we create the test set in a semi-automatic way. We sample 1600 English triples from the train set and manually remove noisy triples. We then automatically translate the remaining 627 English triples. To translate a triple, we convert it to a sentence after removing tags and use Google Translate to translate the triple-converted sentence into the remaining five languages. Since the inputs to the translation system are short, the observed translation quality is quite high, with 92% satisfactory translations, as determined by a native speaker of a translated language on a set of 50 extractions. To recover the open IE subject, relation, and object tags, we project the labels from the original English triple onto the translated sentence using word alignments (Kolluru et al., 2022), as illustrated in Figure 2. Finally, we are left with 400 triples in each language after removing examples where some label could not be aligned. We use these 6×400 triples as the test set for each language. The train and dev sets are created from the remaining triples in each language, such that the dev set has 200 randomly sampled triples.


Experiments

We train KGE models in several settings: on (1) the open KB of a single language (MONO), (2) the union of the open KBs of all six languages (UNION), (3) the English triples translated into the query language (TRANS), or (4) the language-specific triples added to the translated triples (MONO+TRANS). We also train a multilingual model, (5) UNION+TRANS, in which the training data is the union of the open KBs together with the English triples translated into all five languages. The performance across settings is reported in Table 4. UNION outperforms MONO in all languages by an average of 4.4% H@10 and 2.6% MRR, which provides evidence of information flow across languages. To check the extent of the flow from (high-resource) English to the other low-resource languages, we also apply SimKGC to the five languages other than English, which we call UNION w/o En. We find that UNION w/o En still outperforms MONO by 2.3% H@10 and 1.5% MRR over the five languages, hinting that interlingual transfer is more general and pervasive.
We also find that MONO+TRANS is better than MONO by a large margin of 18.7% H@1, 32.6% H@10, and 23.4% MRR, averaged over all languages. Likewise, UNION+TRANS is better than UNION, except on English, by 21% H@10 and 15.1% MRR. This suggests that the model learns better from English facts when they are available in the query language, unlike in UNION or MONO, where the English facts are present only in English.
We also highlight that UNION+TRANS is worse than MONO+TRANS by an average of 2.3% MRR over the five languages, and worse than UNION by 1.5% MRR in English. This is because the English facts in UNION+TRANS are repeated six times, once in each of the six languages. This likely hurts generalization due to overfitting on the repeated facts.

Language Transfer Capability of mBERT
Language models like mBERT have shown strong cross-lingual performance due to shared multilingual embeddings (Wu and Dredze, 2019). This is also evident in our results, where UNION performs better than MONO. We investigate the language transfer capability of mBERT by showing the model a fact in English and querying the same fact in the other five languages.
We take MONO trained on the English facts of the mOKB6 dataset and train it further on the English test set. We then evaluate this model on the test sets of the other five languages in mOKB6. Note that these test sets contain the same facts, but in different languages. We call this experiment MEMORIZE, and the results are reported in Table 4.
We find that the model, which has seen the facts in English (98% H@10), struggles to answer the same facts in the other five languages, as indicated by scores as low as 7.6% H@10 in Telugu. This strongly suggests that mBERT learns language-specific information rather than language-agnostic embeddings.

Conclusion
We create the mOKB6 dataset, the first multilingual open knowledge base completion dataset, containing facts curated from Wikipedia articles in six languages. Its construction pipeline makes use of multilingual coreference resolution and filtering steps to improve the quality of automatically extracted facts. We also report the first baselines on the novel task of multilingual open KBC using existing state-of-the-art KGE models. We discuss several methods to study the task, which surface open research challenges, such as how to remove contextual facts, which are not relevant for open KBC, and how to learn language-agnostic embeddings in KGE models. We will release the resources and baselines to enable further research on the new task.
Limitations

Although multilingual, the constructed open KB is limited by the sampling of the chosen six languages. We do not know how well the system will generalize to language families that have not been considered here. Further, even among the languages considered, the performance of even the best-performing systems, as measured by H@1, is still in the low 20s. Therefore, the models are not yet ready to be deployed for real-world applications.

mOKB6: A Multilingual Open Knowledge Base Completion Benchmark
(Appendix)

A Dataset Curation
As discussed in Section 3, we construct the mOKB6 dataset in three stages after extracting the Wikipedia articles (using WikiExtractor) from the Wikidump of April 02, 2022. We run our construction pipeline (shown in Figure 1) for all six languages on a single V100 (32 GB) GPU, which required 14 hours of computation to create the mOKB6 dataset.
In the first stage, we keep sentences containing at least 6 and at most 50 tokens, since we find that most short sentences are headings or sub-headings in the Wikipedia articles, and very long sentences cannot be input to GEN2OIE (in the second stage) due to the maximum sequence length of 1024 in the mT5 (Xue et al., 2021) based GEN2OIE. This filtering step discards 18.9% of sentences on average across the six languages. We use Stanza (Qi et al., 2020) to perform sentence- and word-segmentation on the Wikipedia articles in all six languages. After filtering the sentences, the articles are processed for coreference resolution using the XLM-R (Conneau et al., 2020) encoder based wl-coref (Dobrovolskii, 2021), followed by replacing the coreferent cluster mentions with their canonical cluster name using the heuristic discussed in Section 3.
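A minimal sketch of this sentence filter, assuming the relevant Stanza models have been downloaded (language code and function name are illustrative):

```python
import stanza

# Tokenize with Stanza; "en" is one of the six language codes used here.
nlp = stanza.Pipeline("en", processors="tokenize")

def filter_sentences(article_text, min_tokens=6, max_tokens=50):
    """Keep sentences whose token count is within [min_tokens, max_tokens]."""
    doc = nlp(article_text)
    return [sent.text for sent in doc.sentences
            if min_tokens <= len(sent.words) <= max_tokens]
```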
In the second stage, the coreference resolved articles are passed through GEN2OIE to get the open IE triples. The confidence scores for these triples are computed using label rescoring, for which we refer the readers to Kolluru et al. (2022) for more details.
Finally, in the last stage, we apply various filters, adapted from Gashteovski et al. (2019), to remove triples that are of no interest to the open KBC task, namely triples that: (1) have any argument or the relation empty, (2) contain more than 10 tokens in any argument or the relation, (3) have a confidence score of less than 0.3, (4) contain pronouns (found using Stanza) in their arguments, (5) have the same subject and object (i.e., self loops), or (6) are duplicates. These filters keep 91.6% of the triples obtained from stage 2 across the six languages.
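A minimal sketch of these six filters (the pos_tags lookup and data layout are hypothetical):

```python
def keep_triple(subj, rel, obj, conf, pos_tags, seen):
    """Return True iff the triple (subj, rel, obj) passes all six filters.

    pos_tags: maps a phrase to its list of POS tags (e.g., from Stanza).
    seen: set of triples kept so far, used for de-duplication.
    """
    parts = (subj, rel, obj)
    if any(not p.strip() for p in parts):                # (1) empty argument/relation
        return False
    if any(len(p.split()) > 10 for p in parts):          # (2) over 10 tokens
        return False
    if conf < 0.3:                                       # (3) low GEN2OIE confidence
        return False
    if any("PRON" in pos_tags[p] for p in (subj, obj)):  # (4) pronoun in an argument
        return False
    if subj == obj:                                      # (5) self loop
        return False
    if (subj, rel, obj) in seen:                         # (6) duplicate
        return False
    seen.add((subj, rel, obj))
    return True
```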
Further, in the last stage, in order to create a dense open KB containing minimal noise and maximal facts about the entities, we keep the triples having the Wikipedia article's title as either the subject phrase or the object phrase and discard the rest. We do this by finding all the coreference clusters (of entity mentions) that contain the title, then getting the entities, or cluster names, of those clusters using the heuristic discussed in Section 3, and keeping those triples that contain these cluster names. This filtering step retains 23.6% of the triples on average.

B Evaluation Metrics

The commonly used evaluation metrics are hits at rank N (H@N), where N is a natural number, and mean reciprocal rank (MRR). Suppose the rank of o in the list of entities ranked by the model is R. Then, H@N measures the fraction of queries for which R is less than or equal to N, and MRR is the average of the reciprocal ranks (1/R). Both H@N and MRR are computed as averages over both forms of queries, (s, r, ?) and (?, r, o), over the complete test set.
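In standard notation, over the set Q of test queries (of both forms), with R_i the rank assigned to the gold entity for query i:

```latex
\mathrm{H@}N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[ R_i \le N \right],
\qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{R_i}
```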
C Implementation Details

We use the publicly available code of the KGE models SimKGC (Wang et al., 2022), GRU-ConvE (Kocijan and Lukasiewicz, 2021), and CaRe (Gupta et al., 2019) for our experiments. The models are trained on 2 V100 (32 GB) GPUs with three different random seeds, and we report the average of the three evaluation runs.
SimKGC is a text-based KGE model that uses two unshared pretrained BERT (Devlin et al., 2019) models to encode the (subject phrase; relation phrase) pair and the object phrase separately. GRU-ConvE (Kocijan and Lukasiewicz, 2021) encodes both the relation phrase and the argument phrases from their surface forms using two unshared GRUs (Cho et al., 2014). CaRe (Gupta et al., 2019) learns a separate embedding for each argument phrase and uses a bi-directional GRU to encode the relation phrase from its surface form. Both GRU-ConvE and CaRe are initialized with GloVe embeddings (Pennington et al., 2014).
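A rough sketch of SimKGC-style bi-encoder scoring (not the authors' exact code; the checkpoint name and [SEP] formatting are assumptions), shown with an mBERT checkpoint since our setting is multilingual:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc_sr = AutoModel.from_pretrained("bert-base-multilingual-cased")  # encodes (s; r)
enc_o = AutoModel.from_pretrained("bert-base-multilingual-cased")   # encodes o (unshared)

def embed(encoder, texts):
    """Encode texts and return L2-normalized [CLS] vectors."""
    batch = tok(texts, padding=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

# Answer the query (s, r, ?) by ranking candidate object phrases.
query = embed(enc_sr, ["Barack Obama [SEP] was born in"])
candidates = embed(enc_o, ["Honolulu", "Chicago", "Hawaii"])
scores = query @ candidates.T  # cosine similarities; higher = more plausible
print(scores)
```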
We do not perform hyperparameter search, except for the batch size, and use the default hyperparameters from the respective codebases of the KGE models (see Table 5). We use early stopping to find the best model checkpoints based on H@1. The dev set differs across baselines: MONO, TRANS, and MONO+TRANS use the individual language's dev set, whereas the multilingual baselines UNION w/o En, UNION, and UNION+TRANS use the English dev set.

D Contextual Triples
Open IE triples are of various kinds, and not all of them can be used for the open KBC task. Various filtering steps remove some of these during data curation (Section 3). We define contextual triples as another kind of noisy triple: triples that are specific to, and not interpretable outside of, the context of the text from which they are extracted.
Table 6: Examples of contextual triples.
(Max Born; continued; scientific work)
(Robb Gravett; won; the championship)
(George Herbert Walker Bush; was; out of touch)
(Christianity; is; dominant)

From the first two triples in Table 6, it is unclear which scientific work Max Born continued, or which championship Robb Gravett won. The last two triples are too specific to their context and contain no factual information.

[Figure 3: Our pipeline (shown by blue arrows) increases the coverage of facts due to the mCoref system, e.g., recovering the fact (Barack Obama; returned to Honolulu, Hawaii in; 1971).]