LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction

Open Information Extraction (OIE) systems seek to compress the factual propositions of a sentence into a series of n-ary tuples. These tuples are useful for downstream tasks in natural language processing like knowledge base creation, textual entailment, and natural language understanding. However, current OIE datasets are limited in both size and diversity. We introduce LSOIE, a large-scale OIE dataset built by converting the QA-SRL 2.0 dataset. Our LSOIE dataset is 20 times larger than the next largest human-annotated OIE dataset. We construct and evaluate several benchmark OIE models on LSOIE, providing baselines for future improvements on the task. Our LSOIE data, models, and code are made publicly available.


Introduction
Open Information Extraction (OIE) (Banko et al., 2007) aims to automatically extract all factual propositions of a sentence into a series of n-ary tuples. For example, the sentence "the cook baked and ate the cake" would produce two extractions representing the two basic propositions of the sentence: (the cook, ate, the cake) and (the cook, baked, the cake). In OIE, extraction arguments are required to be contiguous spans from the sentence and the resulting tuple should be intelligible as natural text when read in order. The schema-free nature of OIE provides a flexible framework in which to capture semantic relations between entities in natural language text. Open Information Extraction tuples are useful to a variety of downstream tasks including knowledge base creation (Zhang et al., 2019), textual entailment (Levy et al., 2014), and other natural language understanding tasks (Mausam, 2016). Open Information Extraction relations may be explicitly stated by verbal predicates, or implicitly stated through nominalizations. In this paper, we focus only on explicit extractions. With the original goal of OIE as web-scale information extraction (Banko et al., 2007), an OIE system can focus solely on explicit extractions because the redundancy of language will inevitably display implicit information elsewhere.
Interest in OIE has grown, both in terms of the types of models that can be applied to tackle OIE (Cui et al., 2018; Stanovsky et al., 2018; Jiang et al., 2019) and in terms of the downstream applications to which OIE can be applied (Mausam, 2016; Zhang et al., 2019). As the interest in OIE grows, however, so too should the scale of the corpora available for training and evaluating OIE models.
In this paper, we expand the reach and quality of OIE data by developing a new dataset, LSOIE, which is built by converting the QA-SRL BANK 2.0 dataset (FitzGerald et al., 2018) to the task of OIE. Our new dataset contains almost ten times as many extractions and about 20 times as many sentences as previous OIE datasets built from human annotations (see Table 1). We benchmark LSOIE with several models, providing baseline results for future research. Our LSOIE dataset, models, and code are publicly available.
Figure 1: An example annotated sentence from QA-SRL 2.0 (FitzGerald et al., 2018). In this case, the annotations are derived from the questions and answers: Where does someone provide something? In Asian countries. Who provides something? physicians. What is being provided? drugs. The extracted tuple in our new LSOIE dataset is (physicians, provide, drugs, in Asian countries).
Converted from crowdsourcing: Stanovsky and Dagan (2016) created the OIE2016 dataset by converting the crowd-annotated QA-SRL (He et al., 2015) dataset's question-answer pairs to OIE extraction relations. Similarly, Stanovsky et al. (2018) generated the AW-OIE dataset by converting the crowd-annotated Question Answer Meaning Representation (QAMR) dataset's question-answer pairs. The OIE2016 and AW-OIE datasets were the first datasets used for supervised OIE. These datasets provided the basis for supervised approaches in NLP, but they are small and their extractions lack accuracy, because arguments are emitted in the order in which question-answer pairs appear in the base dataset.
Model-derived: Cui et al. (2018) and Jia et al. (2018) generate large derivative training datasets by running rule-based models and keeping high-confidence extractions for downstream tasks. Similarly, Gashteovski et al. (2019) introduce the largest OIE dataset to date (over 340M triples) by deriving extractions from MinIE (Gashteovski et al., 2017) with the goal of automatically constructing a knowledge base. While model-derived datasets are useful for knowledge base construction, using them for downstream tasks teaches the new model to replicate the behavior of the original, often noisy, base model.
Directly crowdsourced: Bhardwaj et al. (2019) point out that the evaluation framework used in Stanovsky and Dagan (2016) is rather noisy and the tuple matching algorithm is overly lenient because it only looks at lexical overlap for the whole extraction, ignoring the ordering of arguments. Bhardwaj et al. (2019) provide an alternative evaluation set that has been crowdsourced specifically for OIE, annotating 1,282 sentences. While this dataset is useful for the evaluation of OIE systems, its format differs from other work in OIE: the predicate entry in CARB (Bhardwaj et al., 2019) tuples contains context that is often broken into separate tuples by other OIE systems.

The QA-SRL Bank 2.0
In QA-SRL, each predicate-argument relationship in a sentence is labeled manually with a question-answer pair. FitzGerald et al. (2018) design a large-scale crowdsourcing annotation pipeline to incentivize extensive and accurate coverage. Relative to the original QA-SRL annotations (He et al., 2015), which were collected from 10 hired freelance workers, the new QA-SRL dataset achieves similar precision (95.7% versus 97.5%) and lower recall (72.4% versus 86.6%). Relative to PropBank (Palmer et al., 2005), an expert annotation system designed to capture all semantic roles in a sentence, the QA-SRL 2.0 authors find that their work achieves 95% precision and 85% recall. FitzGerald et al. (2018) then build a supervised QA-SRL parser and extend the reach of their dataset by over-generating new candidate question-answer pairs and passing them through their validation process.
The QA-SRL paradigm is well-suited to be a precursor to OIE extractions, as it captures predicate-argument relations in a schema-free way.

The LSOIE Dataset
Our work expands upon and addresses the shortcomings present in Stanovsky and Dagan (2016) and Stanovsky et al. (2018). We apply a conversion process similar to the one used for OIE2016 to the QA-SRL BANK 2.0 dataset. In addition, we implement novel conversion heuristics to ensure data quality and to order arguments. The result is LSOIE, an OIE dataset that is much larger and more diverse than prior work.

LSOIE Conversion Process
We produce LSOIE via conversion from QA-SRL in the same manner as Stanovsky and Dagan (2016), with several important changes to adapt their method to the QA-SRL BANK 2.0.
A QA-SRL annotation for a predicate $p$ consists of a list of questions $Q = \{q_0, \ldots, q_n\}$, and a set of answer spans $A_i = \{a_{i0}, \ldots, a_{in_i}\}$ for each question $q_i$. For each tuple $(a_0, \ldots, a_k)$ in the Cartesian product $\times_{i=0}^{n} A_i$, we produce the extraction tuple $(a_0, p, a_1, \ldots, a_k)$.
In our example extraction in Figure 1, the target predicate p is provide. The list of questions Q is [Where does someone provide something?, Who provides something?, What is being provided?] The list of arguments A is [In Asian countries, physicians, drugs]. The converted extraction tuple is (physicians, provide, drugs, in Asian countries).
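The conversion step can be illustrated with a minimal Python sketch. It assumes the answer sets have already been filtered and placed in argument order; the function name `qa_srl_to_extractions` and the list-of-lists input layout are hypothetical choices for illustration, not the released implementation.

```python
from itertools import product


def qa_srl_to_extractions(predicate, answer_sets):
    """Build OIE tuples from per-question answer sets.

    answer_sets: one list of answer spans per question, assumed to be
    already filtered and in argument order (hypothetical input layout).
    """
    if not answer_sets:
        return []
    extractions = []
    # Each element of the Cartesian product picks one answer per question.
    for args in product(*answer_sets):
        # The first argument precedes the predicate; the rest follow it.
        extractions.append((args[0], predicate) + tuple(args[1:]))
    return extractions


# The Figure 1 example, with one answer span per question.
figure_1 = [["physicians"], ["drugs"], ["in Asian countries"]]
print(qa_srl_to_extractions("provide", figure_1))
```

When a question has several answer spans, the product yields one extraction per combination, so the number of tuples for a predicate is the product of the answer set sizes.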
To ensure data quality and as a result of differences between the original QA-SRL dataset and the QA-SRL BANK 2.0, we had to make two important changes to the algorithm:
Answer Filtering: The original QA-SRL dataset has a single set of mutually-exclusive answer spans for each question, written by a single annotator. In contrast, the QA-SRL BANK 2.0 has answer judgments from three annotators for each question, some providing answer sets and others marking the questions as invalid. To consolidate these, we only include questions marked as valid by all three annotators. Then, for each question, we iteratively draw the longest remaining answer span that does not overlap with a previously drawn answer span, until there are none left. In answer filtering, our primary motivation was to clean the raw version of crowd workers' answer responses in the QA-SRL 2.0 dataset, where questions can be posed that are not valid or the answer to them is ambiguous. We found it advantageous for dataset quality to require a strict agreement between all annotators. In choosing the longest answer span, we were motivated to not miss relevant portions of the argument, as individual crowd workers occasionally annotated a limited portion of the answer span that did not encapsulate the whole semantic meaning of the derived argument.
Table 2: Example extractions (sentence, then extracted tuple).
Bats are the only mammals that can truly fly.
(Bats, fly)
Greece moved up three to be ranked tenth.
(Greece, ranked, tenth)
A popular student, in 1915 Mao was elected secretary of the Students Society.
(Mao, elected, secretary of the Students Society, in 1915)
The proposed amendment already passed both houses in 2011.
(The proposed amendment, passed, both houses, in 2011)
In polygynous species, males try to monopolize and mate with multiple females.
(males, monopolize, multiple females)
Animals adapted to live in the desert are called xerocoles.
(Animals, adapted, to live in the desert)
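The answer filtering step can be sketched as a greedy selection over spans. This is a minimal illustration under stated assumptions: spans are `(start, end)` token offsets with an exclusive end, an annotator who marked the question invalid is represented by `None`, and ties between equally long spans are broken by start position. None of these representation choices come from the released code.

```python
def select_answer_spans(judgments):
    """Consolidate three annotators' judgments for one question.

    judgments: one entry per annotator, either None (question marked
    invalid) or a list of (start, end) token spans, end exclusive.
    Returns the kept spans in sentence order, or [] if the question
    is dropped.
    """
    # Strict agreement: drop the question if any annotator rejected it.
    if any(j is None for j in judgments):
        return []
    # Longest spans first; ties broken by start index (an assumption).
    candidates = sorted(
        {span for j in judgments for span in j},
        key=lambda s: (-(s[1] - s[0]), s[0]),
    )
    chosen = []
    for start, end in candidates:
        # Keep a span only if it overlaps none of the spans drawn so far.
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


print(select_answer_spans([[(0, 2)], [(0, 3)], [(1, 3), (5, 6)]]))
```

In this toy input the span `(0, 3)` is drawn first, which excludes the shorter overlapping spans `(0, 2)` and `(1, 3)`, while the disjoint span `(5, 6)` is kept as a second answer.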
Argument Ordering: Stanovsky and Dagan (2016)'s original algorithm relies on the original, annotator-written order of QA-SRL questions, which may or may not produce a sensible argument ordering. Furthermore, in the QA-SRL BANK 2.0, the original order in which the questions were written is unavailable.
So, to determine argument order, we use a heuristic based on the relative order between answer spans for each question in their source text. We consider the abstract form of questions, which includes verb tense without information about its lemma. For a given question $q_i$ in an extraction, let $q_{ix}$ represent the percentage of predicates in the QA-SRL BANK 2.0 where the answer span to the generalized version of $q_i$ appears in the $x$-th place relative to other answer spans, according to the natural order of the sentence. For each argument slot in the derived extraction, the answer to the question with the highest probability $q_{ix}$ of naturally occurring in that slot is chosen as the argument. In our example extraction in Figure 1, the question Who [predicate] something? precedes What is being [predicate]? which precedes Where does someone [predicate] something?, enabling our algorithm to accurately extract argument ordering, which is not available from the natural ordering of the sentence or the ordering of crowd annotations in FitzGerald et al. (2018).
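The ordering heuristic can be sketched in two steps: estimate the positional statistics $q_{ix}$ from a corpus, then fill argument slots from left to right. The sketch below makes assumptions the text does not specify: each slot is filled greedily from the questions not yet placed, unseen question/slot pairs get probability 0, and the three-predicate corpus is invented toy data. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def position_stats(corpus):
    """Estimate q_ix: how often each abstract question's answer lands in slot x.

    corpus: one list per predicate of (abstract_question, answer_start) pairs.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for qa_pairs in corpus:
        # Rank answers by where they start in the sentence.
        ordered = sorted(qa_pairs, key=lambda qa: qa[1])
        for slot, (question, _) in enumerate(ordered):
            counts[question][slot] += 1
            totals[question] += 1
    return {q: {x: c / totals[q] for x, c in slots.items()}
            for q, slots in counts.items()}


def order_arguments(questions, stats):
    """Fill each slot with the remaining question most likely to occur there."""
    remaining = list(questions)
    ordered = []
    for slot in range(len(questions)):
        best = max(remaining, key=lambda q: stats.get(q, {}).get(slot, 0.0))
        ordered.append(best)
        remaining.remove(best)
    return ordered


WHO = "Who [predicate] something?"
WHAT = "What is being [predicate]?"
WHERE = "Where does someone [predicate] something?"

# Invented toy corpus: answer start offsets for three predicates.
toy_corpus = [
    [(WHO, 0), (WHAT, 2), (WHERE, 4)],
    [(WHO, 0), (WHAT, 3)],
    [(WHERE, 0), (WHO, 3), (WHAT, 5)],
]
stats = position_stats(toy_corpus)
print(order_arguments([WHERE, WHO, WHAT], stats))
```

With these toy statistics the Who question wins slot 0 and the What question wins slot 1, so the Where answer is placed last even though it comes first in the input list.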

Dataset Statistics
We run our updated dataset conversion process over the directly crowdsourced portion of the train, development, and test partitions of the QA-SRL BANK 2.0. Stratifying the resulting data by domain, we present the new LSOIE corpus in two sections, LSOIE-wiki and LSOIE-sci. Dataset statistics are shown in Table 1. Example extractions are shown in Table 2. We provide the distribution of argument, predicate, and null tag labels in Figure 2. The LSOIE corpus expands the scope of OIE2016 and AW-OIE in size, textual diversity, and domain.

Benchmark Evaluation
Models: We evaluate several models on our new LSOIE dataset. Following Stanovsky et al. (2018), we model OIE as a supervised learning problem and format it as BIO tagging with tunable thresholding on extractions. We benchmark several model variants:
• rnnoie is a replication of the model in Stanovsky et al. (2018), based on a bidirectional LSTM transducer over GloVe embeddings (Pennington et al., 2014) and learned part-of-speech embedding features.
• ls_oie is a replication of rnnoie trained on LSOIE.
• ls_oie_crf is the same as ls_oie, but trained end-to-end with a Conditional Random Field on top to capture BIO transition constraints and trained to maximize the likelihood of the gold BIO sequence.
• srl_bert_ls is based on ls_oie, but uses BERT (Devlin et al., 2019) as the bidirectional encoder and the Sentence A / Sentence B embedding feature as the predicate indicator, inspired by Shi and Lin (2019).
• srl_bert_oie2016 is the same architecture as srl_bert_ls but applied to the OIE2016 data.
• *_sci models were trained with the same architectures applied only to the LSOIE-sci training set.
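To make the BIO formulation concrete, the following sketch encodes one extraction from the introduction's example sentence as a tag sequence. The label names (`B-A0`, `I-A0`, `B-P`, `O`) and the span representation are illustrative assumptions and may differ from the label scheme used in the released data.

```python
def to_bio(tokens, spans):
    """Encode one extraction as a BIO tag sequence.

    spans: maps a role name (e.g. "A0", "P", "A1") to a (start, end)
    token span, end exclusive. Role names here are illustrative.
    """
    tags = ["O"] * len(tokens)
    for role, (start, end) in spans.items():
        tags[start] = f"B-{role}"        # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{role}"        # continuation tokens
    return tags


tokens = "the cook baked and ate the cake".split()
# Extraction (the cook, baked, the cake) from the introduction.
spans = {"A0": (0, 2), "P": (2, 3), "A1": (5, 7)}
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Each predicate in a sentence gets its own tag sequence, so the second extraction (the cook, ate, the cake) would be a separate training instance over the same tokens.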

Experiments and Evaluation: We use the AllenNLP framework (Gardner et al., 2018) built on PyTorch (Paszke et al., 2019) to implement, train, and test our models. We train rnnoie and srl_bert_oie2016 on OIE2016 and ls_oie and srl_bert_ls on LSOIE-wiki. We also train a science-focused series of models (*_sci) only on LSOIE-sci. We do not evaluate *_sci models on LSOIE-wiki. We limit our evaluation to supervised OIE systems. We evaluate each system's performance against the gold test data in LSOIE-wiki and LSOIE-sci by considering extractions to be a match if they contain the same predicate as the gold extraction and contain the syntactic head of each gold argument. Syntactic heads are extracted with the Stanford CoreNLP dependency parser (Chen and Manning, 2014). Although it would be ideal to have the gold syntactic head, this method is preferable to taking the lexical overlap of the entire extraction (Stanovsky and Dagan, 2016), which ignores argument tags and ordering, as pointed out in Lechelle et al. (2019).
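The matching criterion can be sketched as a small predicate over one predicted and one gold extraction. The sketch assumes the syntactic heads of the gold arguments have already been computed by a dependency parser and are passed in as plain tokens. It also assumes that the head of gold argument i must appear in predicted argument i, which is one reading of the criterion; the function name and input layout are hypothetical.

```python
def is_match(pred_predicate, pred_args, gold_predicate, gold_heads):
    """Head-based match between a predicted and a gold extraction.

    pred_args: list of token lists, one per predicted argument.
    gold_heads: head token of each gold argument, assumed to be
    precomputed by a dependency parser.
    """
    if pred_predicate != gold_predicate:
        return False
    # A prediction with fewer arguments cannot cover every gold head.
    if len(pred_args) < len(gold_heads):
        return False
    # Slot-aligned check (an assumption): head i must be in argument i.
    return all(head in arg for head, arg in zip(gold_heads, pred_args))


gold_heads = ["cook", "cake"]  # heads of "the cook" and "the cake"
print(is_match("baked", [["the", "cook"], ["the", "cake"]], "baked", gold_heads))
print(is_match("baked", [["the", "cake"], ["the", "cook"]], "baked", gold_heads))
```

Under the slot-aligned reading, the second call fails because the arguments are swapped, which is the kind of ordering error that whole-extraction lexical overlap would not detect.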
We then assign a confidence score to each extraction to allow for tuning the precision-recall tradeoff. For the non-CRF models, we use the mean log probability assigned to the tag labels in the extraction as the confidence score. For the CRF model, we use the log probability assigned to the entire sequence. We differ from Stanovsky et al. (2018), where confidence was calculated as the product of the inverse of the model's estimated probability for each tag label. That score prefers longer extractions, which were more likely to get a 50% lexical match; this outweighed the deficit of swimming upstream against the model's estimated confidence and still produced a downward-sloping precision-recall curve.
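The two scoring schemes for non-CRF models can be contrasted with a short sketch. It assumes the model's probability for each chosen tag label is available as a non-empty list of floats; both function names are hypothetical, and the inverse-probability product follows the description in the text rather than any released code.

```python
import math


def mean_log_prob(tag_probs):
    """Mean log probability of the chosen tag labels (non-empty list)."""
    return sum(math.log(p) for p in tag_probs) / len(tag_probs)


def inverse_prob_product(tag_probs):
    """Product of inverse tag probabilities, as the earlier score is described."""
    return math.prod(1.0 / p for p in tag_probs)


short = [0.9, 0.9]   # two tagged tokens
long = [0.9] * 6     # six tagged tokens, same per-token probability

# The mean log probability is length-normalized, so both score the same.
print(mean_log_prob(short), mean_log_prob(long))
# The inverse product grows with length, favoring the longer extraction.
print(inverse_prob_product(short), inverse_prob_product(long))
```

Because the mean is taken over tag labels, extraction length alone does not change the score, whereas the inverse product increases with every additional tagged token whose probability is below 1.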
We use Viterbi decoding to extract the most likely valid BIO tagging sequence given the model's probability output for each BIO tag. We import the Viterbi algorithm functionality from the AllenNLP library (Gardner et al., 2018). Figure 3 shows precision and recall curves on the LSOIE-wiki test set, accompanied by the ls_oie model's estimated confidence. Table 3 shows F1 and AUC scores for the benchmark models on the LSOIE-wiki and LSOIE-sci test sets.
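Constrained decoding can be illustrated with a standalone pure-Python Viterbi sketch. It is not the AllenNLP routine the experiments import. It assumes per-token log probabilities as nested lists, tag names of the form `B-X`, `I-X`, and `O`, and the usual BIO constraint that `I-X` may only follow `B-X` or `I-X`; the function names are hypothetical.

```python
import math


def allowed(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X (never start a sequence)."""
    if curr.startswith("I-"):
        return prev is not None and prev[0] in "BI" and prev[2:] == curr[2:]
    return True


def viterbi_bio(log_probs, tags):
    """Most likely valid tag sequence; log_probs[t][k] scores tags[k] at token t."""
    if not log_probs:
        return []
    n = len(tags)
    neg_inf = -math.inf
    score = [log_probs[0][k] if allowed(None, tags[k]) else neg_inf
             for k in range(n)]
    backpointers = []
    for t in range(1, len(log_probs)):
        new_score, pointers = [], []
        for k in range(n):
            # Best previous tag among those allowed to precede tags[k].
            cands = [score[j] if allowed(tags[j], tags[k]) else neg_inf
                     for j in range(n)]
            best_j = max(range(n), key=lambda j: cands[j])
            new_score.append(cands[best_j] + log_probs[t][k])
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)
    # Follow backpointers from the best final tag.
    k = max(range(n), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return [tags[k] for k in reversed(path)]


tags = ["O", "B-A0", "I-A0"]
probs = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
print(viterbi_bio(log_probs, tags))
```

In this toy input, picking the best tag per token independently would give `O` followed by `I-A0`, which is not a valid BIO sequence; the constrained search instead returns `B-A0` followed by `I-A0`.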

Discussion
The OIE modeling task is difficult. Results on both evaluation sets show that the BERT model and the CRF output layer improve over the baseline model. Training on LSOIE improves model performance. When science is the target domain, the *_sci models are preferable, as they have slightly higher in-domain performance, showing the value of the domain split in LSOIE.

Error Analysis
We conduct a manual error analysis of the ls_oie model, where we find that our baseline models could benefit from more careful extractions.
Incorrect predicate: At minimum confidence, 53% of the model's precision errors come from verbs that are not present in the gold dataset. Half of these are legitimate predicates that are missing from the gold dataset and the other half are auxiliary verbs that should not be present in the gold dataset. Depending on the deployment environment, the model could be improved with predicate filtering heuristics at prediction time.
Argument Concatenation: We examined 500 incorrect extractions by ls_oie. We found that 36% of unmatched extractions were semantically similar to the gold extraction. These extractions either concatenated arguments $A_1$ through $A_N$ into $A_1$ while gold did not, split these arguments apart while gold did not, or dropped a non-material argument. For future modeling, this is an argument to drop $A_2$ and beyond from the dataset and only model OIE with extraction triples.
True Errors: Among the extraction errors, 2/3 involve errors in argument ordering, often following the natural order of the sentence. The other 1/3 of errors involved the model making nonsensical extractions or not extracting arguments beyond $A_0$, presumably because of lack of confidence and defaulting to the O label.
LSOIE Modeling Improvements: We also manually examined 100 extractions where ls_oie chose the right extraction over rnnoie. In these cases, we found improved argument ordering, increased confidence on relevant $A_1$ objects, and better accuracy identifying subjects that are distant from the predicate.

Conclusion
In this paper, we introduced the LSOIE dataset as a resource for supervised OIE. We have algorithmically re-purposed the QA-SRL BANK 2.0 into a new OIE dataset, LSOIE, which contains over 70,000 sentences and over 150,000 extraction tuples. To benchmark the new dataset, we trained and evaluated a series of supervised OIE models, providing baselines for future research on the OIE modeling task.
The code and datasets introduced in this paper can be found at https://github.com/Jacobsolawetz/large-scale-oie.