DocOIE: A Document-level Context-Aware Dataset for OpenIE

Open Information Extraction (OpenIE) aims to extract structured relational tuples (subject, relation, object) from sentences and plays a critical role in many downstream NLP applications. Existing solutions perform extraction at the sentence level, without referring to any additional contextual information. In reality, however, a sentence typically exists as part of a document rather than standalone; we often need to access relevant contextual information around the sentence before we can accurately interpret it. As there is no document-level context-aware OpenIE dataset available, we manually annotate 800 sentences from 80 documents in two domains (healthcare and transportation) to form the DocOIE dataset for evaluation. In addition, we propose DocIE, a novel document-level context-aware OpenIE model. Our experimental results based on DocIE demonstrate that incorporating document-level context is helpful in improving OpenIE performance. Both the DocOIE dataset and the DocIE model are released for public use.


Introduction
Open Information Extraction (OpenIE) has been a critical NLP task, as it can extract structured relational tuples (subject, relation, object) from unstructured text. An OpenIE system is fully domain-independent and does not need input from users. It is also highly scalable, allowing fast querying mechanisms (Yates et al., 2007). Therefore, OpenIE has been successfully applied to a variety of downstream NLP tasks, such as knowledge base population (Martinez-Rodriguez et al., 2018; Gashteovski et al., 2020), question answering (Khot et al., 2017), and summarization (Fan et al., 2019).
Current OpenIE methods mainly focus on extracting tuples at the sentence level. However, in many NLP scenarios, sentences exist as part of a document rather than standalone. Given a document corpus, if we simply apply existing sentence-level OpenIE models to extract tuples, we could miss useful and critical document-level contextual information, leading to unsatisfactory results. We use the two example sentences in Fig. 1 to illustrate two types of ambiguities.

[Figure 1: Two example sentences with candidate extractions and document contexts.
Sentence 1 (Pat. No. 8495167): "Data transfers to a single target terminal using the invention might not be significantly faster than conventional download methods."
Candidate tuples: (data; transfers to; a single target terminal); (data transfers to a single target terminal; use; the invention)
Context S1: "Data security is improved as compared with transferring plain text and data transfer requires less time."
Context S2: "If a new terminal is registered to the main server during the transfer it will be included in the next data transfer."
Context S3: "Examples of data transfers will be described with reference to a preferred embodiment of a network."
Sentence 2 (Pat. No. 8160027): "Node-B can be a device a cellular base station having beam-forming antennas that serves various sectors of a cell."
Candidate tuples: (node-B; can be; a device a cellular base station); (node-B; can be; a device); (a device; is such as; a cellular base station)
Context S4: "A Node-B can be a device, such as, a cellular base station that serves an entire cell."]

[1] https://github.com/daviddongkc/DocOIE
Part-of-speech Ambiguity. The word "transfers" can be a verb or a noun. Accordingly, two different tuples could be extracted from the first example sentence, as listed in Fig. 1. This ambiguity can be resolved by the main verb of the sentence, "might not be", which is far away from "transfers" and is thus not considered by many existing OpenIE systems. However, context sentences S1, S2, and S3 in the document suggest that "data transfers" should be treated as a noun phrase throughout this document.

Syntactic Ambiguity. The second example sentence gives no explicit clue about the relationship between "a device" and "a cellular base station". Thus, existing OpenIE systems often fail to split the two terms and incorrectly extract the first tuple (node-B; can be; a device a cellular base station). However, context sentence S4 includes an explicit cue to the relationship between the two terms and may thus help split them.
To minimize the aforementioned ambiguities, it is clear that we should leverage document-level context. However, all existing OpenIE datasets are generated or annotated at the sentence level. These datasets include standalone sentences but not their context sentences. Hence, they are not suitable for evaluating context-aware tuple extraction.
We annotate the first Document-level context-aware Open Information Extraction (DocOIE) dataset. DocOIE consists of 800 expert-annotated sentences from 80 documents; 10 sentences are randomly sampled for annotation from each of the 80 documents. To the best of our knowledge, among all OpenIE datasets to date, DocOIE contains the largest number of expert-annotated sentences. More importantly, DocOIE provides document-level contexts, enabling OpenIE models to draw on relevant context for accurate tuple extraction.
Furthermore, to show that document-level context is useful for the OpenIE task, we develop the Document-level context-aware Open Information Extraction (DocIE) model. DocIE encodes a source sentence together with its contextual information using pre-trained BERT (Devlin et al., 2018). Because contextual sentences can be much longer than the source sentence, the syntactic/semantic information of the source sentence might be dominated by that of the contexts. Our proposed DocIE model therefore differentiates the source sentence from its contexts with segment tags, and adds additional transformer encoder layers for the source sentence only. In summary, our contributions are threefold:
• We propose a new task in OpenIE to extract relational tuples with document-level contexts.
• We contribute DocOIE, the first document-level context-aware OpenIE dataset, consisting of 800 expert-annotated sentences from 80 documents.
• We propose DocIE, a novel OpenIE system that can leverage document-level contexts for relational tuple extraction.

Related Work
OpenIE Datasets. Since the OpenIE task was introduced by Yates et al. (2007), earlier systems were mainly evaluated on a small number of sentences, without a standardized evaluation procedure. OIE2016 (Stanovsky and Dagan, 2016) is the first large-scale dataset constructed for OpenIE tasks and comes with a standard scoring framework. In OIE2016, the gold tuples are automatically generated from a QA-SRL dataset (He et al., 2015) according to human-crafted rules. Wire57 (Lechelle et al., 2019) improves the scorer and manually annotates 57 sentences as a benchmark dataset. Considering that the OIE2016 dataset is noisy, Bhardwaj et al. (2019) provide a crowdsourced dataset named CaRB. CaRB also has 50 expert-annotated sentences and a sophisticated scoring framework. As summarized in Table 1, the number of expert-annotated sentences in these datasets remains small. Furthermore, the sentences in these datasets do not come with contextual information. In contrast, our DocOIE dataset consists of 800 expert-annotated sentences and comes with the source documents for accurate sentence interpretation.
Recently, neural OpenIE systems have been developed and have shown promising results (Cui et al., 2018; Zhan and Zhao, 2020; Kolluru et al., 2020a,b). Different from traditional models, neural OpenIE models extract tuples in an end-to-end manner, without requiring prior syntactic or semantic analysis. In principle, traditional rule-based or statistical OpenIE models do not need training. However, neural OpenIE models need a large number of training samples to learn extraction patterns. For instance, IMOJIE (Kolluru et al., 2020b) uses about 100,000 sentences for model training. It is unrealistic and expensive to manually annotate 100,000 sentences simply for training purposes. Therefore, a common practice in learning a neural OpenIE model is to use tuples automatically extracted by traditional systems as training data, i.e., a bootstrapping strategy. We consider these imperfect training labels generated via bootstrapping as pseudo labels. The pseudo labels used in (Cui et al., 2018) are generated by OpenIE4, and those in (Kolluru et al., 2020b) come from multiple OpenIE systems.
To the best of our knowledge, no existing OpenIE models consider document-level contexts in tuple extraction. Nonetheless, our neural model unavoidably requires pseudo labels bootstrapped from traditional models for training. To ensure reproducibility, as part of the DocOIE dataset, we also release the documents that are used for generating the pseudo labels.

DocOIE Dataset
We now present our Document-level context-aware Open Information Extraction (DocOIE) dataset. We first introduce the data selection and collection process, and then the annotation process by two experts. We also explain our annotation consistency measurement, which indicates the high level of annotation consistency in DocOIE. In summary, DocOIE consists of two datasets: an evaluation dataset and a training dataset.
The evaluation dataset contains 800 expert-annotated sentences, sampled from 80 documents in two domains (healthcare and transportation). Specifically, 10 sentences are sampled for annotation from each of the 40 documents in each domain. In total, 2,122 relational tuples are annotated in the 800 sampled sentences (refer to Table 2 for detailed statistics).
The training dataset contains 2,400 documents from the two domains (healthcare and transportation), with 1,200 documents in each domain. All sentences from these documents are used to bootstrap pseudo labels for neural model training.

Dataset Collection
OpenIE, by definition, extracts relational tuples in the open domain. Ideally, sentences/documents in the DocOIE dataset should not be restricted to any particular document type or topical domain. However, it is challenging to include all types of documents and annotate them. In fact, all existing annotations are restricted to specific types of documents like news and Wikipedia articles.

Patent Document Collection

After taking all factors into consideration, we choose to collect patent documents from PatFT. Each patent document elaborates one specific invention in reasonable length, providing sufficient context to annotators. Patent documents are informative by nature and contain rich syntactic structures. Through the PatFT search engine, patent documents can be retrieved by keywords. We have two considerations for keyword selection: (i) Magnitude: as part of DocOIE, a large number of documents shall be available for training neural OpenIE models; hence, the keywords shall lead to a sufficient number of patent documents. (ii) Diversity: the collected patent documents are expected to be diversified in inventors, organizations, filing dates, etc., to avoid fixed patterns and hence ensure the diversity of our dataset.

Document Type Selection
As a result, we choose three broad and non-technical keywords, "healthcare", "traffic", and "transportation", to collect documents in two broad domains: healthcare and transportation. As reported in Table 3, 42,514 and 32,256 documents are collected in healthcare and transportation, respectively. Documents in each domain are contributed by more than 40,000 inventors from over 8,000 cities, and the filing dates span several decades.
We clean these documents by removing non-textual components. Then, the shortest 10% and longest 10% of documents by length (in number of words) are removed, to avoid extremely short or long documents in our dataset. The remaining documents form the corpus from which we sample (i) documents for annotation and (ii) training documents for bootstrapping pseudo labels.
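The two-sided length trimming described above can be sketched as follows; the function name and interface are illustrative assumptions, not taken from the paper's released code.

```python
def filter_by_length(documents, lower_pct=0.10, upper_pct=0.10):
    """Drop the shortest and longest documents by word count.

    Keeps the middle (1 - lower_pct - upper_pct) portion of the corpus,
    mirroring the 10%/10% trimming used to build the DocOIE corpus.
    Note: the returned documents are ordered by length, not corpus order.
    """
    ranked = sorted(documents, key=lambda d: len(d.split()))
    lo = int(len(ranked) * lower_pct)
    hi = len(ranked) - int(len(ranked) * upper_pct)
    return ranked[lo:hi]
```

For a corpus of 10 documents, this keeps the 8 middle-length ones, dropping one document from each extreme.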

DocOIE Evaluation Dataset Selection
To ensure annotation quality and consistency, we follow an expert annotation scheme instead of the crowdsourcing approach adopted in CaRB (Bhardwaj et al., 2019). As discussed in Section 1, sentences exist as part of a document rather than standalone. To gain an accurate interpretation of a sentence, the annotator needs to read a few surrounding sentences, or even the entire document, for relevant context. Hence, the choices of labelling one sentence or multiple sentences per document incur different costs.
To cover a reasonable number of documents while balancing the annotation workload, we choose to randomly sample 10 sentences per document from 80 documents for annotation. Recall that the average number of sentences per document is 101.78 (refer to Table 2). The 10 sentences annotated in a document can be used to evaluate context-aware OpenIE at 10 different positions in that document. In this way, our annotation covers 80 documents with considerable diversity.
In summary, we randomly selected 80 documents (40 in each domain) from the documents collected in Section 3.1. Then we randomly selected 10 sentences from each document. These 80 documents, along with 800 expert-annotated sentences form the DocOIE evaluation dataset.
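The two-stage sampling above can be sketched as below; the corpus representation (a list of (document id, sentence list) pairs) and the fixed random seed are our own assumptions for illustration.

```python
import random

def sample_for_annotation(corpus, n_docs=40, n_sents=10, seed=13):
    """Randomly select n_docs documents, then n_sents sentences from each,
    as in the construction of the DocOIE evaluation set for one domain.

    `corpus` is a list of (doc_id, sentences) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    picked = rng.sample(corpus, n_docs)
    return {doc_id: rng.sample(sents, n_sents) for doc_id, sents in picked}
```

Running this once per domain with n_docs=40 yields the 80 documents and 800 sentences of the evaluation set.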

Annotation Consistency Measurement
The annotation was performed by two OpenIE experts (both authors of this paper) with reference to existing annotation processes (Stanovsky and Dagan, 2016; Bhardwaj et al., 2019). The dataset was annotated in three stages.
In the first stage, the two annotators independently practiced annotation on 100 of the 800 sentences. They then cross-validated the annotation results, discussed them to resolve disagreements, and updated the annotation policy.
In the second stage, the two experts independently annotated another 100 sentences from the remaining 700 sentences. These two sets of annotations are used for measuring annotation consistency. Because it is not straightforward to evaluate annotation agreement by measures like the Kappa coefficient, we adopted the evaluation scorer proposed by CaRB (Bhardwaj et al., 2019). The scorer performs matching at the tuple level instead of the lexical level. Specifically, we score one expert's annotations by treating the other's annotations as ground truth. Among the tuple matching strategies in CaRB, we used the default binary lenient tuple matching to estimate the consistency between the two annotators. As reported in Table 4, the two annotators reach a high level of agreement, with an average F1 of 89.9%. Given this high annotation consistency, in the third stage, each expert annotated 300 sentences from the remaining 600 sentences. The annotations were then validated by the other expert, and annotation disagreements were resolved through discussion.
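As a rough illustration of scoring one annotator against the other, the sketch below matches tuples by exact slot equality. This is a deliberately simplified stand-in for CaRB's lenient matching, which also credits partial token overlap between tuple slots.

```python
def agreement_f1(ann_a, ann_b):
    """Score annotator A's tuples against annotator B's as ground truth.

    Simplified stand-in for the CaRB scorer: matching is at tuple level,
    and a tuple counts as matched only when all three slots are identical
    after lowercasing (CaRB's lenient matching is more forgiving).
    """
    norm = lambda tups: {tuple(slot.lower() for slot in t) for t in tups}
    a, b = norm(ann_a), norm(ann_b)
    matched = len(a & b)
    prec = matched / len(a) if a else 0.0
    rec = matched / len(b) if b else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Averaging this score over both directions (A against B, and B against A) gives a symmetric consistency estimate.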

DocOIE Training Dataset
Besides the 80 documents for expert annotation, we further sample 2,400 documents randomly (1,200 in each domain) from the documents collected in Section 3.1 to create the DocOIE training dataset. The 1,200 documents in each domain contain around 120,000 sentences, which is sufficient for the pseudo label generation required by neural OpenIE models.

Pseudo Label by Bootstrapping
Following common practice (Kolluru et al., 2020b; Cui et al., 2018), we generate pseudo labels by bootstrapping with traditional OpenIE models. Before running these models on the DocOIE training dataset, we evaluate their performance on the DocOIE evaluation dataset, to select the models that can generate pseudo labels of better quality.
We evaluate the models using the CaRB scorer (Bhardwaj et al., 2019). Table 5 reports the performance of five independent OpenIE models: Reverb (Fader et al., 2011), Clausie (Del Corro and Gemulla, 2013), Stanford OpenIE (Angeli et al., 2015), OpenIE4 (Mausam, 2016), and OpenIE5. In addition to these five models, we also evaluate two combinations of Reverb and OpenIE4. With Rev+Oie4, Reverb is the main system, and if Reverb fails to extract any tuples from a sentence, we complement the extraction using OpenIE4. Similarly, Oie4+Rev uses OpenIE4 as the main system, with extractions complemented by Reverb.
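The combination scheme just described amounts to a simple per-sentence fallback, sketched below; the extractor callables are hypothetical stand-ins for the real Reverb and OpenIE4 systems.

```python
def combine_extractors(sentence, primary, fallback):
    """Per-sentence fallback combination (e.g., Rev+Oie4): keep the primary
    system's tuples, and only when it extracts nothing from the sentence,
    use the fallback system's extraction instead."""
    tuples = primary(sentence)
    return tuples if tuples else fallback(sentence)
```

Swapping the roles of the two extractors gives the Oie4+Rev variant.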
All the evaluated models show consistent performance across both domains. By F1 score, Reverb and OpenIE4 are the two best performing individual models, and their combinations lead to the best and second best F1 scores in both domains. Accordingly, applying OpenIE4, Reverb, and their combinations, the numbers of sentences and tuples extracted from the DocOIE training dataset are reported in Table 6. Note that a sentence is not counted if no tuples are extracted from it, which leads to the different sentence counts.

DocIE Model
In this section, we present the proposed Document-level context-aware Open Information Extraction model, named DocIE. As shown in Fig. 2, DocIE mainly consists of two parts: a source-context encoder and an encoder-decoder.

Document-level Context
Formally, we denote a document as D = {s_1, s_2, ..., s_N}, consisting of N sentences. The source sentence s_i is the input sentence from which relational tuples are extracted. Given source sentence s_i, we regard its surrounding sentences c_i = {s_{i-t}, ..., s_{i-1}, s_{i+1}, ..., s_{i+t}} as contextual sentences, where t is the context window size. The larger t is, the more document-level context c_i covers.
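Under these definitions, gathering c_i reduces to slicing the document around position i, clipping the window at the document boundaries; a minimal sketch:

```python
def context_window(document, i, t):
    """Return the context c_i for source sentence s_i = document[i]:
    up to t sentences on each side, excluding s_i itself.
    The window is clipped at the document boundaries."""
    left = document[max(0, i - t):i]
    right = document[i + 1:i + 1 + t]
    return left + right
```

Sentences near the beginning or end of a document simply receive a smaller (one-sided) context.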

Source-Context Encoder
The source-context encoder is inspired by a recent work (Ma et al., 2020) that adopts a Flat-Transformer to incorporate context into the source sentence for machine translation. In DocIE, our encoder consists of (i) bottom blocks, which take the concatenation of the source sentence and context sentences as input, and (ii) top blocks, which take only the representation of the source sentence from the bottom blocks as input.
In our implementation, we use BERT (Devlin et al., 2018) as the bottom blocks to perform semantic interactions between source sentence s and context c. We first project both s and c into the embedding space by summing their word embeddings and segment embeddings, i.e., e_s = E(s) + S(s) and e_c = E(c) + S(c). Here, E is the trainable word embedding matrix, and S is the trainable segment embedding matrix. The segment embeddings distinguish words in the source sentence from words in the context sentences; they are initialized to 0 and 1 for words in source and context sentences, respectively. We then concatenate e_s with e_c as [e_s; e_c], the input to the source-context encoder.
BERT, with multiple layers of transformers, merges source sentence information with its contextual information. We use the last hidden state h[s; c] of BERT as the representation of the two concatenated input sequences. On top of the BERT blocks, we add transformer layers as top blocks (Vaswani et al., 2017) to prepare the source sentence representation for the following encoder-decoder. The source sentence representation h[s] is obtained by truncating the context sentence representation h[c] from h[s; c]. Therefore, only the former (the source sentence representation) is fed into the top blocks.

Encoder-Decoder

The encoder-decoder generation module follows CopyAttention (Cui et al., 2018), which casts the OpenIE task as a sequence-to-sequence generation task with a copying mechanism. The encoder-decoder framework represents a variable-length input sequence in the encoder and uses it in the decoder to generate the output sequence. In our encoder-decoder framework, an attention mechanism (Bahdanau et al., 2014) is used to align the encoder hidden states with the decoder hidden states, jointly maximizing the log probability of output tuples conditioned on the input sentence. Meanwhile, since tuple arguments and relations are normally sub-spans of the input sentence, an additional copying mechanism (Gu et al., 2016) is applied. It helps copy words directly from the input sentence to the output tuples.
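The input construction and truncation steps of the source-context encoder can be sketched in NumPy as below. The random matrices E and S are toy stand-ins for BERT's trainable word and segment embeddings, and we skip the transformer layers themselves, showing only how the concatenated input [e_s; e_c] is built and how the source positions are kept for the top blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100, 16               # toy sizes; BERT uses dim 768
E = rng.normal(size=(vocab_size, dim))  # word embedding matrix (stand-in)
S = rng.normal(size=(2, dim))           # segment embeddings: 0 = source, 1 = context

def encode_input(src_ids, ctx_ids):
    """Build [e_s; e_c] with e_s = E(s) + S(0) and e_c = E(c) + S(1)."""
    e_s = E[src_ids] + S[0]
    e_c = E[ctx_ids] + S[1]
    return np.concatenate([e_s, e_c], axis=0)

def truncate_to_source(hidden, src_len):
    """Keep only the source positions h[s] from h[s; c]; in DocIE, this
    truncated representation is what the top transformer blocks consume."""
    return hidden[:src_len]
```

In the real model, `encode_input`'s output passes through BERT before truncation; the slicing itself is unchanged because the source sentence always occupies the leading positions.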

Experiments
We evaluate DocIE and compare its results with two baseline neural OpenIE models: CopyAttention+BERT and IMOJIE (Kolluru et al., 2020b). Kolluru et al. (2020b) report that CopyAttention+BERT is a strong baseline. Meanwhile, DocIE adopts CopyAttention (Cui et al., 2018) as its encoder-decoder module. Hence, CopyAttention+BERT can be considered the base model, to which DocIE adds context modelling.

Neural Baseline Models
We first evaluate the two neural baseline models trained with the pseudo labels listed in Table 6. The evaluation is conducted on the DocOIE evaluation dataset with the CaRB scorer. As reported in Table 7, CopyAttention+BERT outperforms IMOJIE in most settings by both measures, AUC and F1. In general, for both models, pseudo labels by Rev+Oie4 (and also Reverb) lead to better results in the healthcare domain, while pseudo labels by Oie4+Rev (and also OpenIE4) give better results in the transportation domain. During our annotation of the 800 sentences, we observed that sentences in the transportation domain tend to contain slightly more conjunctions (e.g., "and" and "or") and thus have more coordinating structures than those in healthcare. The OpenIE4 system generally extracts more tuples than Reverb (refer to Table 6) and provides higher recall. Therefore, extractions in the transportation domain, with more conjunctions, may better match the tuples extracted by OpenIE4.
Based on this set of results, in the following experiments we use pseudo labels by Rev+Oie4 for the healthcare domain, and pseudo labels by Oie4+Rev for the transportation domain.

DocIE Against Baselines
In this section, we evaluate DocIE against sentence-level OpenIE systems. For clarity, we refer to DocIE without the top transformer layers as "DocIE w/o transformer" and to the full model as "DocIE w transformer". The context window size of DocIE is set to 5 for the healthcare domain and 4 for the transportation domain. Observe that DocIE w transformer achieves the best AUC and F1 in both domains. Its variant, DocIE w/o transformer, is the second best performer and outperforms all sentence-level models.
The experimental results suggest that incorporating document-level context is helpful in improving OpenIE. On the other hand, we note that DocIE is trained on pseudo labels produced by traditional OpenIE models, which do not consider document-level context. The potential of utilizing document-level context is yet to be fully realized.

Impact of Context Window Size
The window size determines the number of context sentences to be considered. We evaluate window sizes from 1 to 6 and plot F1 scores against window size in Fig. 3. Observe that the optimal window size is 5 for the healthcare domain and 4 for the transportation domain. F1 scores improve as the context window grows, up to a size of 4 or 5. In general, 8 to 10 surrounding sentences (window size 4 or 5) provide sufficient context for sentence understanding. A small window might not provide sufficient context, while a large window might introduce noise and dominate source representation learning.

Case Study
We use the two example sentences shown in Fig. 1 as a case study to illustrate the differences between DocIE and the sentence-level neural OpenIE baselines, CopyAttention+BERT and IMOJIE. For Sentence 1, CopyAttention+BERT incorrectly recognizes the word "transfers" as a verb, thus extracting an incorrect tuple (data; transfers to; a single target terminal). IMOJIE completely misses the key phrase "data transfers" and extracts an incorrect tuple (a single target terminal; using; the invention). Only DocIE manages to extract the correct tuple (data transfers to a single target terminal; using; the invention).
For Sentence 2, there is no explicit clue about the relationship between "a device" and "a cellular base station". Both CopyAttention+BERT and IMOJIE treat "a device a cellular base station" as a whole and mistakenly generate the tuple (Node-B; can be; a device a cellular base station having beam-forming antennas). In contrast, DocIE successfully splits "a device a cellular base station" by referring to the surrounding context and extracts the correct tuple (Node-B; can be; a device). However, DocIE fails to infer the relationship between "a device" and "a cellular base station". Accordingly, another correct tuple, (a device; is such as; a cellular base station), is not extracted.
Results of the two example sentences show the improvements made by DocIE after leveraging contextual information for tuple extraction.

Error Analysis
Similar to the error analysis in (Kolluru et al., 2020b), we examine tuples extracted by DocIE from 50 randomly selected sentences in DocOIE, and identify the following major error types. (i) Incompleteness: in 28% of the sentences, DocIE fails to cover at least one key phrase in either the arguments or the relation; missing key phrases result in incomplete information extraction. (ii) Incorrect boundary: 27% of the extractions misinterpret the syntactic meaning of the sentence, leading to incorrect boundaries of arguments and relation. (iii) Redundant extractions: 15% of the sentences contain redundant extractions; that is, the same relational fact is extracted multiple times from a sentence or phrase. (iv) Grammatical errors: 13% of the extractions are not grammatically correct; most grammatical errors arise from incorrect verb forms in the tuple relation.

Implementation
We implement DocIE using the AllenNLP framework in PyTorch 1.4. Pre-trained BERT is fine-tuned at a learning rate of 2×10^-5 to obtain contextualized word embeddings. The learning rate for the other modules is set to 1×10^-4. The input dimension, projection dimension, feedforward hidden dimension, number of layers, and number of attention heads of the top transformer encoder are set to 768, 256, 3072, 2, and 8, respectively. The hidden dimension and word embedding dimension of the LSTM decoder are set to 256 and 100, respectively.

Conclusion
In this research, we propose to consider document-level contextual information for the OpenIE task. We contribute DocOIE, the first document-level context-aware OpenIE dataset, consisting of 800 expert-annotated sentences from 80 documents. The documents are carefully selected, and the annotations are completed by experts with a high level of annotation consistency.
With the help of DocOIE, we evaluate neural OpenIE models and demonstrate, through DocIE, that incorporating document-level context is helpful in improving OpenIE performance. As a baseline for document-level context-aware OpenIE, DocIE achieves promising results compared with all sentence-level OpenIE models. Our future work follows two main directions: one is to research more effective context-aware OpenIE models, and the other is to investigate the possibility of not relying on pseudo labels.