Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

Robustness to distribution shifts is essential for NLP models to be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique that consists of sentences with structured knowledge of the same meaning but with different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate across all cliques. We perform experiments on typical models published in the last decade as well as a popular large language model; the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 in F1 score. Our resources and code are available at https://github.com/qijimrc/ROBUST.


Introduction
Open Information Extraction (OpenIE) aims to extract n-ary knowledge tuples {(a1, p, a2, ..., an)} consisting of n arguments and one predicate from natural text in a domain-independent manner, and has served as a backbone for NLP applications for many years (Liu et al., 2021; Pei et al., 2022; Chen et al., 2021).
Due to its structural flexibility, the evaluation of OpenIE is a nontrivial problem, which in turn drives the advancement of the task. Early studies (Stanovsky and Dagan, 2016; Zhan and Zhao, 2020) evaluate extractions based on the lexical matching of syntactic heads between elements. To tackle this overly lenient metric, subsequent approaches (Lechelle et al., 2019; Bhardwaj et al., 2019; Gashteovski et al., 2022) propose exact matching between tokens for more delicate evaluation. Among these benchmarks, CaRB (Bhardwaj et al., 2019) adopts an all-pair matching table to compute tuple match scores between extractions, and has come to be considered the de facto standard for evaluation. These efforts have been devoted to evaluating the pairwise matching correctness between model extractions and gold facts on a single sentence. However, conventional evaluation benchmarks do not measure the robustness of models in the realistic open-world scenario, where syntactic and expressive forms may vary under the same knowledge meaning (Qi et al., 2023). As shown in Figure 1, while the three sentences s1, s2, s3 contain the same structured knowledge (a1, p, a2, a3), the state-of-the-art model OpenIE6 successfully extracts facts (in green) from sentence s1, but fails to predict the arguments (in red) on the other sentences due to syntactic and expressive drifts. In this example, sentence s1 comes from CaRB, which has a syntactic distribution similar to the training set, and existing benchmarks can only evaluate models on this
limited target, attributing to it commendable scores (46.4/33.3), rather than evaluating on other real-world samples. For accurate and faithful evaluation, we should measure the performance of models on sentences with various syntactic and expressive distributions under the same knowledge meaning (Zhong et al., 2022). Nevertheless, it is not trivial to construct a benchmark that satisfies the aforementioned conditions of encompassing both knowledge invariance and distributional shift. First, manually annotating parallel texts that maintain the same knowledge meaning with different syntactic and expressive forms may result in paraphrases that are either too trivial or too artificial. Second, it is difficult to design a metric that measures robustness while remaining compatible with existing benchmarks (e.g., (Bhardwaj et al., 2019; Gashteovski et al., 2022)) to ensure comparability.
On the other hand, natural language paraphrasing is defined as producing sentences with different surface forms (syntactic and lexical) while conveying the same semantic meaning (Zhou and Bhat, 2021). Going beyond the pairwise correctness comparison, can we evaluate the robustness of models based on reliable paraphrases equipped with syntactic and expressive transformations?
In this paper, we introduce ROBUST, a Robust OpenIE Benchmark with Ubiquitous Syntactic Transformations, aiming to evaluate the robustness of OpenIE models. ROBUST is a large-scale human-annotated benchmark consisting of 1,272 robustness testing cliques, where each clique contains sentences with different syntactic and expressive variations while conveying the same underlying knowledge meaning, for a total of 4,971 sentences and 16,191 knowledge extractions. To obtain each clique, we first adopt a syntactically controllable paraphraser with diversified syntactic sampling and expressive filtering strategies to generate paraphrases for each sentence in CaRB. We then design a two-stage annotation pipeline in which human experts perform sentence correction and knowledge extraction for each individual paraphrase in the cliques. This data paradigm enables evaluation to go beyond pairwise matching to clique-wise comparisons. Building on the testbed structure, we calculate the robustness scores with respect to the worst performance within a clique and further analyze the performance variances across all cliques. This metric fairly reflects the robustness of models to distributional drifts and is also compatible with existing benchmarks, since it is computed at the granularity of a single sentence.
To explore the robustness of existing models, we implement typical OpenIE systems published in the past decade. The experimental results show a dramatic degradation in model performance on ROBUST, with an average drop of 18 percentage points in F1 score, indicating that the robustness of existing successful models is far from satisfactory. We then further analyze the correlation between the variances of model performance and the divergences of syntactic distances within cliques. We find that the variance grows as the syntactic distance increases, and the fact that models behave with similar variance on most cliques also demonstrates the internal consistency of our benchmark. In addition, we evaluate a representative large language model, ChatGPT1, on OpenIE. Experimental results demonstrate that ChatGPT achieves remarkable performance comparable to the state-of-the-art model on CaRB (F1 score of 0.516 under the 10-shot setting), yet it still exhibits the robustness issue on ROBUST (F1 score of 0.275 under the 10-shot setting).

The ROBUST Benchmark
In this section, we describe the details of the benchmark construction. The benchmark consists of cliques built from syntactically diverse paraphrase generation and human annotation to ensure knowledge invariance and distributional shift, where both the syntactic transformations sampled from the real world and human experience guarantee naturalness. We also provide details of annotations and strategies in Appendix A.1 and A.2.

Data Preparation
Paraphrase Generation. Considering compatibility with previous benchmarks, we build our benchmark on CaRB (Bhardwaj et al., 2019), which contains 1,272 sentences2 of the general domain originating from OIE2016 (Stanovsky and Dagan, 2016) with high-quality n-tuple annotations. To build sufficient paraphrases, we adopt AESOP (Sun et al., 2021), a syntactically controllable paraphrasing model that generates paraphrases by specifying pruned target syntactic trees, which can be sampled diversely. The model used in our work is trained on parallel annotated data with two-level target syntactic trees. During generation, we first collect a set of constituency parse pairs {(T_i^Ps, T_i^Pt)} pruned at height 3 from ParaNMT-50M (Wieting and Gimpel, 2018). Then, for each sentence s with its constituency parse tree T, we obtain the 2 most similar source parses {T'_i^Ps} by calculating weighted ROUGE scores between parse strings, and select the 5 top-ranked target parses from {T_i^Pt} for each T'_i^Ps by sampling from the similarity-weighted distribution. We thus generate 10 syntactically varying paraphrases for each sentence.
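As an illustration, the parse-matching step can be sketched as follows. This is a simplified sketch: it uses plain (unweighted) ROUGE-1 over pruned parse strings and deterministic top-k selection in place of the weighted ROUGE and distribution-based sampling described above, and `parse_pairs` is a hypothetical mapping from source parse strings to ranked target parse strings.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1 between two token sequences (here, parse-node labels)."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def select_parse_pairs(sentence_parse, parse_pairs, n_src=2, n_tgt=5):
    """Pick the n_src source parses most similar to the sentence's pruned
    parse, then keep the n_tgt top-ranked target parses for each of them."""
    scored = sorted(parse_pairs.keys(),
                    key=lambda src: rouge1_f(src.split(), sentence_parse.split()),
                    reverse=True)
    selected = []
    for src in scored[:n_src]:
        # simplification: take the top-ranked targets; the actual pipeline
        # samples from a similarity-weighted distribution instead
        selected.extend(parse_pairs[src][:n_tgt])
    return selected
```

With n_src=2 and n_tgt=5, this yields the 10 target parse templates that guide paraphrase generation for one sentence.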
Diversified Expressive Filtering. Though different syntactic trees are specified during paraphrase generation, we find that similar expressions still remain among the generated sentences. Therefore, we further filter the paraphrases with a heuristic search strategy to retain the most diverse ones. For each clique, composed of multiple sentence nodes (an original sentence and its paraphrases), we first calculate the BLEU scores (Papineni et al., 2002) between all pairs of nodes. We then repeat the following simple strategy on paraphrase nodes until reaching the maximum acceptable number, eliminating homogeneity: (1) find the pair of nodes with the largest score in the current clique; (2) remove a node if its length is less than 2/3 of the original sentence, otherwise remove the node with the highest sum of scores against all other nodes. As depicted in Figure 1, the remaining sentences s2 and s3 exhibit syntactic structures and expressive forms distinct from the original sentence s1. The detailed process with an example is shown in Appendix A.2.2.
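The filtering loop can be sketched as below. Here the `overlap` similarity is a stand-in for sentence-level BLEU, and `max_keep` is a hypothetical cap on clique size; the removal rules follow steps (1) and (2) above.

```python
def overlap(x, y):
    """Token-overlap Jaccard similarity; a stand-in for sentence BLEU."""
    xs, ys = set(x.split()), set(y.split())
    return len(xs & ys) / max(len(xs | ys), 1)

def diversify(original, paraphrases, sim, max_keep=4):
    """Heuristic diversity filtering: repeatedly remove the most redundant
    paraphrase node until at most max_keep remain."""
    nodes = list(paraphrases)
    while len(nodes) > max_keep:
        # (1) the most similar (most redundant) pair among current nodes
        a, b = max(((x, y) for i, x in enumerate(nodes) for y in nodes[i+1:]),
                   key=lambda p: sim(p[0], p[1]))
        # (2) drop a too-short node; otherwise drop the node most similar
        # to everything else in the clique
        short = [n for n in (a, b) if len(n.split()) < 2 / 3 * len(original.split())]
        if short:
            nodes.remove(short[0])
        else:
            totals = {n: sum(sim(n, m) for m in nodes + [original] if m != n)
                      for n in (a, b)}
            nodes.remove(max(totals, key=totals.get))
    return nodes
```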

Annotation
For each paraphrase within a clique, we design a two-stage annotation pipeline in which human experts perform sentence correction and structured knowledge extraction. All annotators undergo training with tutorials and pass a final examination, and our batch-wise sampling validation ensures an annotation accuracy of over 90%. Details of the annotation, including annotators, platforms, and quality checking, can be found in Appendix A.1.
Paraphrase Annotation. While automatically generated paraphrases present syntactic and expressive variants, the correctness of the sentences cannot be fully guaranteed. To ensure the quality of the sentences, we perform a thorough paraphrase annotation with three types of corrections: • Grammar Correcting: Correct grammatical mistakes in sentences to ensure fluency.
• Phrase Replacing: Replace the incorrect phrases in sentences to ensure the correctness.
• Sentence Rewriting: Rewrite the entire sentence if it has a semantic difference from the original sentence.
All operations are required to preserve both the distinctiveness of the annotation from the original sentence and their semantic equivalence. Under this paradigm, all paraphrases are guaranteed to differ from the original sentence in expression while retaining the same semantic meaning. As shown in Figure 2, the three sentences in the first column exhibit different syntactic and expressive forms. A detailed process is available in Appendix A.1.1.
Knowledge Annotation. In the second stage, human experts annotate n-ary knowledge tuples on the paraphrases corrected in the first stage. We design a guideline with an iterative process to instruct annotators to extract all possible facts from a sentence. Following the annotation scheme of CaRB, in each iteration we divide the annotation task into three steps: (1) recognizing the predicate, (2) finding the arguments for that predicate, and (3) optionally obtaining the time and location arguments for the tuple if present.
In particular, we distribute each complete clique to an individual annotator to obtain extractions with the same structured knowledge meaning. This annotation process preserves the characteristics of CaRB (i.e., Completeness, Assertedness, Informativeness, and Atomicity) while maintaining consistency with the underlying knowledge. As illustrated in the fourth column of Figure 2, the extractions from different sentences correspond to the same underlying knowledge. The detailed annotation process is available in Appendix A.1.2.

Data Analysis
To understand the general characteristics of ROBUST, we provide quantitative statistics at different granularities in comparison to previous benchmarks. In contrast to traditional analyses of words and sentences, we further investigate syntactic phenomena on cliques to explain the robustness evaluation.

Data Statistics
Table 1 shows the quantitative statistics of ROBUST and representative OpenIE benchmarks, including OIE2016 (Stanovsky and Dagan, 2016), Re-OIE2016 (Zhan and Zhao, 2020), CaRB (Bhardwaj et al., 2019), and BenchIE (Gashteovski et al., 2022). Compared with conventional datasets, ROBUST provides the largest number of human-annotated high-quality sentences. Meanwhile, based on the annotation paradigm, ROBUST introduces a new data structure, the clique, which establishes the interconnection of sentences through their underlying knowledge. The average number of sentences per clique is 3.877.
In addition, we find that previous benchmarks originate entirely from OIE2016, based on Wiki and newswire text, potentially leading to a distribution bias toward similar training corpora, especially for pre-trained language models (e.g., BERT (Devlin et al., 2019)) trained on general corpora. ROBUST mitigates this bias by extending syntactic and expressive distributions to realistic scenarios. We further compute the vocabulary sizes of CaRB and ROBUST, resulting in 7,648 and 7,981 respectively, demonstrating that our natural annotations do not introduce many rare words.

Syntactic Analysis
The proposed benchmark measures the robustness of models under drifts of linguistic observations. Therefore, the syntactic divergence within a clique is key to ensuring a valid robustness evaluation. We provide a thorough syntactic analysis of cliques to investigate this divergence.
Metrics of Syntactic Correlation. To analyze the syntactic divergence within cliques, we need a metric that measures the syntactic correlation between two sentences. A fast and effective algorithm is the HWS distance proposed by Qi et al. (2023), which calculates the syntactic tree distance between two sentences based on a hierarchically weighted matching strategy, where smaller weights imply a greater focus on comparing the skeletons. Its value domain is [0, 1], where 1 indicates the farthest distance. However, we find that their method may lead to an overcounting problem for repeated consecutive spans3. We revise the original algorithm to solve this problem while maintaining efficiency. The details of the revised algorithm are given in Appendix A.2.1 for ease of use.
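To convey the hierarchical-weighting idea only (this is a toy sketch, not the HWS algorithm or the revision in Appendix A.2.1), one can compare two constituency trees level by level, with geometrically shrinking weights so that the high-level skeleton dominates the score:

```python
from collections import Counter

def hws_like_distance(levels_a, levels_b, w=0.5):
    """Toy hierarchically weighted distance between two trees, each given
    as a list of per-depth label lists (depth 0 = root). Deeper levels get
    geometrically smaller weights, emphasizing the skeleton. Returns a
    value in [0, 1], where 1 is maximally distant. Illustrative only."""
    depth = max(len(levels_a), len(levels_b))
    num = den = 0.0
    for d in range(depth):
        a = Counter(levels_a[d] if d < len(levels_a) else [])
        b = Counter(levels_b[d] if d < len(levels_b) else [])
        total = sum((a | b).values())
        mismatch = 1 - sum((a & b).values()) / total if total else 0.0
        num += (w ** d) * mismatch
        den += w ** d
    return num / den if den else 0.0
```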
We additionally implement the Convolutional Tree Kernel (CTK) similarity proposed by Collins and Duffy (2001) to illustrate the syntactic phenomenon fairly. In contrast to a distance, it measures the similarity between a pair of tree structures by counting the number of tree fragments they have in common. The value domain of this algorithm is also [0, 1], where 1 means maximum similarity.
Intra-clique Syntactic Analysis. To exhaustively investigate the syntactic divergence within cliques, we calculate the average syntactic distance/similarity in each individual clique based on the algorithms described above. The result is shown in Figure 3, where the horizontal and vertical axes are the output values and the discounting weights of the two algorithms, respectively.
Overall, we observe that the values of syntactic distance and syntactic similarity are mainly scattered within [0.6, 0.9] and [0.0, 0.7], respectively, indicating that most cliques exhibit significant syntactic discrepancies. Another notable observation is that the distribution of the HWS scatter, representing distance, moves closer to 1 as the discount weight decreases, suggesting that the differences in syntactic skeletons are more significant in ROBUST.
Inter-clique Syntactic Analysis. Going beyond individual cliques, we further explore the syntactic divergence over all cliques. As shown in Figure 4, we average the mean of clique-wise syntactic distance/similarity over all cliques, based on linearly increasing discounting weights. We find that the average similarity of syntactic trees on ROBUST decreases rapidly as the discounting weight of the algorithm increases. Considering that increasing the weights implies a reduced focus on low-level tree fragments, this result suggests that ROBUST involves prominent variability in the high-level skeleton of syntactic trees.

3 For two strings s1s3s4 and s1s2s1 with the consecutive span s1 in common (e.g., SVPNP and SVPNPVP), the resulting distance may increase with the repetition of span s1.
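For reference, the convolutional tree kernel of Collins and Duffy (2001) used above can be sketched as follows, with cosine normalization mapping the similarity into [0, 1]. Representing trees as nested tuples is a simplification for illustration.

```python
# Trees as nested tuples: ("S", ("NP", ("D", "the"), ("N", "cat")), ...)
def _prod(n):
    """A node's production: its label plus its children's labels/words."""
    return (n[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in n[1:])

def _c(n1, n2, lam):
    """Count (decayed) common tree fragments rooted at n1 and n2."""
    if not (isinstance(n1, tuple) and isinstance(n2, tuple)):
        return 0.0
    if _prod(n1) != _prod(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            score *= 1 + _c(c1, c2, lam)
    return score

def _nodes(t):
    yield t
    for c in t[1:]:
        if isinstance(c, tuple):
            yield from _nodes(c)

def tree_kernel(t1, t2, lam=0.5):
    return sum(_c(a, b, lam) for a in _nodes(t1) for b in _nodes(t2))

def ctk_similarity(t1, t2, lam=0.5):
    """Normalized convolutional tree kernel in [0, 1]; 1 = identical trees."""
    k12 = tree_kernel(t1, t2, lam)
    return k12 / (tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam)) ** 0.5
```

The decay factor lam plays the role of the discounting weight discussed above: smaller values downweight large shared fragments.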

Experiments
In this section, we explore the robustness of existing successful OpenIE systems and further analyze the impact of different model architectures on robustness. We first introduce the proposed ROBUST metric, which calculates robustness performance on a clique, and then extensively evaluate six typical models from three major categories as well as the large language model ChatGPT. Furthermore, based on the clique structure, we analyze the correlation between the variances of model performance and the syntactic divergences within cliques.

Evaluation Metrics
The widely used CaRB scorer computes pairwise matching scores based on the extractions for a single sentence. Though accurate, it has limitations. We extend this scorer to cliques to calculate robustness scores.
The CaRB Metric. To evaluate the correctness of system tuples, CaRB first creates an all-pair matching table, with each column a system tuple and each row a gold tuple, and computes precision and recall scores in each cell. It then calculates the overall recall R by averaging the maximum values of all rows, and the overall precision P by averaging the one-to-one precisions between system tuples and gold tuples in order from the best match score to the worst. Finally, the overall F1 is computed from R and P.
The ROBUST Metric. An OpenIE system is considered robust if it behaves consistently on sentences with the same underlying knowledge meaning but differing syntactic and expressive variations, indicating the preservation of knowledge invariance. Therefore, we naturally calculate the robustness scores of a model on each clique.
Given a clique {s1, ..., sk} in ROBUST, we first calculate the P/R/F1 scores of the model on each sentence, and then select the scores of the sentence with the worst F1 as the ultimate robustness scores. As mentioned above, the per-sentence P/R/F1 scores are computed with the CaRB scorer.
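A minimal sketch of this clique-wise scoring, assuming per-sentence (P, R, F1) triples have already been computed by the CaRB scorer; the benchmark-level aggregation by simple averaging over cliques is an assumption for illustration.

```python
def robust_scores(per_sentence_scores):
    """Given CaRB (precision, recall, f1) triples for every sentence in a
    clique, the clique's robustness scores are the triple taken from the
    sentence where the model is weakest (lowest F1)."""
    return min(per_sentence_scores, key=lambda prf: prf[2])

def robust_benchmark(cliques_scores):
    """Average the per-clique worst-case triples over all cliques
    (illustrative aggregation)."""
    worst = [robust_scores(c) for c in cliques_scores]
    n = len(worst)
    return tuple(sum(t[i] for t in worst) / n for i in range(3))
```

Because each clique's score is still an ordinary per-sentence CaRB triple, the resulting numbers remain directly comparable to CaRB scores.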
It is noteworthy that the ROBUST evaluation metric is compatible with existing benchmarks: because scores are computed at the granularity of a single sentence, our robustness scores can be directly compared with CaRB and others.

Evaluation Models
To exhaustively evaluate the robustness of existing paradigms, we select six typical OpenIE approaches from three categories. (1) Rule-based models, which adopt linguistic patterns to identify knowledge facts, including OpenIE4 (Christensen et al., 2011), ClausIE (Del Corro and Gemulla, 2013), and OpenIE5 (Saha et al., 2017, 2018). (2) Independent NN-based models, which train neural networks from scratch with designed architectures, including RnnOIE (Stanovsky et al., 2018) and SpanOIE (Zhan and Zhao, 2020). (3) PLM-based models, which rely on a pre-trained language model, including OpenIE6 (Kolluru et al., 2020a), which introduces an iterative grid labeling architecture that treats OpenIE as a 2-D grid labeling task to produce extractions gradually based on BERT.
We also evaluate the OpenIE performance of ChatGPT. We use the Python API of the gpt-3.5-turbo version4 for all experiments. We perform few-shot experiments with manually constructed prompts and sampled demonstrations for the CaRB and ROBUST benchmarks. The prompt template is available in Appendix A.3.
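A hypothetical few-shot prompt builder in the spirit of this setup is sketched below; the instruction wording and tuple format here are illustrative, not the actual template (which is in Appendix A.3).

```python
def build_openie_prompt(demonstrations, sentence):
    """Assemble a few-shot OpenIE prompt from (sentence, tuples)
    demonstration pairs. Format is illustrative only."""
    lines = ["Extract all knowledge tuples (subject; predicate; object, ...) "
             "from the sentence."]
    for demo_sent, demo_tuples in demonstrations:
        lines.append(f"Sentence: {demo_sent}")
        lines.append("Tuples: " + " ".join(f"({'; '.join(t)})" for t in demo_tuples))
    # the target sentence ends the prompt, leaving "Tuples:" for the model
    lines.append(f"Sentence: {sentence}")
    lines.append("Tuples:")
    return "\n".join(lines)
```

The resulting string would then be sent as a single user message to the chat model, and the returned "Tuples:" line parsed back into n-ary tuples for scoring.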

Results on Typical OIE Models
We run the source code of all baselines on both CaRB and ROBUST and compute the average scores across all samples. All results are shown in Table 2. Note that although the ROBUST scores are calculated in a different setting than CaRB, the comparison remains fair due to the calculation manner. Based on the results, we can see that current successful OpenIE systems experience a considerable performance decline on ROBUST across the board. Compared with CaRB, the average degradation in precision, recall, and F1 score is 20%, 15%, and 18%, respectively. This observation suggests that the robustness of existing OpenIE models remains an open problem, as overly idealized evaluations encourage models to strictly match fixed expressions.
Comparing model architectures concretely, we find that SpanOIE demonstrates a relatively small decrease in all three scores compared to other models, indicating its robustness to syntactic transformations. This result suggests that the extraction strategy of enumerating candidate spans is, to some extent, independent of syntactic drift, making it less susceptible to transformations of sentence form.

Results on ChatGPT
We evaluate ChatGPT's OpenIE capability on CaRB and ROBUST. We randomly select 1/3/5/10 demonstrations from CaRB and prompt ChatGPT to extract knowledge tuples by incorporating these demonstrations. We exclude sentences that belong to the same clique as the demonstrations during extraction. The results show that ChatGPT exhibits impressive capability on CaRB, attaining a 51.6 F1 score in the 10-shot setting, comparable to the supervised state-of-the-art model OpenIE6. However, it still faces the robustness problem, as evidenced by a drop to a robust F1 score of 27.5 on ROBUST in the same setting.
We also investigate the impact of demonstration diversity on ChatGPT's performance. We first randomly select 100 pairs of cliques {(Ci, Cj)}, where Ci = (s_i^1, s_i^2, ...), from ROBUST. For each sentence in clique Ci, we prompt ChatGPT by specifying 1/2/3/4 demonstrations from clique Cj. We then calculate the CaRB F1 score for each sentence (shown in blue), the average CaRB F1 score over all sentences (s_i^1, s_i^2, ...) (shown in orange), and the robust F1 score over all sentences in the clique (shown in green). The results in Figure 6b show that both the correctness and the robustness of ChatGPT can be improved by providing more diversified demonstrations.

Detailed Analysis
In this section, we investigate the coherence among cliques in ROBUST, as well as the variations in model performance across different cliques.
Is the evaluation of model performance consistent across cliques? To explore the internal consistency of our data samples, it is necessary to investigate whether our evaluation of a model is consistent across the majority of cliques. Based on the main results, we calculate the F1 score variance in each clique for three representative models: RnnOIE, SpanOIE, and OpenIE6. The distribution of the number of cliques by variance is depicted in Figure 5a. We find that the majority of cliques exhibit relatively small variances, indicating a high degree of consistency among robustness cliques. In addition, we sample 11 subsets of interval 100 in ROBUST and calculate the Pearson correlation coefficient between the average robust F1 of OpenIE6 on each subset and the number of cliques in each subset. The result is −0.1480, indicating a weak correlation between these two factors.
How does syntactic divergence affect the performance of models? Benefiting from the data structure of ROBUST, we can further investigate the effect of syntactic divergence on model performance. Concretely, for each clique, we calculate the average HWS/CTK values between all pairs of sentences and the variance of F1 across all sentences. The result is shown in Figure 5
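The two analyses above reduce to standard statistics; a minimal sketch, with hypothetical per-clique F1 lists as input:

```python
from statistics import pvariance, mean

def clique_variances(clique_f1s):
    """Population variance of F1 within each clique (one list per clique)."""
    return [pvariance(f1s) for f1s in clique_f1s]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```

Plotting clique variance against the clique's average HWS/CTK value (or correlating subset-level averages, as in the −0.1480 result) follows directly from these two functions.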

Related Work
OpenIE Approaches. The OpenIE task was first proposed by Banko et al. (2007) and is a fundamental NLP task. Earlier models focused on statistical or rule-based methods (Christensen et al., 2011; Schmitz et al., 2012; Del Corro and Gemulla, 2013; Angeli et al., 2015; Pal et al., 2016; Saha et al., 2017, 2018). Recently, with the rapid development of deep representation learning, many supervised neural models have been proposed for OpenIE. These approaches can be roughly classified into two lines: 1. Sequence labeling-based models. RnnOIE (Stanovsky et al., 2018) applies a BiLSTM transducer, extending deep Semantic Role Labeling models to extract tuples. SenseOIE (Roy et al., 2019) leverages an ensemble of multiple unsupervised OpenIE systems' outputs together with lexical and syntactic information to improve performance. SpanRel (Jiang et al., 2020) represents the OpenIE task in a single format consisting of spans and relations between spans. SpanOIE (Zhan and Zhao, 2020) predicts candidate relation spans and classifies all possible spans of the sentence as subject or object for each relation span. Multi2OIE (Ro et al., 2020) first predicts all relational arguments with BERT and then predicts the subject and object arguments associated with each relation using multi-head attention. OpenIE6 (Kolluru et al., 2020a) provides an iterative grid labeling architecture, which treats OpenIE as a 2-D grid labeling task. 2. Sequence generative models. Neural Open IE (Cui et al., 2018) and Logician (Sun et al., 2018) generate OpenIE extractions in a seq2seq paradigm. IMoJIE (Kolluru et al., 2020b) leverages a BERT-based encoder and generates each next extraction fully conditioned on the extractions produced so far.
OpenIE Benchmarks.
Several benchmark datasets have been proposed to evaluate existing OpenIE approaches. OIE2016 (Stanovsky and Dagan, 2016) developed a method to create a large-scale OpenIE dataset using QA-SRL annotations (He et al., 2015), which was later found to be noisy, with missing extractions. After that, CaRB (Bhardwaj et al., 2019) and Re-OIE2016 (Zhan and Zhao, 2020) re-annotated the corpus to improve its quality for more accurate evaluation. Wire57 (Lechelle et al., 2019) provided high-quality expert annotations, but with only 57 sentences it is too small to serve as a comprehensive test set. DocOIE (Dong et al., 2021) argued that in reality a sentence usually exists as part of a document rather than standalone, and that contextual information can help models understand it better; the authors annotated a document-level OpenIE dataset. LSOIE (Solawetz and Larson, 2021) was built by converting the QA-SRL BANK 2.0 dataset (FitzGerald et al., 2018) to OpenIE, a significant improvement over previous work in terms of data quantity. BenchIE (Gashteovski et al., 2022) created a fact-based benchmark and framework for multi-faceted, comprehensive evaluation of OpenIE models in a multilingual setting.
Despite the widespread interest in these benchmarks, and although the related OpenIE approaches provide promising results, the traditional peer-to-peer matching-based evaluation cannot measure the robustness of those approaches in settings where syntax and expression may vary under the same underlying meaning (Qi et al., 2023). This work fills the gap between traditional metrics and the missing robustness evaluation for OpenIE and calls for more effort in this research area.

Conclusion and Future Work
In this work, we propose ROBUST, a large-scale human-annotated OpenIE benchmark consisting of 1,272 robustness testing cliques, where each clique contains sentences with different syntactic and expressive variations while conveying the same underlying knowledge meaning. We introduce our methodology for constructing the benchmark, including syntactically and expressively diverse paraphrase generation and a two-stage manual annotation. A comprehensive analysis is then performed to demonstrate the consistency of the proposed data with the real world. We finally perform extensive experiments on existing successful models as well as a representative large language model, and the results show that the robustness of existing methods is far from satisfactory. A further detailed analysis demonstrates the substantial internal coherence of our benchmark, providing inspiration for the future development of robustness benchmarks.

Limitations
We have presented a dataset with metrics to evaluate the robustness of OpenIE models. However, several limitations need to be addressed in further study. First, a few studies explore pre-trained language models for zero-shot information extraction with promising advantages; due to the lack of open-source code, we have not explored the robustness of these zero-shot models. Second, we believe the robustness problem exists generally across the NLP community, and we leave the extensive study of robustness examination for more domains and models to future work.

Ethic Consideration
There are two major considerations in conducting the evaluation of our proposed benchmark. First, the source sentences are the same as in CaRB, i.e., the original dev and test splits of OIE2016, drawn from open-domain sources of Wall Street Journal text and Wikipedia. All these data files are used for research purposes, and the results will be publicly available. Second, the annotators in this research are paid a salary higher than the market average and are further allowed to choose flexible working times. For data utilization, we will make all annotation results publicly available under the CC BY-SA 4.0 license (free for research use).

A.1 Annotation Details
We provide the following detailed annotation information. Who: For Task 1 and Task 2, we employed two separate annotation teams consisting of 6 and 9 students respectively, all majoring in CS at universities. We ensured their professionalism through tutorials and a final examination. Where: As both tasks are easy to read and write for annotators, we distributed the data directly without using a special annotation platform. Quality: We adopted a batched, iterative annotation-and-evaluation process to ensure that the sampling accuracy is above 90%. License: We will release all annotation results under the CC BY-SA 4.0 license (free for research use).

A.1.1 Paraphrase Annotation Process
The goal of paraphrase annotation is to correct the automatically generated sentences from the models based on human intelligence. Overall, we adopt an iterative procedure combining human annotation with expert evaluation to ensure accuracy and efficiency. In each iteration, at least three human workers who are fluent in English reading and writing annotate a batch of samples, and two domain experts then check the annotation results on a random sample of 40% of the batch. The batch annotations are accepted only when the validation accuracy is greater than 90%. For each paraphrase, the annotators are asked to correct syntactic, phrasal, or semantic mistakes against the original sentence.

A.1.2 N-tuples Annotation Process
We leverage the same iterative annotation strategy as the paraphrase annotation for OpenIE N-tuple annotation. In particular, we design an annotation flowchart for the workers, following a process similar to CaRB, by dividing the task into three steps: (1) identifying the relation, (2) identifying the arguments for that relation, and (3) optionally identifying the location and time attributes for the tuple. The same validation manner as the paraphrase annotation is adopted to accept each annotation batch.
Algorithm 1 (HWS Distance). Input: constituency parses T_1, T_2 of sentences s_1, s_2, a pruning height h, and a discount factor α. Output: the syntactic distance d between s_1 and s_2. The trees are first pruned at height h to obtain T_1^h and T_2^h, and their level-order traversal sequences q_1 and q_2 are taken; the total length and count are initialized as l = 0 and m = 0, and d is then accumulated by a dynamic program over positions i = 2, ..., q_1.len and j = 2, ..., q_2.len.
The revised Hierarchically Weighted Syntactic Distance algorithm (HWS distance) is shown in Algorithm 1. We fix the over-counting problem for repeated consecutive spans while preserving efficiency, with the same time complexity as the original work (Qi et al., 2023).
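To make the idea concrete, the following is a minimal Python sketch of a hierarchically weighted syntactic distance. It borrows only the recoverable ingredients of Algorithm 1 (pruning at height h, level-order traversal, per-depth discount α); the mismatch-counting rule here is a simplified stand-in, not the paper's exact dynamic program, and `level_order`/`hws_distance` are illustrative names.

```python
def level_order(tree, h):
    """Return (label, depth) pairs of a nested-tuple tree, pruned at height h.
    A tree is (label, [children...]); leaves have an empty child list."""
    out, frontier = [], [(tree, 0)]
    while frontier:
        (label, children), depth = frontier.pop(0)
        if depth >= h:        # prune everything at or below height h
            continue
        out.append((label, depth))
        frontier.extend((c, depth + 1) for c in children)
    return out

def hws_distance(t1, t2, h=4, alpha=0.5):
    """Simplified stand-in: discounted mismatch count between the pruned
    level-order sequences, where a mismatch at depth k contributes alpha**k."""
    q1, q2 = level_order(t1, h), level_order(t2, h)
    dist = 0.0
    for i in range(max(len(q1), len(q2))):
        l1, d1 = q1[i] if i < len(q1) else (None, None)
        l2, d2 = q2[i] if i < len(q2) else (None, None)
        depth = d1 if d1 is not None else d2
        if l1 != l2:
            dist += alpha ** depth  # deeper mismatches count less
    return dist
```

Identical trees yield distance 0, and swapping two depth-1 constituents yields 2 · α, reflecting the hierarchical discounting.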

A.2.2 Diversified Filtering Process
We perform diversified filtering based on the BLEU scores between all pairs of sentences in each set of generated paraphrases, keeping the most diverse paraphrases. For example, consider the following generated paraphrases:

ori: In 1840, he was appointed to command his regiment, a post he held for nearly fourteen years.
p1: 1840, the regiment's commander, which he held for nearly 14 years.
p2: In 1840 he took command of the regiment and held it for nearly 14 years.
p3: When he was 14 years old, he became a member of the regiment.
p4: 1840, the command of the regiment, which he held for nearly 14 years.
p5: The regiment, then, in 1840, the rank of captain, which he held for nearly 14 years.
As shown in Figure 7, we first calculate the BLEU scores between all pairs of paraphrases (shown on the edges). We then find the two sentences p1, p4 with the maximum BLEU score. Because the lengths of these two sentences are both larger than 2/3 of the original sentence, we calculate the sum of scores from each of them to all other sentences, which yields sum(p1, p/1) = 136.9 and sum(p4, p/4) = 158.7, and remove the sentence p4, which has the larger sum. We repeat this strategy to remove the sentence p1, obtaining 3 expressively diverse paraphrases.
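A minimal sketch of this greedy filtering is shown below. To stay dependency-free it uses a toy n-gram-overlap score in place of real BLEU and omits the 2/3-length check against the original sentence; `overlap` and `diversify` are illustrative names, not the paper's released implementation.

```python
from itertools import combinations

def overlap(a, b, n=2):
    """Toy n-gram overlap in [0, 100], standing in for a BLEU score."""
    grams = lambda s: {tuple(s.split()[i:i + n])
                       for i in range(len(s.split()) - n + 1)}
    ga, gb = grams(a), grams(b)
    return 100 * len(ga & gb) / max(len(ga | gb), 1)

def diversify(paraphrases, keep=3):
    """Greedily drop, from the most-similar pair, the member whose total
    similarity to all others is larger, until `keep` sentences remain."""
    pool = list(paraphrases)
    while len(pool) > keep:
        # find the most similar pair
        p, q = max(combinations(pool, 2), key=lambda pair: overlap(*pair))
        total = lambda s: sum(overlap(s, o) for o in pool if o is not s)
        # remove the one with the larger summed similarity
        pool.remove(p if total(p) >= total(q) else q)
    return pool
```

Applied to the example above, this loop would first compare p1 and p4, drop the one closer to the rest of the set, and repeat until the desired number of diverse paraphrases remains.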
Figure 7: By performing the diversified filtering, the 3 paraphrases p2, p3, p5 are maintained.

A.3.1 Prompt Design
We create a prompt template for the OpenIE task to query ChatGPT. An example of a 1-shot prompt is shown in Figure 8, where the highlighted demonstration and the variable <sentence> can be replaced with specified examples.
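As a minimal illustration of the templating (the template text is abridged here, and `build_prompt` is a hypothetical helper, not released code), the 1-shot prompt can be assembled by splicing the demonstration and the query sentence into a fixed template:

```python
# Hypothetical assembly of the 1-shot OpenIE prompt of Figure 8.
# The template text is abridged; the full prompt is given in the appendix.
TEMPLATE = (
    "Open information extraction requires the extraction of all relations "
    "in the sentence...\n"
    "For example, in the sentence: {demonstration}\n"
    "Please follow the example above and extract all the relational tuples "
    "in the following sentence: {sentence}\n"
    "Please show the results in one line strictly in the form of the "
    "results above."
)

def build_prompt(demonstration: str, sentence: str) -> str:
    """Fill the demonstration and query slots of the 1-shot template."""
    return TEMPLATE.format(demonstration=demonstration, sentence=sentence)
```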

A.3.2 Performance with Syntactic Correlations
In this section, we further investigate the correlation between model performance and the syntactic distance between demonstrations and questions for the ChatGPT model. We first randomly sample a set of 100 pairs of cliques {(C^i_1, C^i_2) | i = 1, ..., 100} in ROBUST. Then, for each pair, we select all examples in clique C^i_1 as demonstrations and all sentences in C^i_2 as questions to calculate the F1-robust score. For syntactic correlations, we first calculate the averaged value a_i between question i and all sentences in C^i_1, and further average over (a_1, a_2, ...) as the final correlation for the current clique pair. We divide the scores into several intervals and compute the average value in each interval to avoid abnormal values. The results, using both the HWS distance and the Tree Kernel similarity as the syntactic correlation, are shown in Figure 9.
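The interval averaging described above can be sketched as follows; `bin_average` is an illustrative helper, and equal-width bins are an assumption, since the paper does not specify how the intervals are chosen.

```python
def bin_average(xs, ys, n_bins=5):
    """Average y within equal-width bins of x, smoothing abnormal values.
    Returns (bin_center, mean_y) pairs for non-empty bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0   # guard against a degenerate range
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi into last bin
        sums[k] += y
        counts[k] += 1
    return [(lo + (k + 0.5) * width, sums[k] / counts[k])
            for k in range(n_bins) if counts[k]]
```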
In the left panel of the result, we can see that the F1-robust score of the model gradually increases as the average syntactic similarity between the two cliques increases. The same observation is shown in the right panel with the averaged syntactic distance between the two cliques. These results suggest that ChatGPT is sensitive to the syntactic distribution between questions and demonstrations, and that giving demonstrations with a similar syntactic distribution enhances the effectiveness of ChatGPT.

A.4 Error Analysis for OIE Systems
We conduct error analysis for three typical OpenIE models, OpenIE4, SpanOIE, and OpenIE6, on a robustness clique. The model predictions with the CaRB and ROBUST scores are shown in Table 4.
First, we can see that the sentences in the clique exhibit significant syntactic and expressive divergence, implying that the constructed data source satisfies our expectation. Second, we find that all sentences in the clique have more than one extraction, while the OpenIE4 and OpenIE6 models predict extractions insufficiently, which causes lower recall. On the other hand, the SpanOIE model outputs predictions by enumerating all possible spans, which produces sufficient outputs regardless of syntactic features. This architecture gives SpanOIE consistent performance.

Figure 2: An example of a robustness clique consisting of three sentences from ROBUST, where the sentences exhibit syntactic and expressive variants while preserving the same structured knowledge meaning. In contrast to conventional metrics, ROBUST measures the robustness score over all nodes of a clique.

Figure 3: The average syntactic distance/similarity in each clique, calculated using the HWS distance and Convolutional Tree Kernels, where the x-axis refers to the hierarchical discounting weights of the two algorithms.

Figure 4: The average syntactic distance/similarity over all cliques with the hierarchical discounting weights. Cliques containing only one point appear as a line with a value of 0 or 1.
Figure 5: (a) The distribution of the number of cliques with the variance of F1 scores in each clique. (b) The variance of F1 scores with the values of HWS distance. (c) The variance of F1 scores with the values of Convolutional Tree Kernel similarity. Both correlation values are divided into several intervals to avoid abnormal values.
The results in Figure 5 indicate a general trend in which the variance of model performance decreases as syntactic divergence increases. Given the main experimental results, which show low model performance on the overall benchmark, this degradation implies consistently poorer model performance in more open scenarios.

Figure 8: The 1-shot prompt to ChatGPT for the OpenIE task, where <sentence> corresponds to the query sentence.

Figure 9: The F1-robust scores of the OpenIE6 model with syntactic correlations between clique-pairs.
* Corresponding author: xubin@tsinghua.edu.cn

Watson has served as Minority Leader since elected by his caucus in November 1998.
Since his election by his caucus in November 1998, Watson has been the Minority Leader.
Watson, who was elected by his caucus in November 1998, has served as Minority Leader since then.

Table 2: The performance of typical OpenIE systems on the CaRB and ROBUST benchmarks. The row ∆ represents the difference between the CaRB score and the ROBUST score (↓ means degradation from CaRB). Bold numbers refer to the highest score per metric or the highest difference per row (i.e., highest ∆ for P, R, and F1).
prompt = "Open information extraction requires the extraction of all relations in the sentence, i.e., predicates, the subjects and objects corresponding to these relations, and the possible time and place elements. In these tuples, we always put the predicate first, the second is the subject corresponding to the predicate, the third is the object corresponding to the predicate (if there is none, it is not labeled), and the last two are time and place in that order, which can be omitted if there is none. For example, in the sentence: Watson, who was elected by his caucus in November 1998, has served as Minority Leader since then. From this sentence, the following tuples can be extracted: (was elected by, Watson, his caucus in November 1998); (has served as, Watson, Minority Leader, since then). Please follow the example above and extract all the relational tuples in the following sentence: <sentence> Please show the results in one line strictly in the form of the results above."