ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select

We study the problem of extracting N-ary relation tuples from scientific articles. This task is challenging because the target knowledge tuples can reside in multiple parts and modalities of the document. Our proposed method ReSel decomposes this task into a two-stage procedure that first retrieves the most relevant paragraph/table and then selects the target entity from the retrieved component. For the high-level retrieval stage, ReSel designs a simple and effective feature set, which captures multi-level lexical and semantic similarities between the query and components. For the low-level selection stage, ReSel designs a cross-modal entity correlation graph along with a multi-view architecture, which models both semantic and document-structural relations between entities. Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.


Introduction
Scientific information extraction (SciIE) (Augenstein et al., 2017; Luan et al., 2018; Jiang et al., 2019), the task of extracting scientific concepts along with their relations from scientific literature corpora, is important for researchers to keep abreast of the latest scientific advances. A key subtask of SciIE is the N-ary relation extraction problem (Jia et al., 2019; Jain et al., 2020), which aims to extract the relations of different entities as N-ary knowledge tuples. This problem is challenging because the entities of the knowledge tuples often reside in multiple sections (e.g., abstracts, experiments) and modalities (e.g., paragraphs, tables, figures) of the document. Effective scientific N-ary relation extraction requires not only understanding the semantics of different modalities, but also performing document-level inference based on interleaving signals such as co-occurrences, co-references, and structural relations, as shown in Figure 1.¹

Document-level N-ary relation extraction has been studied in the literature (Jia et al., 2019; Jain et al., 2020; Viswanathan et al., 2021; Liu et al., 2021). Some works (Zeng et al., 2020; Tu et al., 2019) use graph-based approaches to model long-distance relations in the document, with a focus on text only. However, for scientific articles, an equally if not more important data structure is the table, as scientific results are often reported in tables and then referred to and discussed in text. Other works pre-train large-scale transformer models on massive table-text pairs (Yin et al., 2020; Herzig et al., 2020). These methods are designed for question answering: they are strong at retrieving answers that semantically match the query, but fall short at inferring fine-grained entity-level N-ary relations. Besides, to perform well on SciIE, they usually require large task-specific datasets to fine-tune the pre-trained model, especially for long documents that contain many candidates. In practice, such large-scale annotation data can be expensive and labor-intensive to curate. Therefore, extracting N-ary relations jointly from scientific text and tables remains an important but challenging problem.

¹ Our code is available at https://github.com/night-chen/ReSel.
We propose ReSel, a hierarchical retrieve-and-select model for multi-modal and document-level SciIE. In ReSel, we pose the N-ary relation extraction problem as a question answering task over text and tables (Figure 1). ReSel then decomposes this challenging task into two simpler subtasks: (1) high-level component retrieval, which aims to locate the target paragraph/table where the final target entity resides, and (2) low-level entity extraction, which aims to select the target entity from the chosen component.
For high-level component (i.e., paragraph or table) retrieval, we design a feature set that combines the strengths of two classes of retrieval methods: (1) sparse retrieval (Aizawa, 2003; Robertson and Zaragoza, 2009), which represents query-candidate pairs as high-dimensional sparse vectors to encode lexical features; and (2) dense retrieval (Karpukhin et al., 2020), which leverages latent semantic embeddings to represent queries and candidates. We design sparse and dense retrieval features for query-component pairs by augmenting BERT-based (Devlin et al., 2019) semantic similarities with entity-level semantic and lexical similarities, allowing us to train an accurate high-level retriever using only a small amount of labeled data.
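As a rough illustration of pairing the two feature classes (not the authors' actual feature set), one can combine a dense embedding-cosine signal with a sparse token-overlap signal for each query-component pair; all names below are hypothetical:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lexical_overlap(query_tokens, comp_tokens):
    """A simple sparse-style signal: fraction of query tokens in the component."""
    comp = set(comp_tokens)
    return sum(t in comp for t in query_tokens) / max(len(query_tokens), 1)

def retrieval_features(q_emb, c_emb, q_tokens, c_tokens):
    """Concatenate a dense (embedding cosine) and a sparse (lexical) feature."""
    return [cosine(q_emb, c_emb), lexical_overlap(q_tokens, c_tokens)]
```

A downstream classifier can then be trained over these concatenated features, as the retriever does over its richer feature views.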
The low-level entity extraction stage aims to infer N-ary entity relations from complex and noisy signals across paragraphs and tables. In this stage, we first build a cross-modal entity correlation graph, which encodes different entity-entity relations such as co-occurrence, co-reference, and table structural relations. While most existing methods (Zheng et al., 2020; Zeng et al., 2020) use BERT embeddings as node representations, we find BERT embeddings limited in distinguishing adjacent table cells or similar entities. This issue is even more severe when the BERT embeddings are propagated on the graph. To address this, we design a new bag-of-neighbors (BON) representation, which computes the lexical and semantic similarities between each candidate entity and its 1-hop neighbors. We then feed the BON features into a graph attention network (GAT) to capture both neighboring semantics and structural correlations. The GAT-learned features and the BERT-based embeddings are treated as two complementary views, which are co-trained with a consistency loss.
We summarize our key contributions as follows: (1) we propose a hierarchical retrieve-and-select learning method that decomposes N-ary scientific relation extraction into two simpler subtasks; (2) for high-level component retrieval, we propose a simple but effective feature-based model that combines multi-level semantic and lexical features between queries and components; (3) for low-level entity extraction, we propose a multi-view architecture that fuses graph-based structural relations with BERT-based semantic information; (4) extensive experiments on three datasets show the superiority of both the high-level and low-level modules in ReSel.

Related Work
Component Retrieval. For component retrieval, traditional sparse retrieval methods such as TF-IDF (Aizawa, 2003) and BM25 (Robertson and Zaragoza, 2009) focus on keyword-level matching but ignore entity semantics. Recently, pretrained language models have also been used to represent queries and documents in a learned space (Karpukhin et al., 2020) and have been extended to handle tabular context (Herzig et al., 2021; Ma et al., 2022). However, these methods mainly focus on passage-level retrieval and cannot capture fine-grained entity-level semantics well (Zhang et al., 2020; Su et al., 2021), which makes them suboptimal for encoding nuanced terms and descriptions in scientific articles. In contrast, ReSel leverages both component- and entity-level semantic and lexical features that help the model better understand the correlations between components and queries.
N-ary Relation Extraction. Many existing methods (Jia et al., 2019; Jain et al., 2020; Viswanathan et al., 2021) treat N-ary relation extraction as a binary classification problem and predict whether a composition of N entities in the document is valid or not. However, the candidate space grows exponentially with N, and the performance of the binary classifiers can be largely influenced by the number and quality of negative tuples. Some other methods (Du et al., 2021; Huang et al., 2021) formulate the problem as role-filler entity extraction and propose BERT-based generative models to extract the correct entities for each element of the N-ary relation. None of these methods consider N-ary relations across modalities. Lockard et al. (2020) leverage layout information for extracting relations from web pages; however, the layout information in scientific articles is less prominent and harder to utilize.

Problem Formulation

Given a document D and a query consisting of the first N−1 elements of a knowledge tuple, the task is to extract the correct N-th element from document D to form a valid N-ary relation. We assume a dataset {x_k, y_k}_{k=1}^M that can be used to learn such an N-ary relation extractor, where each sample includes a document and a set of queries, and each ground-truth label y_k indicates the target entity in the document D_k.

Component Retriever

In Stage I, we design a high-level model to retrieve the most relevant paragraphs or tables that contain the final answer. We first use BERT to embed the paragraphs/tables into sequences of vectors (details in Appendix A.1). We encode the j-th query Q_j = [e_{j,1}, ..., e_{j,N−1}] into a query embedding h(Q_j) and obtain the corresponding element embeddings h(e_{j,a}), a ∈ {1, ..., N−1}. Similar to the query encoder, we encode the i-th component C_i as a component embedding h(C_i), together with the averaged entity embeddings h(m_{i,b}), where m_{i,b} ∈ C_i indicates the b-th entity extracted from C_i. With the encoded sequences of vectors, we compute three views of features for the component-query pair (C_i, Q_j), taking advantage of both entity-level matching signals and component-level semantic signals, which are complementary.

Component-Level Semantic Features (CS). The first view extracts semantic features for component-query pairs from two angles: (1) Embedding-Based Similarity: the cosine similarity f_cs-1(C_i, Q_j) between the component and query embeddings; (2) Entailment-Based Score: the classification score f_cs-2(C_i, Q_j), obtained by feeding Q_j and C_i as a concatenated sequence into a BERT binary sequence classifier (Nogueira and Cho, 2019; Nie et al., 2019). We concatenate these two scalar features as the first view f_cs(C_i, Q_j).
Entity-Level Semantic Features (ES). The second view computes entity-level cosine similarities f_es(m_{i,b}, e_{j,a}) between the component entity embeddings h(m_{i,b}) and the query element embeddings h(e_{j,a}). With all these similarity scores, we apply a max-pooling operation over all component entities m_{i,b}, and use the obtained maximum f_es(C_i, e_{j,a}) = max_{m_{i,b} ∈ C_i} f_es(m_{i,b}, e_{j,a}) to represent the relation between the component C_i and one query element e_{j,a}. Then, we gather the relation scores across all query elements as the final entity-level semantic feature vector f_es(C_i, Q_j) = [f_es(C_i, e_{j,1}), ..., f_es(C_i, e_{j,N−1})].

Entity-Level Lexical Features (EL). Our third view extracts lexical features between the component entities and the query elements. We compute three text similarities (Appendix A.2): (1) the Levenshtein Distance (Levenshtein et al., 1966); (2) the length of the Longest Common Substring; (3) the length of the Longest Common Subsequence. As these metrics vary in scale with the length of the strings, we normalize them by the involved string lengths to obtain f_el(m_{i,b}, e_{j,a}) ∈ [0, 1]^3. Similar to the ES features, we perform max-pooling to obtain the relation scores between the component and a single query element, f_el(C_i, e_{j,a}) = max_{m_{i,b} ∈ C_i} f_el(m_{i,b}, e_{j,a}), and concatenate the results as the entity-level lexical feature vector f_el(C_i, Q_j) = [f_el(C_i, e_{j,1}), ..., f_el(C_i, e_{j,N−1})].

We aggregate the features to predict which component has the highest probability of containing the final answer. As the features in the three views share the same scale range and similar dimensionality, we simply concatenate them as f_h(C_i, Q_j) = [f_cs; f_es; f_el] and train one unified classifier over f_h for component retrieval.
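The entity-level max-pooling described above can be sketched as follows, with a toy exact-match similarity standing in for the embedding/lexical similarities (the entity names are made up):

```python
def entity_level_features(component_entities, query_elements, sim):
    """For each query element e_{j,a}, max-pool sim(m, e) over all component
    entities m_{i,b}, then concatenate across elements into one vector."""
    return [max(sim(m, e) for m in component_entities) for e in query_elements]

# Toy similarity: exact-match scoring, a stand-in for embedding cosine.
exact = lambda m, e: 1.0 if m == e else 0.0
```

For the real features, `sim` would be the embedding cosine (ES) or the normalized string similarities (EL), producing one score per query element.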

Entity Extractor
In Stage II, we use the predictions from Stage I to restrict the search space for low-level entity extraction.

Multi-Modal Entity-Level Graph
To model document-level entity correlations, we construct a multi-modal entity correlation graph G = (V, E), where V denotes the entity nodes and E denotes the edges between them. Each node v_i ∈ V represents a paragraph entity or a table cell. We construct different edge types to model the intra- and inter-modality relations and encode entity correlations across modalities, as in Figure 3: (1) Co-occurrence Edge measures whether two entity nodes v_i and v_j occur in the same sentence or adjacent sentences; (2) Co-reference Edge captures two entity nodes v_i and v_j referring to the same concept; (3) Reference Edge bridges the table and text with reference information (e.g., "in Table 3"); (4) Table-Structure Edge extracts the structural information of the columns and rows of tables; (5) Table-Paragraph Connection enhances the linking between table cells and paragraph entities via text similarities (detailed in Appendix A.3). With these five edge types from different modalities covering nearly all hidden relations in the document, the multi-modal entity correlation graph can effectively model document-level information. As the weights of all edge types lie in [0, 1] and most of them do not overlap, we treat them equally and define the graph as an undirected homogeneous graph.
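A minimal sketch of assembling such a graph, assuming node names and edge weights are hypothetical (only the "max weight on overlap" and undirected-homogeneous conventions come from the text):

```python
def build_entity_graph(nodes, typed_edges):
    """Assemble an undirected homogeneous graph from typed edge lists.

    `typed_edges` maps an edge-type name to (pairs, weight); all weights lie
    in [0, 1], and when several types connect the same pair we keep the max.
    """
    adj = {n: {} for n in nodes}
    for pairs, w in typed_edges.values():
        for u, v in pairs:
            adj[u][v] = max(w, adj[u].get(v, 0.0))
            adj[v][u] = max(w, adj[v].get(u, 0.0))
    return adj

# Toy document: two in-text entities and one table cell (made-up names).
graph = build_entity_graph(
    ["BERT", "93.56", "cell_r2c3"],
    {
        "cooccur_same_sentence": ([("BERT", "93.56")], 1.0),
        "reference":             ([("93.56", "cell_r2c3")], 1.0),
    },
)
```

The resulting adjacency structure is what the BON features and the GAT below would operate on.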

Bag-of-Neighbors Features
For low-level entity extraction from the retrieved paragraph/table, a key challenge is that entities (nodes) in the same sentence or in adjacent table cells can have very similar BERT embeddings and are hard to discriminate with a BERT-only classifier. Further, such entities often share many common neighbors on the graph, which means their embeddings can be further over-smoothed when propagated on the graph. To tackle these challenges, we propose the bag-of-neighbors (BON) features (Figure 4(a)) based on the entity-level semantic and lexical features. Given an entity node v_i and a query Q_j, we define the initial embedding as h^0(v_i) = [f_es(v_i, e_{j,1}); f_el(v_i, e_{j,1}); ...; f_es(v_i, e_{j,N−1}); f_el(v_i, e_{j,N−1})], where e_{j,k} is the k-th query element of Q_j. We then compute the BON features of node v_i via max-pooling the initial embeddings of its adjacent neighboring nodes N(v_i):

BON(v_i) = MaxPool({h^0(v_u) : v_u ∈ N(v_i)}).   (1)
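The max-pooling in Eq. (1) can be sketched as follows; the feature vectors and adjacency below are made-up inputs, with each node holding a small [semantic, lexical] similarity vector:

```python
def bag_of_neighbors(init_feats, adjacency, node):
    """Element-wise max-pool over the initial feature vectors of the
    1-hop neighbors N(v_i)."""
    neighbor_vecs = [init_feats[n] for n in adjacency[node]]
    return [max(dims) for dims in zip(*neighbor_vecs)]

# Hypothetical initial features: per-node query-similarity scores.
init = {"a": [0.2, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.2]}
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
```

Because the pooled vector depends on which neighbors a node has, two nodes with near-identical BERT embeddings but different neighborhoods receive different BON features.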

Graph Attention Network
Using BON features alone may not be expressive enough when query information is missing from the 1st-order neighborhood. To include multi-hop relations from distant nodes, we apply a graph attention network (Veličković et al., 2018) to aggregate such information (Figure 4(b)). GAT first computes the normalized attention coefficients α^(l)_{i,j} between node i in the multi-modal correlation graph and its neighboring node j ∈ N(i) in the l-th layer:

α^(l)_{i,j} = softmax_{j ∈ N(i)}( σ( a^T [W h^(l)_i ; W h^(l)_j] ) ),

where h^(l)_i is the l-th layer hidden feature of node i, W is a learnable weight matrix, a is a trainable weight vector, and σ(·) is the LeakyReLU(·) activation function. The initial node embeddings are the bag-of-neighbors features, i.e., h^(0)_i = BON(v_i). Then, we aggregate the neighbor embeddings into the (l+1)-th layer node embeddings via a weighted sum based on the computed attention coefficients:

h^(l+1)_i = σ( Σ_{j ∈ N(i)} α^(l)_{i,j} W h^(l)_j ).   (2)

For an L-layer GAT, the updated node embedding is denoted as h^(L)_i.
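A pure-Python sketch of one such attention layer; the tiny graph, weight matrix W, and attention vector a are made-up inputs, and the output nonlinearity is omitted for brevity:

```python
from math import exp

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gat_layer(h, adjacency, W, a):
    """One GAT layer: score each neighbor with LeakyReLU(a . [W h_i ; W h_j]),
    softmax over N(i), then take the weighted sum of transformed neighbors."""
    Wh = {n: matvec(W, v) for n, v in h.items()}
    out = {}
    for i, neigh in adjacency.items():
        scores = [leaky_relu(sum(ak * x for ak, x in zip(a, Wh[i] + Wh[j])))
                  for j in neigh]
        exps = [exp(s) for s in scores]
        z = sum(exps)
        alphas = [e / z for e in exps]
        out[i] = [sum(al * Wh[j][d] for al, j in zip(alphas, neigh))
                  for d in range(len(Wh[i]))]
    return out
```

With a zero attention vector the coefficients become uniform, so each node simply averages its transformed neighbors, which is a handy sanity check.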

Multi-View Aggregation
Although the GAT-propagated BON representations enable the model to extract more answers from tables, they can fall short on paragraphs because they do not encode the original semantic information in the BERT embeddings. Thus, aside from the graph-based branch introduced in § 4.2.1, § 4.2.2, and § 4.2.3, we add another branch based on the BERT representations of these nodes. However, simply concatenating the BON features and the BERT embeddings has several drawbacks: (1) one of the views may dominate the other during training; (2) the features have different dimensionality, making it difficult to learn a unified classifier on the concatenation. Thus, we design two simple classifiers and make them mutually enhance each other during entity selection: (1) one classifier based on the concatenation of the entity nodes' and query elements' BERT embeddings, and (2) the other classifier based on the GAT-updated BON features. Given the node v_i and the query Q_j, and using feedforward neural networks (FFNN) as the classifiers, we have

ŷ_1(v_i, Q_j) = FFNN_1([h_BERT(v_i); h(Q_j)]),   (3)
ŷ_2(v_i, Q_j) = FFNN_2(h^(L)_i).   (4)

Then, we average the scores of the two simple classifiers as the prediction of the final aggregated classifier:

ŷ(v_i, Q_j) = (ŷ_1(v_i, Q_j) + ŷ_2(v_i, Q_j)) / 2.   (5)
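The averaging step can be sketched as follows; the per-candidate scores below are made-up outputs of the two hypothetical views:

```python
def aggregate_views(bert_scores, graph_scores):
    """Average the two views' per-candidate scores and return the index of
    the highest-scoring candidate entity along with the averaged scores."""
    agg = [(b + g) / 2 for b, g in zip(bert_scores, graph_scores)]
    return agg.index(max(agg)), agg
```

Keeping the two views as separate classifiers (rather than concatenating their features) means neither view's dimensionality can swamp the other during training.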

Low-Level Entity Selection
The training objective for the low-level entity classifiers (§ 4.2.4) has three parts: (1) the classification loss for the aggregated model, ℓ_1 = ℓ_CE(y^low, ŷ); (2) the classification loss for the two sub-classifiers, ℓ_2 = ℓ_CE(y^low, ŷ_1) + ℓ_CE(y^low, ŷ_2); and (3) the consistency loss ℓ_3 between the two sub-classifiers, which penalizes the discrepancy between ŷ_1 and ŷ_2 to encourage them to reach a consensus. The overall objective of the low-level entity extractor is then ℓ_low = ℓ_1 + λ ℓ_2 + µ ℓ_3, where λ and µ are pre-defined balancing hyper-parameters.
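A schematic of the three-part objective; the squared-difference consistency term is an illustrative choice (the exact form of ℓ_3 is not spelled out here), and the default λ, µ follow the values reported in Appendix B.1:

```python
from math import log

def cross_entropy(true_idx, probs, eps=1e-12):
    """Negative log-likelihood of the gold entity index."""
    return -log(probs[true_idx] + eps)

def low_level_loss(y, p_agg, p1, p2, lam=0.3, mu=0.15):
    """l = l1 + lam * l2 + mu * l3: aggregated loss, sub-classifier losses,
    and a consistency penalty between the two views' predictions."""
    l1 = cross_entropy(y, p_agg)
    l2 = cross_entropy(y, p1) + cross_entropy(y, p2)
    l3 = sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)
    return l1 + lam * l2 + mu * l3
```

When λ shrinks relative to µ, the consistency term dominates and the two views can agree on wrong answers, matching the trade-off discussed in the parameter study.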
For the overall two-stage comparison, the baselines are: (1) the BERT-Base model searching the whole document; (2) GCN and (3) GAT, testing the performance of our proposed graph on the whole document; (4) BERT-Entailment+Base, which combines the best baselines for the high- and low-level stages (see Appendix E for more details about the baselines).

Comparison with Baselines
Tables 1-3 present the average performance over multiple random trials². ReSel consistently outperforms the strongest baselines by 9.01%, 6.81%, and 10.25% in Acc and by 4.38%, 12.01%, and 8.27% in MRR on the three datasets at all levels. On the remaining ranking metric, hit rate, ReSel also shows marginal improvements over the baselines.
As ReSel-H employs both component-level semantic features and entity-level matching features, its high-level performance exceeds that of the sparse and dense retrieval baselines, which capture only single-sided information. At the low level, the embedding-based methods and pretrained LMs exploit only latent semantic information, while the graph-based methods focus only on graph topology and can easily suffer from over-smoothing. Compared with these baselines, ReSel-L performs better thanks to its multi-view aggregation of GAT-propagated BON features and BERT embeddings.
Comparing the performance gains for low-level extraction, we find that the components of ReSel-L contribute differently on different datasets. In SciREX, where most of the queried scores are hidden in tables, the GAT-propagated BON features work better at discriminating table cells and numeric values. For PubMed, the targets mostly appear in text rather than tables; thus, the semantic information in the BERT embeddings contributes more to the performance increase. NLP-TDMS is a benchmark dataset that includes multiple relevant choices for a given query, and this ambiguity hurts the performance of all models.

Ablation Studies
We conduct ablation studies on SciREX and present the results in Table 4.

Table 1: The performance of different methods for retrieving high-level components. We measure the performance of different methods in retrieving the ground-truth components in terms of accuracy, MRR, and top-k hit ratios.
Removing the Multi-View Aggregation (MVA) makes ReSel-L's performance decrease significantly. This is because when the BERT embeddings and the GAT-propagated BON features are simply concatenated, the BERT embeddings (which have much higher dimensionality) can dominate the learning process.

Parameter Studies
Figure 5 shows our parameter study results.
λ and µ. The loss ℓ_1 of the aggregated classifier plays the leading role in the training objective. When λ is too small or µ is too large, the consistency regularization between the two classifiers contributes more than their respective classification losses, making them more inclined to produce incorrect-but-identical predictions; conversely, when λ is too large or µ is too small, the classifiers begin to produce biased predictions, and the aggregation deteriorates to a mere average.
L and H. The number of GAT layers L determines the depth of neighbor information on the graph, i.e., the order of the neighbors used in aggregation. As L increases, adjacent nodes aggregate more common neighbors, making it easier for the GAT to fall into over-smoothing.
The width of neighbor information on the graph is dictated by the number of relations we encode from the neighbors. When we increase the number of attention heads H, the GAT learns and combines several sets of attention scores over the neighboring nodes, which can also introduce more irrelevant or misleading information from them. Besides, whenever H or L increases, the model has more parameters to train, requiring more time and data.

Case Study
Figure 6 shows a representative example illustrating the efficacy of ReSel. It shows the predictions of the GCN baseline and of ReSel for two queries on the same document. The darker the color of a table cell, the higher its prediction score. We can clearly see that BERT embeddings alone cannot distinguish which numerical value is the final answer. The graph-propagated

Conclusion
We proposed ReSel, a two-stage method for N-ary relation extraction jointly from scientific text and tables. ReSel consists of two key components: a high-level component retriever and a low-level entity extractor. The multiple feature views defined in the high-level retriever enable our model to leverage semantic and lexical information from both paragraphs/tables and entities. In the low-level entity extractor, the multi-view aggregation effectively encodes both the topological information from the graph and the semantic information from pretrained BERT embeddings. Extensive experiments on three datasets show that ReSel consistently and significantly outperforms all baseline models.

Limitations
While ReSel has demonstrated superior performance compared with state-of-the-art baselines, it has several limitations that can be addressed in the future. First, although ReSel extends previous N-ary relation extraction to both text and tables, it cannot extract from images, another important data modality in scientific articles. This necessitates augmenting ReSel with optical character recognition (OCR) techniques to parse images and jointly extract from the text, table, and image modalities. Second, we found that datasets for SciIE are limited and expensive to curate, especially as we aim to expand to images. Accurate annotation for multi-modal SciIE is time-consuming and needs more collaborative effort from the related communities. Third, ReSel currently does not model layout information (e.g., font style, font size), which may also contain clues for intra- and inter-modality relations. Some existing studies (Xu et al., 2020, 2021a,b) have worked on pre-training models that encode layout information, which would be interesting to combine with ReSel.

A Computational Details of the Methodology
A.1 Query and Component Encoder

• Query Encoder: To generate a more natural-language sequence for the BERT encoder, we re-formulate the query into a question [q_{j,1}, ..., q_{j,M_j}], where M_j is the number of words in the generated question. In this way, we are able to use the [CLS] token embedding as the query embedding h(Q_j). By averaging the embeddings of the words related to query element e_{j,a}, we obtain the a-th query element embedding h(e_{j,a}), e_{j,a} ∈ Q_j.

• Component Encoder: As mentioned in § 3, each component in a document can be denoted as a sequence of words [w_{i,1}, ..., w_{i,N_i}]. We then directly encode the paragraph embedding h(C_i), the included word embeddings {h(w_{i,1}), ..., h(w_{i,N_i})}, and the averaged entity embeddings h(m_{i,b}), where m_{i,b} ∈ C_i indicates the b-th entity extracted from the component C_i.

A.2 Text Similarities

• Levenshtein Similarity: the string similarity based on the Levenshtein Distance (Levenshtein et al., 1966), computed as 1 − Leven_Dist(s_1, s_2)/max(|s_1|, |s_2|), where Leven_Dist(·, ·) measures how different two strings are by counting the number of deletions, insertions, or substitutions required to transform one string into the other.
• Longest Common Substring: the ratio between the length of the longest common substring and the minimum length of the two strings, |LCStr(s_1, s_2)| / min(|s_1|, |s_2|), where LCStr(·, ·) denotes the longest common substring of the two given strings.
• Longest Common Subsequence: the longest common subsequence is the longest subsequence that is common to both strings. Unlike the longest common substring, the elements of the subsequence need not occupy consecutive positions within the original sequences. The feature is the ratio between the length of the longest common subsequence and the minimum length of the two strings, |LCSeq(s_1, s_2)| / min(|s_1|, |s_2|), where LCSeq(·, ·) denotes the longest common subsequence of the two given strings.
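A self-contained sketch of the three similarities, using the max/min-length normalizations described above (one plausible reading of the normalization):

```python
def levenshtein(s, t):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def lcstr_len(s, t):
    """Length of the longest common (contiguous) substring."""
    best, prev = 0, [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if cs == ct else 0)
        best = max(best, max(cur))
        prev = cur
    return best

def lcseq_len(s, t):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if cs == ct else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def string_similarities(s, t):
    """Three scores in [0, 1]: normalized Levenshtein, LCStr, and LCSeq."""
    mx, mn = max(len(s), len(t)) or 1, min(len(s), len(t)) or 1
    return (1 - levenshtein(s, t) / mx, lcstr_len(s, t) / mn, lcseq_len(s, t) / mn)
```

These are the kinds of lexical scores that feed the EL features in § 4.1 before max-pooling.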

A.3 Cross-Modal Graph Construction
• Co-occurrence Edge: When two entity nodes v_i and v_j occur in the same sentence, we connect them with a co-occurrence edge E_(v_i,v_j) with weight w_s. If v_i and v_j do not co-occur in the same sentence but appear in two adjacent sentences, we still connect them but assign a smaller weight w_t (w_t < w_s) to the edge E_(v_i,v_j). In practice, we set w_t = w_s/2.

• Co-Reference Edge extracts the intra-paragraph relation. When two entity nodes v_i and v_j refer to the same concept (e.g., abbreviations and full names, common names and scientific names), we connect them with a co-reference edge E_(v_i,v_j).
• Reference Edge extracts the inter-modality relationship between paragraphs and tables. When an entity node v_i occurs in a sentence with a reference mark (e.g., "Table 3"), we link it to any node v_j in the referenced table with a reference edge E_(v_i,v_j).

• Table-Structure Edge extracts the intra-table relation. We connect a table-structure edge E_(v_i,v_j) between a table cell node v_i and another node v_j appearing in the corresponding column header, row header, or table caption.

• Table-Paragraph Connection bridges the paragraph-table relation. Given an entity node v_i in a paragraph and a cell node v_j in a table, we place a table-paragraph connection edge E_(v_i,v_j) between them, with a weight in [0, 1] computed from text similarities between the surface strings of the two nodes.

B.1 Hyper-parameters Settings
For high-level component retriever training, the learning rate is set to 1e-4 and the maximum number of epochs is 50. For low-level entity selector training, we use a 1-layer, single-head GAT (Veličković et al., 2018) over the bag-of-neighbors features computed on first-order neighbors to aggregate graph topology information. We select λ = 0.3 and µ = 0.15 as the proportion weights in the multi-view aggregation. For the feature-based FFNN classifiers in both the high-level and low-level models, we set the dimensions of the hidden layers to 32. The learning rate and maximum number of epochs for the low-level entity extractor are 1e-3 and 50. During training, we use the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9 and β_2 = 0.999 for all models. We select the best set of hyper-parameters based on the accuracy on the corresponding dev sets.

B.2 Implementation Settings
We train and test our code on Ubuntu 18.04.4 LTS with an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz and an NVIDIA GeForce RTX 2080 GPU. We implement our method using Python 3.8 and PyTorch 1.6 (Paszke et al., 2019).

C Dataset Description
We evaluate our work on three different datasets (see Table 5 for details).

D Evaluation Metrics

(3) Top-k Hit Rate (Hit@K) measures whether the ground-truth answer is included in the top-k selections made by the models. We report Hit@2, Hit@3, and Hit@5. For the high-level scenario, we only evaluate whether the model selects the correct component; for the low-level scenario, we evaluate performance with the search space restricted to the ground-truth paragraph/table for entity selection; for the overall scenario, we remove this restriction and test entity selection over the whole document.

E Experimental Baselines
We use different baselines for the high-level component retrieval, low-level entity selection, and the overall framework.
High-Level Baselines. For the high-level model, we compare with the following baselines:
• Sparse Retrieval Methods: 1) TF-IDF (Aizawa, 2003) and 2) BM25 (Robertson and Zaragoza, 2009) are two sparse retrieval methods that rank query-component pairs by computing relevance scores based on keywords;
• Entity-Based Methods: 1) Entity Cosine Similarities (ECS) calculates the cosine similarities between the embeddings of query and component entities, and sums them up as the final prediction score; 2) Deep Entity Cosine Similarities (DECS) improves on ECS by substituting the sum with a feedforward neural network.
• Embedding-Based Methods: 1) BERT-Matching is a matching method based on pretrained BERT embeddings, using the dot product between query and component representations; 2) BERT-Entailment is a textual inference method (Nogueira and Cho, 2019; Nie et al., 2019) for calculating the relevance score; 3) Recurrent Retriever (Asai et al., 2019) is a graph-based recurrent retrieval method that selects one paragraph p_i at each step until it selects an end-of-evidence mark ([EOE]); 4) Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) is a state-of-the-art model that uses BERT as the encoder for passage retrieval in open-domain QA.

Low-Level Baselines. For the low-level model, we restrict the search space to the ground-truth paragraph/table that contains the final answer and compare with the following baselines:
• Embedding-Based Methods: 1) BERT-Base is a simple classifier trained directly on the concatenation of query and candidate embeddings; 2) SciREX (Jain et al., 2020) composes salient entity embeddings for each paragraph and learns a binary classifier to decide whether the N-ary relation exists or not.
• Graph-Based Methods: 1) Graph Convolutional Network (GCN) (Kipf and Welling, 2016) and 2) Graph Attention Network (GAT) (Veličković et al., 2018) are two classic graph neural network architectures; we report their performance when applied to our proposed graph structure; 3) Heterogeneous Document-Entity (HDE) graph (Tu et al., 2019) is a heterogeneous graph model that conducts multi-hop reading comprehension by leveraging the relations between document, entity, and candidate nodes;
• Pre-trained LMs: 1) TAPAS (Herzig et al., 2020) is the state-of-the-art pre-trained model on text and tables; we fine-tune the pre-trained model on our datasets; 2) TDMS-IE (Hou et al., 2019) is an entailment model that scores context-hypothesis pairs over datasets and metrics to judge whether these elements are related to each other.
Overall Baselines. For the overall performance of our two-stage model, we compare with the following baselines: 1) the BERT-Base model searching the whole document; 2) GCN and 3) GAT, testing the performance of our proposed graph on the whole document; 4) BERT-Entailment+Base, a two-stage model combining the best baselines for the high-level and low-level stages.

F Standard Deviation of Main Results
Table 6-Table 8 list the standard deviations we obtain for the main results over multiple trials. The results indicate that ReSel shows competitive stability compared with all baselines on the three datasets under different settings. Evaluation is computed per query, but we split the training, validation, and test sets by document to prevent data leakage. Because different documents include varying numbers of queries, the exact number of queries in the train/val/test sets is not fixed, causing the performance to vary across trials and the standard deviations to increase. For the PubMed dataset, the test set is fixed and the random seeds only influence the split between training and validation sets, so the standard deviations on this dataset are relatively smaller than on the other two datasets, SciREX and NLP-TDMS.

Figure 1: Illustration of the multi-modal scientific N-ary relation extraction problem on the SciREX dataset.

Figure 3: Illustration of constructing the multi-modal entity correlation graph between paragraphs and tables.

Figure 4: Illustration of (a) bag-of-neighbors features and (b) GAT. For both sub-figures, nodes in darker grey contribute more during the aggregation.
Given a document D_k and a query Q_j, y^high_jk and y^low_jk indicate the ground-truth labels of the correct component and entity for high-level component retrieval and low-level entity extraction, while ŷ^high_jk and ŷ^low_jk indicate the predictions from the component retriever and the entity extractor. We define the following training objectives. High-Level Component Retrieval: we use the standard classification loss ℓ_CE(y, ŷ) = Σ_i [−y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)] as the high-level training objective, ℓ_high = ℓ_CE(y^high_jk, ŷ^high_jk).

Table: Dependency parsing: English (SD).

[1] Following the same settings of …… the English PTB and Chinese CTB-5. [2] Table 1 shows the parser that …… [3] The table is also …… with the static oracle (obtained by rerunning Dyer et al. parser) for the sake of comparison between static and dynamic training strategies. [4] The score achieved by the dynamic oracle for English is 93.56 UAS.
NLP-TDMS contains 332 unannotated full-length natural language processing papers (see Appendix C for more details). We extend the original datasets to include both text and tables from the LaTeX or PDF files. Our experiments show that domain-specific BERT models work better than the general-domain BERT model. For SciREX and NLP-TDMS, we use SciBERT (Beltagy et al., 2019) as the encoder for all methods; for PubMed, we use ClinicalBERT (Alsentzer et al., 2019).

Table 2: The performance of different low-level methods in extracting the target entities.

Table 4: Performance comparison of the ablation study.

Table 5: Dataset details. The numbers in the training/val/test split are the numbers of full-length scientific paper documents.

Table 6: The standard deviation of different methods for retrieving high-level components in terms of Acc, MRR, and top-k hit ratios.

Table 7: The standard deviation of different low-level methods in extracting the target entities.

Table 8: The standard deviation of the overall document-level extraction performance of different methods.