Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks

Entity and Relation Extraction (ERE) is an important task in information extraction. Recent marker-based pipeline models achieve state-of-the-art performance but still suffer from the error propagation issue. Moreover, most current ERE models do not account for higher-order interactions between multiple entities and relations, although higher-order modeling could be beneficial. In this work, we propose HGERE (HyperGraph neural network for ERE), which is built upon PL-marker, a state-of-the-art marker-based pipeline model. To alleviate error propagation, we use a high-recall span pruning mechanism to transfer the burden of entity identification and labeling from the NER module to the joint module of our model. For higher-order modeling, we build a hypergraph whose nodes are entities (provided by the span pruner) and relations thereof, and whose hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities. We then run a hypergraph neural network for higher-order inference by applying message passing over the built hypergraph. Experiments on three widely used ERE benchmarks (ACE2004, ACE2005 and SciERC) show significant improvements over the prior state-of-the-art PL-marker.


Introduction
Entity and Relation Extraction (ERE) is a fundamental task in information extraction (IE), comprising two sub-tasks: Named Entity Recognition (NER) and Relation Extraction (RE). There is a long debate on joint vs. pipeline methods for ERE. Pipeline decoding extracts entities first and predicts relations solely on pairs of extracted entities, while joint decoding predicts entities and relations simultaneously. Zhong and Chen (2021) show that pipeline decoding with a frustratingly simple marker-based encoding strategy, i.e., inserting solid markers (Baldini Soares et al., 2019; Xiao et al., 2020) around predicted subject and object spans in the input text, achieves state-of-the-art RE performance. Modified sentences (with markers) are fed into powerful pretrained large language models (LLMs) to obtain more subject- and object-aware representations for RE classification, which is the key to the performance improvement. However, current marker-based pipeline models (e.g., the recent state-of-the-art ERE model PL-marker (Ye et al., 2022)) only send predicted entities from the NER module to the RE module; missing entities therefore never have the chance to be re-predicted, so these models suffer from the error propagation issue. On the other hand, joint decoding approaches (e.g., Table Filling methods (Miwa and Sasaki, 2014; Zhang et al., 2017; Wang and Lu, 2020)) do not suffer from error propagation, but it is hard for them to incorporate markers for leveraging LLMs, since entities are not predicted prior to relations. Our desire is to obtain the best of both worlds: using the marker-based encoding mechanism to enhance RE performance while alleviating the error propagation problem. We adopt PL-marker as the backbone of our proposed model and use a span pruning strategy to mitigate error propagation. That is, instead of sending only predicted entity spans to the RE module, we over-predict candidate spans so that the recall of gold entity spans is nearly perfect (though many non-entity spans may also be kept), transferring the burden of entity classification and labeling from the NER module to the RE module of PL-marker. The number of over-predicted spans is upper-bounded, balancing the computational complexity of marker-based encoding against the recall of gold entity spans. Empirically, we find that this simple strategy by itself clearly improves PL-marker.
We further incorporate a higher-order interaction module into our model. Most previous ERE models either implicitly model the interactions between instances through shared parameters (Wang and Lu, 2020; Yan et al., 2021; Wang et al., 2021) or use a traditional graph neural network that models pairwise connections between a relation and an entity (Sun et al., 2019). It is difficult for these approaches to explicitly model higher-order relationships among multiple instances, e.g., the dependency among a relation and its corresponding subject and object entities. Many recent works on structured prediction tasks show that explicit higher-order modeling is still beneficial even with powerful large pretrained encoders (Zhang et al., 2020a; Li et al., 2020; Yang and Tu, 2022; Zhou et al., 2022, inter alia), motivating us to use an additional higher-order module to enhance performance.
A common higher-order modeling approach is probabilistic modeling (i.e., a conditional random field (CRF)) with end-to-end Mean-Field Variational Inference (MFVI), which can be seamlessly integrated into neural networks as a recurrent neural network layer (Zheng et al., 2015a) and has been widely used in various structured prediction tasks, such as dependency parsing (Wang et al., 2019), semantic role labeling (Li et al., 2020; Zhou et al., 2022), and information extraction (Jia et al., 2022). However, CRF modeling with MFVI has two limitations: i) the CRF's potential functions are parameterized in log-linear forms with strong independence assumptions, suffering from low model capacity (Qu et al., 2022); ii) MFVI uses fully-factorized Bernoulli distributions to approximate the otherwise multimodal true posterior distributions, oversimplifying the inference problem and thus being sub-optimal. Therefore we need more expressive tools to improve the quality of higher-order inference. Fortunately, many recent works in the machine learning community show that graph neural networks (GNNs) can be used as an inference tool and outperform approximate statistical inference algorithms (e.g., MFVI) (Yoon et al., 2018; Zhang et al., 2020b; Kuck et al., 2020; Satorras and Welling, 2021) (see Hua (2022) for a survey). Inspired by these works, we employ a hypergraph neural network (HyperGNN) instead of MFVI for higher-order inference and propose our model HGERE (HyperGraph Neural Network for ERE). Concretely, we build a hypergraph where nodes are candidate subjects and objects (obtained from the span pruner) and relations thereof, and hyperedges encode the interactions either between two relations sharing an entity or between a relation and its associated subject and object entity spans. In contrast, existing GNN models for IE (Sun et al., 2019; Nguyen et al., 2021) only model the pairwise interactions between a relation and one of its corresponding entities. We empirically show the advantages of our higher-order interaction module (i.e., the hypergraph neural network) over MFVI and traditional GNN models.
Our contribution is three-fold: i) we adopt a simple and effective span pruning method to mitigate the error propagation issue, reinforcing the power of marker-based encoding; ii) we propose a novel hypergraph neural network enhanced higher-order model, outperforming higher-order CRF-based models with MFVI; iii) we show large improvements over the prior state-of-the-art PL-marker on three commonly used benchmarks for ERE: ACE2004, ACE2005 and SciERC.

Problem formulation
Given a sentence $X$ with $n$ tokens $x_1, x_2, \ldots, x_n$, an entity span is a sequence of tokens labeled with an entity type, and a relation is a pair of entity spans labeled with a relation type. We denote the set of all candidate spans of the sentence with a span length limit $L$ by $S(X) = \{s_1, s_2, \ldots, s_m\}$, and define $\mathrm{ST}(i)$ and $\mathrm{ED}(i)$ as the start and end token indices of the span $s_i$.
The joint ERE task is to simultaneously solve the NER and RE tasks. Let $C_e$ be the set of entity types and $C_r$ the set of relation types. For each span $s_i \in S(X)$, the NER task is to predict an entity type $y_e(s_i) \in C_e$, or $y_e(s_i) = \texttt{null}$ if the span $s_i$ is not an entity. The RE task is to predict a relation type $y_r(r_{ij}) \in C_r$, or $y_r(r_{ij}) = \texttt{null}$, for each span pair $r_{ij} = (s_i, s_j)$ with $s_i, s_j \in S(X)$.

Packed levitated marker (PL-marker)
Zhong and Chen (2021) insert two pairs of solid markers (i.e., [S] and [\S]) to highlight both the subject and object entity spans in a given sentence, and this simple approach achieves state-of-the-art RE performance. We posit that this is because the LLM is more aware of the subject and object spans (with markers) and can thus produce better span representations for RE. But this strategy needs to iterate over all possible entity span pairs and is therefore computationally expensive. Levitated markers alleviate this cost: a pair of levitated markers for a span is appended to the end of the input rather than inserted in place, shares the position embeddings of the span's boundary tokens, and is made visible only to its partner marker through a directional attention mask, so that many span pairs can be encoded in a single pass. As a consequence, the relative positions of levitated markers in the concatenated sentence do not matter at all, eliminating potential implausible inductive bias on the concatenation order.
However, marker-based encoding is only used in RE, not in NER. To leverage marker-based encoding in the NER module for modeling span interrelations, PL-marker (Ye et al., 2022) associates each possible span with two levitated markers and concatenates all of them to the end of the input sentence. However, this strategy could make the input sentence extremely long, since there is a quadratic number of spans. To solve this issue, PL-marker clusters the markers based on the starting positions of their corresponding spans and divides them into N groups. The input sentence is then duplicated N times and each group of levitated markers is concatenated to the end of one sentence copy. Ye et al. (2022) refer to this strategy as the neighborhood-oriented packing scheme (a minimal sketch is given below). Furthermore, to balance efficiency and model expressiveness, Ye et al. (2022) combine solid markers and levitated markers, proposing Subject-oriented Packing for Span Pair in the RE module. That is, if there are m entities, they copy the sentence m times; for each copy, they use solid markers to mark a different entity as the subject and concatenate the levitated markers of all other entities (as objects) at the end of the sentence.
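The packing idea can be made concrete with a small sketch. The snippet below is only illustrative: the marker token strings, the group size, and the helper name are assumptions rather than PL-marker's actual implementation, and real levitated markers would additionally share position embeddings with their span boundaries instead of being plain appended tokens.

```python
from typing import List, Tuple

def pack_levitated_markers(
    tokens: List[str],
    spans: List[Tuple[int, int]],   # (start, end) token indices of candidate spans
    group_size: int = 64,           # assumed maximum number of spans per sentence copy
) -> List[List[str]]:
    """Neighborhood-oriented packing (sketch): sort spans by start position,
    split them into groups, and append one pair of levitated markers per span
    to a separate copy of the sentence for each group."""
    spans = sorted(spans, key=lambda s: s[0])            # cluster by start position
    groups = [spans[i:i + group_size] for i in range(0, len(spans), group_size)]
    packed = []
    for group in groups:
        copy = list(tokens)
        for (st, ed) in group:
            # a real model ties these markers to positions st and ed via
            # shared position embeddings; here we only show the packing order
            copy += [f"[O:{st}-{ed}]", f"[/O:{st}-{ed}]"]
        packed.append(copy)
    return packed

if __name__ == "__main__":
    sent = "the scalar adjustment method is used for domain adaptation".split()
    cand = [(1, 3), (4, 4), (8, 9)]
    for copy in pack_levitated_markers(sent, cand, group_size=2):
        print(" ".join(copy))
```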

Method
Overview. Our method is built upon the state-of-the-art PL-marker. We employ a high-recall span pruner to obtain candidate entity spans, similar to the NER module in PL-marker. However, instead of aiming to accurately predict all entity spans, our pruner focuses on removing unlikely candidates so as to achieve a much higher recall. We then feed the candidate span set to the RE module to obtain entity and relation representations, which are used to initialize the node representations of our hypergraph neural network for higher-order inference with a message passing scheme. Finally, we perform NER and RE based on the refined entity and relation representations. Fig. 1 depicts the neural architecture of our model.

Span Pruner
We adopt the neighborhood-oriented packing scheme from PL-marker for span encoding, except that we simply predict entity existence (i.e., binary classification) instead of predicting entity labels during the training phase. See Appendix A.4 for details.
To produce a candidate span set, we rank all spans by their scores and take the top $K$ as our prediction $S_p(X)$. We assume that the number of entity spans of a sentence is linear in its length $n$, so $K$ is set to $\lambda \cdot n$ where $\lambda$ is a coefficient. For a very long sentence, the number of entity spans is often sublinear in $n$, while for a very short sentence we wish to keep enough candidate spans, so we additionally set lower and upper bounds $l_{\min}$ and $l_{\max}$, i.e., $K = \min(\max(\lambda \cdot n, l_{\min}), l_{\max})$. In practice, with our span pruner, more than 99% of gold entity spans are included in the candidate set for all three datasets. If we predict entities as in PL-marker instead of pruning, only around 95% and 80% of gold entities are kept in the predicted entities for ACE2005 and SciERC respectively, leading to severe error propagation (see §5.1 for an ablation study).
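As a minimal sketch of the pruning rule, assuming per-span pruner scores are already available and using the $\lambda$, $l_{\min}$ and $l_{\max}$ values reported in the appendix (the function name is hypothetical):

```python
import torch

def prune_spans(scores: torch.Tensor, sent_len: int,
                lam: float = 0.5, l_min: int = 3, l_max: int = 18) -> torch.Tensor:
    """Keep the top-K spans by score, with K = clamp(round(lam * n), l_min, l_max).
    `scores` holds one pruner score per candidate span."""
    k = max(l_min, min(l_max, int(round(lam * sent_len))))
    k = min(k, scores.numel())               # cannot keep more spans than we have
    topk = torch.topk(scores, k).indices     # indices of the kept candidate spans
    return topk

# toy usage: 10 candidate spans for a 14-token sentence
kept = prune_spans(torch.randn(10), sent_len=14)
print(kept)
```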
The span pruner is trained independently from the joint ERE model introduced in the next section. This is because the joint ERE training loss is defined on the candidate entity spans produced by the span pruner. When sharing parameters, the pruner would provide a changing candidate span set during training, leading to moving targets and thereby destabilizing the whole training process.

Joint ERE Model: First-order Backbone
The backbone module is based on the RE module of PL-marker. Concretely, given an input sentence $X = \{x_1, x_2, \ldots, x_n\}$ and a subject span $s_i = (x_{\mathrm{ST}(i)}, x_{\mathrm{ED}(i)}) \in S_p(X)$ provided by the span pruner, every entity span $s_j \in S_p(X)$, $1 \le j \le K$, $j \ne i$, could be a candidate object span of $s_i$. The module inserts a pair of solid markers [S] and [\S] before and after the subject span and assigns every object span $s_j$ a pair of levitated markers $[O]_j$ and $[\backslash O]_j$. The levitated markers are packed together and appended to the end of the input sequence fed to a PLM. We then obtain the contextualized hidden representations $h_x$ of the modified input sequence, and the final subject representation is
$$h^i_s = \mathrm{FFN}_s([h_{[S]}; h_{[\backslash S]}]),$$
where FFN represents a single linear layer in this work. The object representation of $s_j$ for the current subject $s_i$ and the representation of the relation $r_{ij} = (s_i, s_j)$ are
$$h^i_{o_j} = \mathrm{FFN}_o([h_{[O]_j}; h_{[\backslash O]_j}]), \qquad h_{r_{ij}} = \mathrm{FFN}_r([h^i_s; h^i_{o_j}]).$$
Repeating this procedure $K$ times, we obtain all $K$ subject representations and $K(K-1)$ relation representations.
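A rough sketch of how a subject-oriented input could be assembled; the marker strings and the function name are hypothetical, and a real implementation would tie the levitated markers' position embeddings to their object spans rather than appending plain tokens:

```python
from typing import List, Tuple

def build_subject_oriented_input(
    tokens: List[str],
    subject: Tuple[int, int],                 # (start, end) of the subject span
    objects: List[Tuple[int, int]],           # candidate object spans from the pruner
) -> List[str]:
    """Subject-oriented packing (sketch): wrap the subject with solid markers and
    append one pair of levitated markers per candidate object at the end."""
    st, ed = subject
    seq = tokens[:st] + ["[S]"] + tokens[st:ed + 1] + ["[/S]"] + tokens[ed + 1:]
    for j, (ost, oed) in enumerate(objects):
        # levitated marker pair for the j-th candidate object span
        seq += [f"[O_{j}]", f"[/O_{j}]"]
    return seq

print(" ".join(build_subject_oriented_input(
    "Steve Jobs founded Apple in California".split(), (0, 1), [(3, 3), (5, 5)])))
```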
As the object representation of $s_j$ is not identical for different subject spans $s_i$, there are $K$ object representation sets $h^i_o$, $1 \le i \le K$. We apply a max-pooling layer to obtain a unique object representation for each object span $s_j \in S_p(X)$:
$$h_{o_j} = \mathrm{max\text{-}pool}\big(\{\, h^i_{o_j} \mid 1 \le i \le K,\ i \ne j \,\}\big).$$

Joint ERE Model: Higher-order Inference with Hypergraph Neural Networks

Hypergraph Building So far, the representations of the entities and relations from the backbone module do not explicitly consider beneficial interactions among related instances. To model higher-order interactions among a relation and its associated subject and object entities, as well as between any two relations sharing an entity, we build a hypergraph $G = (V, E)$ connecting the related instances (a construction sketch is given below). The node set $V = V_s \cup V_o \cup V_r$ is composed of candidate subjects and objects (provided by the span pruner) and all possible pairwise relations thereof. The hyperedges $E$ capture the interactions we are concerned with and can be divided into two categories: the subject-object-relation (sub-obj-rel) hyperedges $E_{sor}$ and the relation-relation (rel-rel) hyperedges $E_{rr}$. Each hyperedge $e^{ij}_{sor} \in E_{sor}$ connects a subject node $v^i_s$, an object node $v^j_o$ and the corresponding relation node $v^{ij}_r$; we refer to these hyperedges as ternary edges (ter for short). Each rel-rel edge $e^{ijk}_{rr} \in E_{rr}$ connects two relation nodes with a shared subject or object entity. Assuming that in a relation the subject is the parent node and the object is the child node, we can refine rel-rel edges into three subtypes, following the common definitions in the dependency parsing literature: sibling (sib, connecting $v^{ij}_r$ and $v^{ik}_r$), co-parent (cop, connecting $v^{ij}_r$ and $v^{kj}_r$), and grandparent (gp, connecting $v^{ij}_r$ and $v^{jk}_r$).
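The hyperedge enumeration described above can be sketched as follows; the node and edge encodings here are illustrative (spans are reduced to integer indices), not the paper's actual data structures:

```python
from itertools import permutations

def build_hypergraph(num_spans: int):
    """Enumerate the hyperedges of the tersibcopgp variant (sketch).
    Nodes: subject i, object j, and relation (i, j) for all i != j."""
    rel_nodes = [(i, j) for i, j in permutations(range(num_spans), 2)]
    ter = [(i, j, (i, j)) for (i, j) in rel_nodes]                    # subject, object, relation
    sib = [((i, j), (i, k)) for (i, j) in rel_nodes
           for k in range(num_spans) if k not in (i, j)]              # shared subject (parent)
    cop = [((i, j), (k, j)) for (i, j) in rel_nodes
           for k in range(num_spans) if k not in (i, j)]              # shared object (child)
    gp = [((i, j), (j, k)) for (i, j) in rel_nodes
          for k in range(num_spans) if k not in (i, j)]               # object of one is subject of the other
    return {"ter": ter, "sib": sib, "cop": cop, "gp": gp}

edges = build_hypergraph(3)
print({k: len(v) for k, v in edges.items()})
```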
If we incorporate all the aforementioned hyperedges into the hypergraph, we obtain the tersibcopgp variant, which is illustrated in Fig. 1. By removing some types of hyperedges we can obtain different variants; without loss of generality, we describe the message passing scheme below using tersibcopgp.
As such, we could define a CRF on the hypergraph and leverage probabilistic inference algorithms such as MFVI for higher-order inference. However, as discussed in §1, we prefer a more expressive method to improve inference quality and introduce a HyperGraph Neural Network (HGNN), described next.
Initial node representation For a relation node $v^{ij}_r$ with its associated subject node $v^i_s$ and object node $v^j_o$, we use $g^l(v^{ij}_r)$, $g^l(v^i_s)$, $g^l(v^j_o)$ to denote their respective representations output by the $l$-th HGNN layer. The initial node representations (before being fed to the HGNN) are taken from the backbone module: $g^0(v^{ij}_r) = h_{r_{ij}}$, $g^0(v^i_s) = h^i_s$, and $g^0(v^j_o) = h_{o_j}$, respectively.
Message representation A hyperedge serves as the bridge for message passing between the nodes it connects. Let $N_e(v)$ be the set of hyperedges connected to a node $v$.
For a ter hyperedge $e^{ij}_{ter} \in E_{sor}$ connecting a subject node $v^i_s$, an object node $v^j_o$ and a relation node $v^{ij}_r$, the message representation it carries is
$$m(e^{ij}_{ter}) = \mathrm{FFN}^{ter}_s\big(g^l(v^i_s)\big) \circ \mathrm{FFN}^{ter}_o\big(g^l(v^j_o)\big) \circ \mathrm{FFN}^{ter}_r\big(g^l(v^{ij}_r)\big),$$
where $\circ$ is the Hadamard product.
A rel-rel edge $e^{ijk}_z \in E_{rr}$, $z \in \{sib, cop, gp\}$, connects two relations sharing an entity; for simplicity we denote them relation $a$ and relation $b$. If we fix $a \triangleq v^{ij}_r$, then, as previously described, relation $b$ is $v^{ik}_r$ for a sib edge, $v^{kj}_r$ for a cop edge, and $v^{jk}_r$ for a gp edge. The message carried by $e^{ijk}_z$ is given by
$$m(e^{ijk}_z) = \mathrm{FFN}^z_a\big(g^l(a)\big) \circ \mathrm{FFN}^z_b\big(g^l(b)\big).$$

Node representation update We aggregate the messages for each node $v \in V$ from its adjacent edges $N_e(v)$ with an attention mechanism, taking a learned weighted sum, and add the aggregated message to the prior node representation:
$$\alpha_e = \underset{e \in N_e(v)}{\mathrm{softmax}}\big(w^\top m(e)\big), \qquad g^{l+1}(v) = g^l(v) + \sigma\Big(W \sum_{e \in N_e(v)} \alpha_e\, m(e)\Big),$$
where $\sigma(\cdot)$ is a non-linear activation and $w$, $W$ are trainable parameters. An entity node receives messages only from ter edges, while a relation node receives messages from both ter edges and rel-rel edges. A minimal sketch of one ter message and the attention-based update is given below.
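The sketch below illustrates one ter message and the attention-based node update; the layer names, dimensions and the ReLU activation are assumptions for illustration, not the exact parameterization used in HGERE:

```python
import torch
import torch.nn as nn

class HyperEdgeUpdate(nn.Module):
    """One round of message passing over ter hyperedges (sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.ffn_s, self.ffn_o, self.ffn_r = (nn.Linear(d, d) for _ in range(3))
        self.w = nn.Linear(d, 1, bias=False)      # attention scorer over incoming messages
        self.W = nn.Linear(d, d)                  # message transformation before the update

    def ter_message(self, g_s, g_o, g_r):
        # Hadamard product of the transformed subject, object and relation representations
        return self.ffn_s(g_s) * self.ffn_o(g_o) * self.ffn_r(g_r)

    def update(self, g_v, messages):
        # messages: (num_adjacent_edges, d); attention-weighted sum added to the old state
        alpha = torch.softmax(self.w(messages).squeeze(-1), dim=0)
        agg = (alpha.unsqueeze(-1) * messages).sum(dim=0)
        return g_v + torch.relu(self.W(agg))

layer = HyperEdgeUpdate(d=8)
g_s, g_o, g_r = torch.randn(3, 8)
m = layer.ter_message(g_s, g_o, g_r)
print(layer.update(g_r, m.unsqueeze(0)).shape)
```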
Training We obtain the refined representations $g^l(v)$ from the final layer of the HGNN. Given an entity span $s_i \in S_p(X)$, we concatenate the corresponding subject representation $g^l(v^i_s)$ and object representation $g^l(v^i_o)$ to obtain the entity representation, and compute the probability distribution over the types $C_e \cup \{\texttt{null}\}$:
$$P_e\big(y_e(s_i) \mid s_i\big) = \mathrm{softmax}\Big(\mathrm{FFN}\big([g^l(v^i_s); g^l(v^i_o)]\big)\Big).$$
Given a relation $r_{ij} = (s_i, s_j)$, $s_i, s_j \in S_p(X)$, we compute the probability distribution over the types $C_r \cup \{\texttt{null}\}$:
$$P_r\big(y_r(r_{ij}) \mid r_{ij}\big) = \mathrm{softmax}\Big(\mathrm{FFN}\big(g^l(v^{ij}_r)\big)\Big).$$
We use the cross-entropy loss for both entity and relation prediction:
$$\mathcal{L}_e = -\sum_{s_i \in S_p(X)} \log P_e\big(y^*_e(s_i) \mid s_i\big), \qquad \mathcal{L}_r = -\sum_{r_{ij}} \log P_r\big(y^*_r(r_{ij}) \mid r_{ij}\big),$$
where $y^*_e$ and $y^*_r$ are the gold entity and relation types respectively. The total loss is $\mathcal{L} = \mathcal{L}_e + \mathcal{L}_r$.
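A minimal sketch of the classification heads and the cross-entropy objective, assuming the refined node states are already computed; the module and layer names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EREHeads(nn.Module):
    """Entity and relation classifiers over the refined HGNN node states (sketch)."""

    def __init__(self, d: int, num_ent: int, num_rel: int):
        super().__init__()
        self.ent = nn.Linear(2 * d, num_ent + 1)   # +1 for the null type
        self.rel = nn.Linear(d, num_rel + 1)

    def forward(self, g_subj, g_obj, g_rel):
        # entity logits use the concatenated subject/object views of each span
        p_ent = self.ent(torch.cat([g_subj, g_obj], dim=-1))
        p_rel = self.rel(g_rel)
        return p_ent, p_rel

heads = EREHeads(d=8, num_ent=4, num_rel=6)
p_ent, p_rel = heads(torch.randn(5, 8), torch.randn(5, 8), torch.randn(20, 8))
loss = F.cross_entropy(p_ent, torch.zeros(5, dtype=torch.long)) \
     + F.cross_entropy(p_rel, torch.zeros(20, dtype=torch.long))
print(loss.item())
```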

Evaluation metrics
We report micro labeled F1 scores for NER and RE. For RE, the difference between Rel and Rel+ is that the former requires correct prediction of the subject and object entity spans and the relation type between them, while the latter additionally requires correct prediction of the subject and object entity types.

Main results

Table 1 shows the main results. Surprisingly, Backbone outperforms prior approaches in almost all metrics by a large margin (except on ACE2004 with BERT_B and ACE2005 with ALBERT), which we attribute to the reduction of error propagation through the span pruning mechanism. Our proposed model HGERE outperforms almost all baselines on all metrics (except the entity metric on ACE2004), validating that using hyperedges to encode higher-order interactions is effective (compared with GCN) and that using hypergraph neural networks for higher-order modeling and inference is better than CRF-based probabilistic modeling with MFVI. Finally, we remark that HGERE obtains state-of-the-art performance on all three datasets.

Effectiveness of the span pruner
To study the effectiveness of the span pruner, we replace it with an entity identifier, which is the original NER module from PL-marker trained only on entity existence. The performance of the span pruner and the entity identifier (denoted by Eid) on entity existence is shown in Table 2. We can observe that when the span pruner is replaced with the entity identifier, the recall of gold unlabeled entity spans drops considerably. The corresponding end-task results are shown in Table 3. We can see that without a span pruner, both NER and RE performance drop significantly, validating the usefulness of the span pruner. Moreover, this has a consequent influence on the higher-order inference module (i.e., the HGNN). Without a span pruner, the improvement from using an HGNN over Backbone is marginal compared to that with a span pruner. We posit that without a pruner, many gold entity spans are missing from the hypergraph of the HGNN, making true entities and relations less connected in the hypergraph and thus diminishing the usefulness of the HGNN.

Effect of the choices of hyperedges
We compare variants of HGNN with different combinations of hyperedges. Note that if ter is not used, entity nodes have no hyperedges connected to them, so their representations are not refined. We can see that in the sib and cop variants, the NER performance improves slightly, which we attribute to the shared encoder of the NER and RE tasks. In the ter variant, by contrast, entity node representations are iteratively refined, resulting in significantly better NER performance than Backbone (74.2 vs. 71.3). Combining ter edges with other rel-rel edges (e.g., sib) is generally better than using ter alone in terms of NER performance, suggesting that joint (and higher-order) modeling of NER and RE indeed has a positive influence on NER, while prior pipeline approaches (e.g., PL-marker) cannot enjoy the benefit of such joint modeling.
For RE, sib and cop have positive effects on performance (though gp somewhat degrades it), showing the advantage of modeling interactions between two different relations. Further combining them with ter improves RE performance in all cases, indicating that NER also has a positive effect on RE and confirming again the advantage of joint modeling of NER and RE.

Inference speed of the higher-order module
To analyze the computational cost of our higher-order module, we compare the inference speed of HGERE with three baseline models, Backbone, GCN and MFVI, on the test sets of SciERC and ACE2005. Inference speed is measured by the number of candidate entities processed per second. The results are shown in Table 5. We observe that when utilizing a relatively smaller PLM, HGERE, GCN and MFVI are slightly slower than the first-order model Backbone, though the speed difference between HGERE and the other models is relatively small. When using ALBERT, which is much slower than BERT_B, all four models demonstrate comparable inference speeds.

Error correction analysis
We provide a quantitative error correction analysis between our higher-order approach HGERE and the first-order baseline Backbone on the SciERC dataset in Fig. 2. We can see that most error corrections of entities and relations made by HGERE come from two categories: the first, where Backbone incorrectly predicts a true entity or relation as null, and the second, where Backbone incorrectly assigns a label to a null sample.
Related Work

Our work is similar to Sun et al. (2019) and Nguyen et al. (2021) in that all use a graph neural network to enhance instance representations. The main difference is that the GCN they use cannot adequately model higher-order relationships among multiple instances, whereas our hypergraph neural network is designed for higher-order modeling.
CRF-based higher-order models A commonly used class of higher-order models applies approximate inference algorithms (mean-field variational inference or loopy belief propagation) on CRFs. Zheng et al. (2015b) formulate the mean-field variational inference algorithm on CRFs as a stack of recurrent neural network layers, leading to an end-to-end model for training and inference. Many higher-order models employ this technique for various NLP tasks, such as semantic parsing (Wang et al., 2019; Wang and Tu, 2020) and information extraction (Jia et al., 2022).
Hypergraph neural networks The hypergraph neural network (HyperGNN) is another way to construct a higher-order model. Traditional graph neural networks employ pairwise connections among nodes, whereas HyperGNNs use a hypergraph structure for data modeling. Feng et al. (2019) and Bai et al. (2021) proposed spectral-based HyperGNNs utilizing the normalized hypergraph Laplacian. Arya et al. (2020) proposed a spatial-based HyperGNN which aggregates messages in a two-stage procedure. Huang and Yang (2021) proposed UniGNN, a unified framework for interpreting the message passing process in HyperGNNs. Gao et al. (2023) introduced a general high-order multi-modal data correlation modeling framework to learn an optimal representation in a single hypergraph-based framework.

Conclusion
In this paper, we presented HGERE, a joint entity and relation extraction model equipped with a span pruning mechanism and a higher-order interaction module (i.e., an HGNN). We found that the span pruning mechanism by itself greatly improves performance over the prior state-of-the-art PL-marker, indicating the existence of the error propagation problem in pipeline methods. We compared our model with prior traditional GNN-based models, which do not contain hyperedges connecting multiple instances, and showed clear improvements, suggesting that modeling higher-order interactions between multiple instances is beneficial. Finally, we compared our model with the most popular higher-order CRF models with MFVI and showed the advantages of the HGNN for higher-order modeling.

Limitations
Our model achieves a significant improvement in most cases (on ACE2004 and SciERC, and on ACE2005 with BERT_B), while on ACE2005 with a stronger encoder (e.g., ALBERT) we observe less significant improvements. We posit that, with powerful encoders, the recall of gold entity spans increases, thereby mitigating the error propagation issue and diminishing the benefit of the span pruning mechanism.
Another concern regarding our model is computational efficiency. The time complexity of the Subject-oriented Packing for Span Pair encoding scheme from PL-marker grows linearly with the number of candidate spans. Recall that we over-predict many spans with the span pruning mechanism, which slows down inference. In practice, our model's running time is around three times that of PL-marker.
For pruner training and inference, we use a span length limit L of 12 for SciERC and 8 for ACE2004 and ACE2005. For the pruners of all datasets and PLMs, the top-K ratio is λ = 0.5 and the bounds on K are l_min = 3, l_max = 18. We use three hypergraph convolution layers for GCN, MFVI and HGERE. As the entity recall is high enough, the pruners used on ACE2004 and ACE2005 are trained only with BERT_B. For all experiments, we run each configuration with 5 different seeds and report the average micro-F1 scores and standard deviations.
For the pruner, the output sizes of $\mathrm{FFN}_{ST}$, $\mathrm{FFN}_{ED}$ and $\mathrm{FFN}_q$ are $d_m = 768$, the bi-affine embedding size is $d_{biaf} = 256$, and the output size of $\mathrm{FFN}_{attn}$ is 256.
For the backbone module, the output sizes of $\mathrm{FFN}_s$, $\mathrm{FFN}_o$ and $\mathrm{FFN}_r$ are tuned on [400, 512, 768] for all datasets.
For the hypergraph neural network, the output sizes of $\mathrm{FFN}^{ter}_r$, $\mathrm{FFN}^{ter}_s$, $\mathrm{FFN}^{ter}_o$, $\mathrm{FFN}^z_a$ and $\mathrm{FFN}^z_b$ are tuned among [256, 400, 512] and fixed to 400 for all experiments on SciERC. The output sizes of $\mathrm{FFN}^{ter}_e$ and $\mathrm{FFN}^z_e$ are tuned on [256, 400, 512, 768] for all experiments. We train our models with the Adam optimizer and a linear scheduler with a warmup ratio of 0.1. We tune the eps of the Adam optimizer on [1e-8, 1e-9] for ACE2005 and use eps = 1e-8 for the other datasets. The batch size of all experiments is 18. The learning rate of the PLM is 2e-5; for the other modules the learning rate is tuned on [5e-5, 1e-4]. We train for 20 epochs on SciERC for Backbone and 30 for the other models; on ACE2004 and ACE2005 we train for 15 epochs with BERT_B and 10 epochs with ALBERT. All experiments are run on an A40 GPU with the apex fp16 training option enabled.

A.4 Details of the span pruner

We obtain contextualized representations of the tokens $x$ and levitated marker representations $\hat{x}^s$ (for [O]) and $\hat{x}^e$ (for [\O]). We then concatenate two kinds of span representations, bi-affine (Dozat and Manning, 2016) and attentive pooling, to form the final one. For a span $s_i$ consisting of tokens $x_{ST(i)}, \ldots, x_{ED(i)}$, its bi-affine span representation is
$$b_i = [\mathrm{FFN}_{ST}(x_{ST(i)}); 1]^\top\, W_p\, [\mathrm{FFN}_{ED}(x_{ED(i)}); 1],$$
a $d_{biaf}$-dimensional vector, where the symbol $;$ denotes concatenation, $\mathrm{FFN}_{ST}$ and $\mathrm{FFN}_{ED}$ are feed-forward layers with output size $d_m$, and $W_p \in \mathbb{R}^{(d_m+1) \times d_{biaf} \times (d_m+1)}$ is a learnable weight. The attentive pooling layer computes a weighted average over the contextualized token representations in the span, $a_i = \sum_{j=ST(i)}^{ED(i)} w_j x_j$, and the final span representation is the concatenation of the bi-affine and attentive-pooling representations.

Training and Inference Given the gold binary tag $y(s_i) \in \{0, 1\}$ (indicating whether a candidate span is in the gold span set), we train the span pruner with the binary cross-entropy (BCE) loss:
$$\mathcal{L}_{prune} = -\sum_{s_i} \big[\, y(s_i) \log p(s_i) + (1 - y(s_i)) \log(1 - p(s_i)) \,\big].$$
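A compact sketch of the pruner's span scorer, combining the bi-affine and attentive-pooling representations into a binary existence logit; only the reported sizes d_m = 768 and d_biaf = 256 are taken from above, while the classifier wiring is an assumption and the marker representations are omitted for brevity:

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Bi-affine plus attentive-pooling span representation with a binary logit (sketch)."""

    def __init__(self, d_m: int = 768, d_biaf: int = 256):
        super().__init__()
        self.ffn_st = nn.Linear(d_m, d_m)
        self.ffn_ed = nn.Linear(d_m, d_m)
        self.W_p = nn.Parameter(torch.randn(d_m + 1, d_biaf, d_m + 1) * 0.01)
        self.ffn_attn = nn.Linear(d_m, 1)          # attentive-pooling scorer
        self.cls = nn.Linear(d_biaf + d_m, 1)      # binary existence logit (assumed head)

    def forward(self, h: torch.Tensor, st: int, ed: int) -> torch.Tensor:
        # bi-affine representation from the boundary tokens (with a bias dimension)
        u = torch.cat([self.ffn_st(h[st]), h.new_ones(1)])
        v = torch.cat([self.ffn_ed(h[ed]), h.new_ones(1)])
        biaf = torch.einsum("i,ibj,j->b", u, self.W_p, v)
        # attentive pooling over the tokens inside the span
        span = h[st:ed + 1]
        w = torch.softmax(self.ffn_attn(span).squeeze(-1), dim=0)
        pooled = (w.unsqueeze(-1) * span).sum(dim=0)
        return self.cls(torch.cat([biaf, pooled]))  # trained with BCE-with-logits

scorer = SpanScorer()
h = torch.randn(12, 768)                            # contextualized token states
print(scorer(h, st=2, ed=4).shape)
```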

A.5 Mean-Field Variational Inference
Here we introduce the method used in the baseline MFVI. The hyperedges in our graph are replaced by factors in MFVI, so there are also four kinds of factors: ter, sib, cop and gp.
First-order scores We use the node representations to score the entities and relations for each label (including null).
Higher-order scores Each factor scores the joint distribution of the types of the nodes it connects. For a ter factor connecting a subject $v^i_s$, an object $v^j_o$ and a relation $v^{ij}_r$, the factor score $f^{ter}_{ij} \in \mathbb{R}^{(|C_e|+1) \times (|C_e|+1) \times (|C_r|+1)}$ is computed from the three node representations. For a factor of type $z \in \{sib, cop, gp\}$ connecting two relations, we call them relation $a$ and relation $b$ for simplicity. If relation $a$ is $v^{ij}_r$, then relation $b$ is $v^{ik}_r$, $v^{kj}_r$ and $v^{jk}_r$ for sib, cop and gp respectively. We use $g(a)$ and $g(b)$ to denote the representations of relations $a$ and $b$; the factor score $f^z_{ijk} \in \mathbb{R}^{(|C_r|+1) \times (|C_r|+1)}$ is computed from them.

Higher-order inference In this model, computing the node distributions can be seen as posterior inference in a conditional random field (CRF). MFVI iteratively updates a factorized variational distribution $Q$ to approximate the posterior label distribution. We use $Q_{s_i}(e_1)$ and $Q_{o_j}(e_2)$ for the probabilities that the subject $v^i_s$ and the object $v^j_o$ have entity types $e_1$ and $e_2$ respectively, and $Q_{r_{ij}}(r)$ for the probability that the relation $v^{ij}_r$ has relation type $r$. For simplicity, we use $u_{s_i}(e_1)$, $u_{o_j}(e_2)$, $u_{r_{ij}}(r_1)$, $f^{ter}_{ij}(e_1, e_2, r_1)$ and $f^z_{ijk}(r_1, r_2)$ to denote the first-order and higher-order scores when the subject $v^i_s$ and the object $v^j_o$ have entity types $e_1$ and $e_2$, and relation $a$ ($v^{ij}_r$) and relation $b$ have relation types $r_1$ and $r_2$ respectively. We initialize $Q$ for the subject $v^i_s$, the object $v^j_o$ and the relation $v^{ij}_r$ by normalizing the unary potentials $\exp(u_{s_i})$, $\exp(u_{o_j})$ and $\exp(u_{r_{ij}})$ respectively. The iterative updates of $Q$ are as follows. The message passed to the subject $v^i_s$ (only from ter factors) in the $l$-th iteration is
$$F^l_{s_i}(e_1) = \sum_{j} \sum_{e_2, r_1} Q^{l-1}_{o_j}(e_2)\, Q^{l-1}_{r_{ij}}(r_1)\, f^{ter}_{ij}(e_1, e_2, r_1);$$
similarly, the message passed from ter factors to the object $v^j_o$ is
$$F^l_{o_j}(e_2) = \sum_{i} \sum_{e_1, r_1} Q^{l-1}_{s_i}(e_1)\, Q^{l-1}_{r_{ij}}(r_1)\, f^{ter}_{ij}(e_1, e_2, r_1).$$
For a relation $v^{ij}_r$, messages can be passed from four kinds of factors, listed by source. From the ter factor:
$$F^{ter,l}_{r_{ij}}(r_1) = \sum_{e_1, e_2} Q^{l-1}_{s_i}(e_1)\, Q^{l-1}_{o_j}(e_2)\, f^{ter}_{ij}(e_1, e_2, r_1).$$
From sib factors:
$$F^{sib,l}_{r_{ij}}(r_1) = \sum_{k} \sum_{r_2} Q^{l-1}_{r_{ik}}(r_2)\, f^{sib}_{ijk}(r_1, r_2).$$
From cop factors:
$$F^{cop,l}_{r_{ij}}(r_1) = \sum_{k} \sum_{r_2} Q^{l-1}_{r_{kj}}(r_2)\, f^{cop}_{ijk}(r_1, r_2).$$
From gp factors:
$$F^{gp,l}_{r_{ij}}(r_1) = \sum_{k} \sum_{r_2} Q^{l-1}_{r_{jk}}(r_2)\, f^{gp}_{ijk}(r_1, r_2).$$
The posterior distributions of entity types with respect to the subject $s_i$ and the object $o_j$ are then
$$Q^l_{s_i}(e) \propto \exp\big(u_{s_i}(e) + F^l_{s_i}(e)\big), \qquad Q^l_{o_j}(e) \propto \exp\big(u_{o_j}(e) + F^l_{o_j}(e)\big),$$
and the entity distribution of a span is obtained from its subject and object views. The posterior distribution of the relation $r_{ij}$ is
$$Q^l_{r_{ij}}(r) \propto \exp\Big(u_{r_{ij}}(r) + \mathbf{1}_{ter} F^{ter,l}_{r_{ij}}(r) + \mathbf{1}_{sib} F^{sib,l}_{r_{ij}}(r) + \mathbf{1}_{cop} F^{cop,l}_{r_{ij}}(r) + \mathbf{1}_{gp} F^{gp,l}_{r_{ij}}(r)\Big),$$
where the indicator $\mathbf{1}_z$, $z \in \{ter, sib, cop, gp\}$, denotes whether the factor type $z$ exists in the graph.
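To make the update schedule concrete, here is a toy mean-field loop for a single ter factor (rel-rel factors and the indicator terms are omitted); the tensor shapes and the factor parameterization are assumptions for illustration only:

```python
import torch

def mfvi(u_s, u_o, u_r, f_ter, n_iter: int = 3):
    """Mean-field updates for one (subject, object, relation) ter factor (sketch).
    u_s, u_o: unary entity scores (|E|,); u_r: unary relation scores (|R|,);
    f_ter: ternary factor scores (|E|, |E|, |R|)."""
    q_s, q_o, q_r = (torch.softmax(x, dim=-1) for x in (u_s, u_o, u_r))
    for _ in range(n_iter):
        # expected factor score under the current marginals of the other nodes
        F_s = torch.einsum("f,r,efr->e", q_o, q_r, f_ter)
        F_o = torch.einsum("e,r,efr->f", q_s, q_r, f_ter)
        F_r = torch.einsum("e,f,efr->r", q_s, q_o, f_ter)
        q_s = torch.softmax(u_s + F_s, dim=-1)
        q_o = torch.softmax(u_o + F_o, dim=-1)
        q_r = torch.softmax(u_r + F_r, dim=-1)
    return q_s, q_o, q_r

q = mfvi(torch.randn(5), torch.randn(5), torch.randn(7), torch.randn(5, 5, 7))
print([x.shape for x in q])
```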

A.6 GCN
Here we introduce the baseline GCN. As in HGERE, we build the graph $G = (V, E)$ with subject, object and relation nodes, $V = V_s \cup V_o \cup V_r$. For each relation node $v^{ij}_r \in V_r$, we build two edges connecting it to its subject node $v^i_s \in V_s$ and its object node $v^j_o \in V_o$ respectively. An example graph is illustrated in Fig. 3.

A.7 Performance with part of the training data

From the main results, we can see that HGERE shows a significantly greater improvement over the Backbone model on the SciERC dataset than on the ACE2005 dataset. We conjecture that one reason is the size of the training data: with more training data, models can learn enough from the large number of samples, reducing the need for higher-order information. We therefore compare HGERE to Backbone with 5% and 10% of the training data on ACE2005 (BERT_B) to see whether higher-order inference is more effective with small training data. From the results shown in Table 7, we can see that the increases in absolute F1 score on the Rel+ metric from Backbone to HGERE are 2.1% and 2.2% with 5% and 10% of the training set respectively, much higher than the 0.6% obtained with the full training set.

A.8 Effect of the number of HGNN layers

From Fig. 4 we can see that using three HGNN layers performs the best, while more layers lead to worse results. We posit that this is because using more HGNN layers suffers from the well-known over-smoothing problem (Cai and Wang, 2020).

A.9 Effect of the aggregation function in message passing
We study the influence of different message aggregation functions. HGERE uses an attention mechanism (attn) to update node representations, while it is also possible to use max-pooling (max) or sum-pooling (sum). Table 8 shows that attn performs best.
Figure 2: Error correction of entity and relation types on the SciERC dataset; panel (a) shows the error correction matrix of HGERE vs. Backbone for entities. Red indicates positive corrections and blue indicates negative corrections. Specifically, positive numbers on the diagonal of the matrix (in red) indicate that HGERE makes more correct predictions compared to Backbone; negative numbers on off-diagonal entries (in red) indicate that HGERE makes fewer wrong predictions compared to Backbone. Numbers in blue indicate the opposite. We do not count the null-null case.

Figure 3: Illustration of an example graph of GCN.

Figure 4: The change of F1 scores with respect to the number of HGNN layers.

Table 1 :
F1 scores and standard deviations on ACE2004, ACE2005 and SciERC. The models marked with ⋆ leverage cross-sentence information. A model with subscript re means we re-evaluate the model with the evaluation method commonly used in other work. Backbone, MFVI and GCN are our baseline models.
Baselines Our baseline models include: i) Backbone, described in Sect. 3.2, which does not contain the higher-order interaction module; ii) GCN, which has a similar architecture to Sun et al. (2019) (see Appendix A.6); and iii) MFVI (see Appendix A.5). We build hypergraphs with different types of hyperedges: ter, cop, sib, gp, tersib, tercop, tergp, tersibcop, tersibgp, tercopgp, and tersibcopgp. The best variants of HGERE are tersibcop on SciERC and ACE2005 (BERT_B), tersib on ACE2005 (ALBERT), and tercop on ACE2004. For MFVI we use the same variants as used in HGERE.

Table 3 :
F1 scores of Backbone and HGERE with and without a pre-trained span pruner on the SciERC and ACE2005 (BERT_B) test sets.

Table 4 :
F1 scores of HGERE with different graph topologies on the SciERC test set.

Table 5 :
Comparison of inference speed (#entities/sec) between HGERE and three baseline models on test sets of SciERC and ACE2005.

Table 7 :
F1 scores of HGERE on the ACE2005 test set when only 5% and 10% of training samples are provided.

Table 8 :
F1 scores of HGERE (the tersibcop variant) with different aggregation functions on the SciERC test set.

SciERC    Ent    Rel    Rel+
max       74.0   54.7   41.4
sum       73.6   54.5   41.5
attn      74.9   55.7   43.6