A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach

Relation extraction has the potential for large-scale knowledge graph construction, but current methods do not consider the qualifier attributes for each relation triplet, such as time, quantity or location. The qualifiers form hyper-relational facts which better capture the rich and complex knowledge graph structure. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). Hence, we propose the task of hyper-relational extraction to extract more specific and complete facts from text. To support the task, we construct HyperRED, a large-scale and general-purpose dataset. Existing models cannot perform hyper-relational extraction as it requires a model to consider the interaction between three entities. Hence, we propose CubeRE, a cube-filling model inspired by table-filling approaches, which explicitly considers the interaction between relation triplets and qualifiers. To improve model scalability and reduce negative class imbalance, we further propose a cube-pruning method. Our experiments show that CubeRE outperforms strong baselines and reveal possible directions for future research. Our code and data are available at github.com/declare-lab/HyperRED.


Introduction
Knowledge acquisition is an open challenge in artificial intelligence research (Lenat, 1995). The standard form of representing the acquired knowledge is a knowledge graph (Hovy et al., 2013), which has broad applications such as question answering (Yih and Ma, 2016; Chia et al., 2020) and search engines (Xiong et al., 2017). Relation extraction (RE) is a task that has the potential for large-scale and automated knowledge graph construction by extracting facts from natural language text. Most relation extraction methods focus on binary relations (Bach and Badaskar, 2007) which consider the relationship between two entities, forming a relation triplet consisting of the head entity, relation and tail entity respectively. However, knowledge graphs commonly contain hyper-relational facts (Guan et al., 2019) which have qualifier attributes for each relational triplet, such as time, quantity, or location. For instance, Wen et al. (2016) found that the Freebase knowledge graph contains hyper-relational facts for 30% of entities. Hence, extracting relation triplets may be an oversimplification of the rich and complex knowledge graph structure. As shown in Figure 1, a relation triplet can be attributed to one or more qualifiers, where a qualifier is composed of a qualifier label and value entity. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by specifying the qualifier of (End Time, 1967), forming the hyper-relational fact (Leonard Parker, Educated At, Harvard University, End Time, 1967).
Hyper-relational facts generally cannot be simplified into the relation triplet format as the qualifiers are attributed to the triplet as a whole and not targeted at a specific entity in the triplet. Furthermore, attempting to decompose the hyper-relational structure to an n-ary format would lose the original triplet information and be incompatible with the knowledge graph schema (Rosso et al., 2020). On the other hand, hyper-relational facts have practical benefits such as improved fact verification (Thorne et al., 2018) and representation learning for knowledge graphs (Galkin et al., 2020). Thus, it is necessary to extract relation triplets together with qualifiers to form hyper-relational facts.
In this work, we propose the task of hyper-relational extraction to jointly extract relation triplets with qualifiers from natural language sentences. To support the task, we contribute a general-purpose and large-scale hyper-relational extraction dataset (HyperRED) which is constructed through distant supervision (Mintz et al., 2009) and partially refined through human annotation. Our dataset differs from previous datasets in two distinct ways: (1) Compared to existing datasets for binary relation extraction (Zhang et al., 2017; Han et al., 2018), HyperRED enables richer information extraction as it contains qualifiers for each relation triplet in the sentence. (2) While datasets for n-ary relation extraction (Jia et al., 2019) are restricted to the biomedical domain, HyperRED covers multiple domains and has a hyper-relational fact structure that is compatible with the knowledge graph schema.
Unfortunately, to the best of our knowledge, there are no existing models for hyper-relational extraction. Currently, a popular end-to-end method for binary relation extraction is to cast it as a table-filling problem (Miwa and Sasaki, 2014). Generally, a two-dimensional table is used to represent the interaction between any two individual words in a sentence. However, hyper-relational extraction requires the model to consider the interactions between two entities in the relation triplet, as well as the value entity for the qualifier. Thus, we extend the table-filling approach to a third dimension, casting it as a cube-filling problem. On the other hand, a naive cube-filling approach faces two issues: (1) Computing the full cube representation is computationally expensive and does not scale well to longer sequence lengths. (2) The full cube will be sparsely labeled with a vast majority of entries as negative samples, causing the model to be biased in learning (Li et al., 2020) and hence underperform.
To tackle these two issues, we propose a simple yet effective cube-pruning technique that filters the cube entries based on words that are more likely to constitute valid entities. Our experiments show that cube-pruning significantly improves the computational efficiency and simultaneously improves the extraction performance by reducing the negative class imbalance. In addition to our cube-filling model which we refer to as CubeRE, we also introduce two strong baseline models which include a two-stage pipeline and a generative sequence-to-sequence (Sutskever et al., 2014) model.
In summary, our main contributions include: (1) We propose the task of hyper-relational extraction to extract richer and more complete facts by jointly extracting each relation triplet with the corresponding qualifiers; (2) To support the task, we provide a large-scale and general-purpose dataset known as HyperRED; (3) As there is no existing model for hyper-relational extraction, we propose a cube-filling model known as CubeRE, which consistently outperforms baseline extraction methods.
2 HyperRED: A Hyper-Relational Extraction Dataset

Our goal is to construct a large-scale and general-purpose dataset for extracting hyper-relational facts from natural language text. However, it is seldom practical to assume an ample amount of high-quality labeled samples in real applications, especially for complex tasks such as information extraction. Hence, we propose a weakly supervised (Craven and Kumlien, 1999) data setting which enables us to collect a larger and more diverse training set than would be otherwise possible. To minimize the effect of noisy samples in evaluation, we then perform human annotation for a portion of the collected data and allocate it as the held-out set. In the following sections, we first introduce the process of collecting the distantly supervised data, followed by the human-annotated data portion.

Distantly Supervised Data Collection
To collect a large and diverse dataset of sentences with hyper-relational facts, we employ distant supervision which falls under the weakly supervised setting. Distant supervision automatically collects a dataset of relational facts by aligning a text corpus with facts from an existing knowledge graph. Similar to Elsahar et al. (2018), we first extract and link entities from the corpus to an existing knowledge graph, and resolve any coreference cases to the previously linked entities. To align hyper-relational facts from the knowledge graph to the text corpus, we detect if the entities that comprise each fact are also present in each sentence. Each sentence with aligned facts is collected as part of the distantly supervised dataset. To ensure that the large-scale text corpus can be well-aligned with the knowledge graph, we perform distant supervision between English Wikipedia and Wikidata (Erxleben et al., 2014), which is the central knowledge graph for Wikipedia. Following Elsahar et al. (2018), we use the introduction sections of Wikipedia articles as the text corpus as they generally contain the most important information.

Entity Extraction and Linking
The distant supervision process relies on matching entities in a sentence with facts from the knowledge graph. To detect and identify the named entities in the articles, we use the DBpedia Spotlight (Mendes et al., 2011) entity linker. For the extraction of temporal and numerical entities, we use the spaCy tool (https://spacy.io).
Coreference Resolution As Wikipedia articles often use pronouns to refer to entities across sentences, it is necessary to resolve such references.We employ the Stanford CoreNLP tool (Manning et al., 2014) for this task.
Hyper-Relational Alignment To extend the distant supervision paradigm to hyper-relational facts, we jointly match based on the entities that comprise each hyper-relational fact. Formally, let $f = (e_{head}, r, e_{tail}, q, e_{value})$ be a possible hyper-relational fact consisting of the head entity, relation, tail entity, qualifier label and value entity, respectively. Given a corpus of text articles, each article contains a set of sentences $\{s_1, ..., s_n\}$, where each sentence $s_i$ has a set of entities $E_i$ that are linked to the knowledge graph. Each hyper-relational fact $f$ in the knowledge graph is aligned to the sentence $s_i$ if the head entity $e_{head}$, tail entity $e_{tail}$ and value entity $e_{value}$ are all linked in the sentence. Hence, we obtain the set of aligned facts for each sentence: $F_i = \{f \mid e_{head} \in E_i,\ e_{tail} \in E_i,\ e_{value} \in E_i\}$. Following Riedel et al. (2010), we remove any sentence that does not contain aligned facts.
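To make the alignment step concrete, the sketch below shows one way the entity-based matching could be implemented. It is a minimal illustration under assumed data structures (a fact as a five-tuple of knowledge-graph identifiers and a per-sentence set of linked entity identifiers), not the exact pipeline used to build HyperRED.

```python
# Minimal sketch of hyper-relational alignment for distant supervision.
# Assumption: entity linking and coreference resolution have already produced,
# for each sentence, the set of knowledge-graph entity IDs linked in it.
from typing import List, Set, Tuple

Fact = Tuple[str, str, str, str, str]  # (e_head, r, e_tail, q, e_value)

def align_facts(kg_facts: List[Fact], linked_entities: Set[str]) -> List[Fact]:
    """Keep a fact only if its head, tail and value entities are all linked in the sentence."""
    aligned = []
    for head, rel, tail, qual, value in kg_facts:
        if head in linked_entities and tail in linked_entities and value in linked_entities:
            aligned.append((head, rel, tail, qual, value))
    return aligned

def build_dataset(sentences: List[Set[str]], kg_facts: List[Fact]) -> List[Tuple[int, List[Fact]]]:
    """Following Riedel et al. (2010), drop sentences without any aligned fact."""
    dataset = []
    for idx, entities in enumerate(sentences):
        facts = align_facts(kg_facts, entities)
        if facts:
            dataset.append((idx, facts))
    return dataset

if __name__ == "__main__":
    kg = [("Leonard Parker", "educated at", "Harvard University", "end time", "1967")]
    sentence_entities = {"Leonard Parker", "Harvard University", "1967"}
    print(build_dataset([sentence_entities], kg))
```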

Human-Annotated Data Collection
Although distant supervision can align a large amount of hyper-relational facts, the process can introduce noise in the dataset due to possible spurious alignments and incompleteness of the knowledge graph (Nickel et al., 2016). However, it is not feasible to completely eliminate such noise from the dataset due to the annotation time and budget constraints. Hence, we select a portion of the distantly supervised data to be manually labeled by human annotators. To provide a solid evaluation setting for future research works, the human-annotated data will be used as the development and testing set. We include the development set in the annotated portion as it is necessary for hyperparameter tuning and model selection.
The goal of the human annotation stage is to identify correct alignments and remove invalid alignments. During the process, the annotators are tasked to review the correctness of each aligned fact, where an aligned fact consists of the sentence $s_i$ and hyper-relational fact $f$. The alignment may be invalid if the relation triplet of the fact is not semantically expressed in the sentence, based on the Wikidata relation meaning. For instance, given the sentence "Prince Koreyasu was the son of Prince Munetaka who was the sixth shogun.", the relation triplet (Prince Koreyasu, Occupation, shogun) is considered invalid as the sentence did not explicitly state if "Prince Koreyasu" became a shogun. Similarly, the alignment may be invalid if the qualifier of the fact is not semantically expressed in the sentence, based on the Wikidata definition of the qualifier label. For example, given the sentence "Robin Johns left Northamptonshire at the end of the 1971 season.", the hyper-relational fact (Robin Johns, member of sports team, Northamptonshire, Start Time, 1971) has an invalid qualifier as the qualifier label should be changed to "End Time". Hence, the annotation is posed as a multi-class classification over each alignment with three classes: "correct", "invalid triplet" or "invalid qualifier". Appendix A has the annotation guide and data samples. Each alignment sample is annotated by two professional annotators working independently. There are 6780 sentences annotated in total and the inter-annotator agreement is measured using Cohen's kappa with a value of 0.56. The kappa value is comparable with previous relation extraction datasets (Zhang et al., 2017), demonstrating that the annotations are of reasonably high quality. For each sample with disagreement, a third annotator is brought in to judge the final result. We observe that 76% of samples are annotated as "correct", which indicates a reasonable level of accuracy in the distantly supervised data. To reduce the long-tailed class imbalance (Zhang et al., 2019), we use a filter to ensure that all relation and qualifier labels have at least ten occurrences in the dataset. Although it can be more realistic to include challenging samples such as long-tailed class samples or negative samples in the dataset, we aim to address such challenges in a future dataset version release.
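As a small illustration of the agreement measure used above, the snippet below computes Cohen's kappa over the three annotation classes. The annotator labels shown are made-up toy data; only the class set comes from the paper.

```python
# Sketch of the inter-annotator agreement computation, assuming the two annotators'
# labels are stored as parallel lists over the same alignment samples (toy data only).
from sklearn.metrics import cohen_kappa_score

LABELS = ["correct", "invalid triplet", "invalid qualifier"]

annotator_a = ["correct", "correct", "invalid triplet", "correct", "invalid qualifier"]
annotator_b = ["correct", "invalid qualifier", "invalid triplet", "correct", "correct"]

kappa = cohen_kappa_score(annotator_a, annotator_b, labels=LABELS)
print(f"Cohen's kappa: {kappa:.2f}")  # the paper reports 0.56 on the full annotation set
```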

Data Analysis
To provide a better understanding of the HyperRED dataset, we analyze several aspects of the dataset.
Qualifier Typology The qualifiers of the hyper-relational facts can be grouped into several broad categories as shown in Table 1. Notably, the majority of the qualifiers fall under the "Time" category, as it can be considered a fundamental attribute of many facts. The remaining qualifiers are distributed among the "Quantity", "Role", "Part-Whole" and "Location" categories. Hence, the HyperRED dataset is able to support a diverse typology of hyper-relational facts.

Size and Coverage
The statistics of HyperRED are shown in Table 2. We find that in terms of size and number of relation types, HyperRED is comparable to existing sentence-level datasets, such as TACRED (Zhang et al., 2017), NYT24 and NYT29 (Nayak and Ng, 2020). Table 1 also demonstrates that HyperRED can serve as a general-purpose dataset, covering several domains such as business, sports and politics. Appendix C has more details.
3 CubeRE: A Cube-Filling Approach

3.1 Task Formulation

Hyper-Relational Extraction Given an input sentence of $n$ words $s = \{x_1, x_2, ..., x_n\}$, an entity $e$ is a consecutive span of words where $e = \{x_i, x_{i+1}, ..., x_j\}$, $i, j \in \{1, ..., n\}$. For each sentence $s$, the output of a hyper-relational extraction model is a set of facts where each fact consists of a relation triplet with an attributed qualifier. A relation triplet consists of the relation $r \in R$ between head entity $e_{head}$ and tail entity $e_{tail}$, where $R$ is the predefined set of relation labels. The qualifier is an attribute of the relation triplet and is composed of the qualifier label $q \in Q$ and the value entity $e_{value}$, where $Q$ is the predefined set of qualifier labels. Hence, a hyper-relational fact has five components: $(e_{head}, r, e_{tail}, q, e_{value})$.
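The sketch below restates the task input and output as a simple data structure. The field names and the inclusive word-index convention for entity spans are illustrative assumptions, not the released dataset format.

```python
# Minimal sketch of the hyper-relational extraction input/output structure.
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive word indices (i, j) of an entity span

@dataclass
class HyperRelationalFact:
    head: Span      # e_head
    relation: str   # r, from the predefined relation label set R
    tail: Span      # e_tail
    qualifier: str  # q, from the predefined qualifier label set Q
    value: Span     # e_value

def extract(sentence: List[str]) -> List[HyperRelationalFact]:
    """A hyper-relational extraction model maps a word sequence to a set of facts."""
    raise NotImplementedError

# "Leonard Parker received his PhD from Harvard University ..."
example = HyperRelationalFact(head=(0, 1), relation="educated at",
                              tail=(6, 7), qualifier="academic degree", value=(4, 4))
```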
Cube-Filling Inspired by table-filling approaches which can naturally perform binary relation extraction in an end-to-end fashion, we cast hyper-relational extraction as a cube-filling problem, as shown in Figure 2. The cube contains multiple planes where the front-most plane is a two-dimensional table containing the entity and relation label information, while the following planes contain the corresponding qualifier information. Each entry on the table diagonal represents a possible entity, while each entry outside the table diagonal represents a possible relation triplet. For example, the entry "Educated At" represents a relation between the head entity "Parker" and the tail entity "Harvard". Each table entry $y^t_{ij}$ can contain the null label $\perp$, an entity or relation label, i.e., $y^t_{ij} \in Y^t = \{\perp, \text{Entity}\} \cup R$. The following planes in the cube represent the qualifier dimension, where each entry represents a possible qualifier label and value entity word for the corresponding relation triplet. For instance, the entry "Academic Degree" in the qualifier plane for "PhD" corresponds to the relation triplet (Parker, Educated At, Harvard), hence forming the hyper-relational fact (Parker, Educated At, Harvard, Academic Degree, PhD). Each qualifier entry $y^q_{ijk}$ can contain the null label $\perp$ or a qualifier label, i.e., $y^q_{ijk} \in Y^q = \{\perp\} \cup Q$. Note that the cube-filling formulation also supports hyper-relational facts that share the same relation triplet, as the different qualifiers can occupy separate planes in the qualifier dimension and still correspond to the same relation triplet entry.
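To make the labeling scheme concrete, the sketch below shows one way gold facts could populate the entity-relation table $y^t$ and the qualifier cube $y^q$. The label vocabularies and the single-word entity simplification are assumptions for brevity; the actual dataset contains multi-word spans.

```python
# Sketch of filling the n x n entity-relation table and the n x n x n qualifier cube
# from gold facts, following the cube-filling scheme described above.
import numpy as np

NULL, ENTITY = 0, 1                      # shared null label and the Entity label
REL = {"educated at": 2}                 # relation labels take the remaining table ids
QUAL = {"academic degree": 1}            # qualifier labels (0 is the null label)

def build_cube(n, facts):
    """facts: list of (head_idx, relation, tail_idx, qualifier, value_idx) word indices."""
    table = np.full((n, n), NULL, dtype=np.int64)
    cube = np.full((n, n, n), NULL, dtype=np.int64)
    for h, r, t, q, v in facts:
        table[h, h] = ENTITY             # diagonal entries mark entity words
        table[t, t] = ENTITY
        table[v, v] = ENTITY
        table[h, t] = REL[r]             # off-diagonal entry marks the relation triplet
        cube[h, t, v] = QUAL[q]          # qualifier plane indexed by the value word
    return table, cube

# "Parker received his PhD from Harvard ." -> (Parker, educated at, Harvard, academic degree, PhD)
table, cube = build_cube(7, [(0, "educated at", 5, "academic degree", 3)])
```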

Model Architecture
Our model, known as CubeRE, first encodes each input sentence using a language model encoder to obtain the contextualized sequence representation. We then capture the interaction between each possible head and tail entity as a pair representation for predicting the entity-relation label scores. To reduce the computational cost, each sentence is pruned to retain only words that have higher entity scores. Finally, we capture the interaction between each possible relation triplet and qualifier to predict the qualifier label scores and decode the outputs.

Sentence Encoding
To encode a contextualized representation for each word in a sentence $s$, we use the pre-trained BERT (Devlin et al., 2019) language model:
$[h_1, ..., h_n] = \mathrm{BERT}([x_1, ..., x_n]) \quad (1)$
where $h_i$ denotes the contextualized representation of the $i$-th word in the sentence.

Entity-Relation Representation
To capture the interaction between head and tail entities, we concatenate each possible pair of word representations and project with a dimension-reducing feed-forward network (FFN):
$z_{ij} = \mathrm{FFN}([h_i; h_j]) \quad (2)$
Thus, we construct the table of categorical probabilities over entity and relation labels by applying an FFN and softmax over the pair representation:
$\hat{y}^t_{ij} = \mathrm{softmax}(\mathrm{FFN}(z_{ij})) \quad (3)$
where $\hat{y}^t_{ij}$ denotes the predicted table entry corresponding to the relation between the $i$-th possible head entity word and $j$-th possible tail entity word. Note that we use the concatenation operation in Equation 2 instead of the averaging operation or other representation methods (Baldini Soares et al., 2019) as the concatenation operation is simple and shown to be effective in recent RE works (Wang et al., 2021a; Wang and Lu, 2020).
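A minimal sketch of this table scorer is shown below. The BERT encoder is replaced with random contextual vectors so the snippet runs standalone, and the hidden size, pair size, label count and layer names are illustrative assumptions rather than the released CubeRE configuration.

```python
# Sketch of the entity-relation table scorer (Equations 2-3).
import torch
import torch.nn as nn

class EntityRelationTable(nn.Module):
    def __init__(self, hidden_size=768, pair_size=256, num_table_labels=68):
        super().__init__()
        self.pair_ffn = nn.Sequential(nn.Linear(2 * hidden_size, pair_size), nn.GELU())
        self.table_ffn = nn.Linear(pair_size, num_table_labels)  # {null, Entity} ∪ R

    def forward(self, h):                                 # h: (batch, n, hidden_size) from BERT
        n = h.size(1)
        head = h.unsqueeze(2).expand(-1, -1, n, -1)       # i-th word as possible head
        tail = h.unsqueeze(1).expand(-1, n, -1, -1)       # j-th word as possible tail
        pair = self.pair_ffn(torch.cat([head, tail], dim=-1))   # z_ij, Eq. (2)
        table_probs = self.table_ffn(pair).softmax(dim=-1)      # \hat{y}^t_ij, Eq. (3)
        return pair, table_probs

h = torch.randn(1, 10, 768)                 # stand-in for the BERT outputs h_1..h_n
pair_rep, table_probs = EntityRelationTable()(h)
print(pair_rep.shape, table_probs.shape)    # (1, 10, 10, 256), (1, 10, 10, 68)
```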

Cube-Pruning
To predict the qualifier of a hyper-relational fact, the model needs to consider the interaction between each possible relation triplet and value entity, where the relation triplet contains a head entity and a tail entity. For a sentence with $n$ words, there are $n^3$ interactions, which does not scale well for longer input sequences. Hence, we propose a cube-pruning method to consider only interactions between the top $m$ words in terms of entity score. Consequently, the model will only consider the interaction between the top-$m$ most probable words of the potential head entities, tail entities, and value entities respectively. This reduces the number of interactions to $m^3$, where $m$ is a fixed hyperparameter.
The cube-pruning method also has the benefit of alleviating the negative class imbalance by reducing the proportion of entries with the null label, and we analyze this effect in Section 5.1. To detect the most probable entity words, we obtain the respective entity scores from the diagonal of the table $\hat{y}^t$ containing the entity and relation scores (i.e., the front-most plane in Figure 2):
$s^e_i = \hat{y}^t_{ii}[\text{Entity}] \quad (4)$
The entity scores are then ranked to obtain the pruned indices $\{1, ..., m\}$ which will be applied to each dimension of the cube representation.
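The following sketch shows one way the top-m selection could be done: take the Entity probability from the table diagonal and keep only the highest-scoring word indices. The Entity label id and the value of m are illustrative assumptions.

```python
# Sketch of cube-pruning: rank words by their "Entity" probability on the table
# diagonal (Eq. 4) and keep only the top-m indices.
import torch

def prune_indices(table_probs, entity_label_id=1, m=20):
    """table_probs: (batch, n, n, num_labels) softmax scores from the table scorer."""
    n = table_probs.size(1)
    diagonal = torch.arange(n)
    entity_scores = table_probs[:, diagonal, diagonal, entity_label_id]  # (batch, n)
    m = min(m, n)
    topk = entity_scores.topk(m, dim=-1).indices        # pruned indices per sentence
    return topk.sort(dim=-1).values                     # keep the original word order

table_probs = torch.rand(1, 30, 30, 68).softmax(dim=-1)
print(prune_indices(table_probs, m=20).shape)           # (1, 20)
```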
To capture the hyper-relational structure between relation triplets and qualifier attributes, we use a bilinear interaction layer between each possible pair representation and word representation. The categorical probability distribution over qualifier labels for each possible relation triplet and value entity is then computed as:
$\hat{y}^q_{i'j'k'} = \mathrm{softmax}(z_{i'j'}^{\top} U h_{k'}) \quad (5)$
where $i', j', k' \in \{1, ..., m\}$ are the pruned indices and $U$ is a trainable bilinear weight matrix.
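A sketch of this qualifier scorer is given below, using torch.nn.Bilinear as the trainable bilinear weight and gathering only the pruned indices. The dimensions and the number of qualifier labels are illustrative assumptions.

```python
# Sketch of the qualifier scorer (Eq. 5): a bilinear interaction between each pruned
# pair representation z_{i'j'} and each pruned word representation h_{k'}.
import torch
import torch.nn as nn

class QualifierScorer(nn.Module):
    def __init__(self, pair_size=256, hidden_size=768, num_qualifier_labels=45):
        super().__init__()
        self.bilinear = nn.Bilinear(pair_size, hidden_size, num_qualifier_labels)

    def forward(self, pair, h, keep):         # pair: (b, n, n, dp), h: (b, n, dh), keep: (b, m)
        b, m = keep.shape
        rows = keep.unsqueeze(-1).expand(b, m, m)
        cols = keep.unsqueeze(1).expand(b, m, m)
        pruned_pair = pair[torch.arange(b)[:, None, None], rows, cols]   # (b, m, m, dp)
        pruned_h = h[torch.arange(b)[:, None], keep]                     # (b, m, dh)
        pair_exp = pruned_pair.unsqueeze(3).expand(b, m, m, m, -1)
        word_exp = pruned_h[:, None, None, :, :].expand(b, m, m, m, -1)
        logits = self.bilinear(pair_exp.reshape(-1, pair_exp.size(-1)),
                               word_exp.reshape(-1, word_exp.size(-1)))
        return logits.view(b, m, m, m, -1).softmax(dim=-1)               # \hat{y}^q

pair = torch.randn(1, 30, 30, 256)
h = torch.randn(1, 30, 768)
keep = torch.arange(20).unsqueeze(0)          # pruned indices from cube-pruning
print(QualifierScorer()(pair, h, keep).shape)  # (1, 20, 20, 20, 45)
```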

Training Objective
The training objective for the entity-relation table is computed using the negative log-likelihood:
$\mathcal{L}_{table} = -\sum_{i=1}^{n}\sum_{j=1}^{n} \log \hat{y}^t_{ij}[y^t_{ij}] \quad (6)$
The training objective for the qualifier dimension is computed using the negative log-likelihood:
$\mathcal{L}_{qualifier} = -\sum_{i'=1}^{m}\sum_{j'=1}^{m}\sum_{k'=1}^{m} \log \hat{y}^q_{i'j'k'}[y^q_{i'j'k'}] \quad (7)$
To enable end-to-end training, the overall cube-filling objective is aggregated as the sum of losses:
$\mathcal{L} = \mathcal{L}_{table} + \mathcal{L}_{qualifier} \quad (8)$
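The sketch below expresses the summed objective with cross-entropy over logits, which is equivalent to the negative log-likelihood of the softmax probabilities above; the tensor shapes follow the earlier sketches and are assumptions.

```python
# Sketch of the cube-filling training objective: one negative log-likelihood term
# for the entity-relation table and one for the qualifier cube, summed for
# end-to-end training.
import torch
import torch.nn.functional as F

def cube_filling_loss(table_logits, table_gold, cube_logits, cube_gold):
    """table_logits: (b, n, n, |Y^t|), table_gold: (b, n, n);
    cube_logits: (b, m, m, m, |Y^q|), cube_gold: (b, m, m, m)."""
    loss_table = F.cross_entropy(table_logits.flatten(0, 2), table_gold.flatten())
    loss_cube = F.cross_entropy(cube_logits.flatten(0, 3), cube_gold.flatten())
    return loss_table + loss_cube

table_logits = torch.randn(1, 10, 10, 68)
table_gold = torch.zeros(1, 10, 10, dtype=torch.long)
cube_logits = torch.randn(1, 5, 5, 5, 45)
cube_gold = torch.zeros(1, 5, 5, 5, dtype=torch.long)
print(cube_filling_loss(table_logits, table_gold, cube_logits, cube_gold))
```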

Decoding
To decode the hyper-relational facts from the predicted scores, we implement a simple and efficient method and provide the pseudocode in Appendix D.
As it is intractable to consider all possible solutions, a slight drop in decoding accuracy is acceptable.
A key intuition is that if a valid qualifier exists, this indicates that a corresponding relation triplet also exists. Hence, we first decode the qualifier scores (Equation 5) to determine the span positions of the head entity, tail entity and value entity in each hyper-relational fact. Consequently, we can determine the relation and qualifier label from the corresponding entries in the relation scores (Equation 3) and qualifier scores respectively.
To handle entities that may contain multiple words, we consider adjacent non-null qualifier entries to correspond to the same head entity, tail entity, and value entity, hence belonging to the same hyper-relational fact. This assumption holds true for 97.14% of facts in the dataset. To find and merge the adjacent non-null entries, we use the nonzero operation which is more computationally efficient compared to nested for-loops. For each group of adjacent entries that correspond to the same hyper-relational fact, we determine the relation label by averaging the corresponding relation scores. Similarly, we determine the qualifier label by averaging the corresponding qualifier scores. When using cube-pruning, we map the pruned indices back to the original indices before decoding. Appendix E has the model speed comparison.

Experimental Settings
Evaluation Similar to other information extraction tasks, we use the Micro F1 metric for evaluation on the development and test set. For a predicted hyper-relational fact to be considered correct, the whole fact $f = (e_{head}, r, e_{tail}, q, e_{value})$ must match the ground-truth fact in terms of relation label, qualifier label and entity bounds.
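The exact-match scoring described above can be sketched as below; the five-tuple representation of a fact is an assumption about how predictions and gold facts are compared, consistent with the criterion stated in the paper.

```python
# Sketch of exact-match micro-F1: a predicted fact counts as correct only if the
# whole five-tuple (entity bounds, relation label, qualifier label) matches a gold fact.
def micro_f1(pred_facts, gold_facts):
    """pred_facts / gold_facts: one set of hashable five-tuples per sentence."""
    num_pred = sum(len(p) for p in pred_facts)
    num_gold = sum(len(g) for g in gold_facts)
    num_correct = sum(len(p & g) for p, g in zip(pred_facts, gold_facts))
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{((0, 1), "educated at", (6, 7), "end time", (9, 9))}]
pred = [{((0, 1), "educated at", (6, 7), "end time", (9, 9))}]
print(micro_f1(pred, gold))  # (1.0, 1.0, 1.0)
```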
Hyperparameters For the encoding module, we use the BERT language model, specifically the uncased base and large versions. We train for 30 epochs with a linear warmup for 20% of training steps and a maximum learning rate of 5e-5. We employ AdamW as the optimizer and use a batch size of 32. For model selection and hyperparameter selection, we evaluate based on the F1 on the development set. We use m = 20 for cube-pruning and Appendix B has more experimental details.

Baseline Methods
As there are no existing models for hyper-relational extraction, we introduce two strong baselines that leverage pretrained language models. The pipeline baseline is based on a competitive table-filling model for joint entity and relation extraction, while the generative baseline is extended from a state-of-the-art approach for end-to-end relation extraction.
Pipeline Baseline As pipeline methods can serve as strong baselines for information extraction tasks (Zhong and Chen, 2021), we implement a pipeline method for hyper-relational extraction. Concretely, we first train a competitive relation extraction model architecture, UniRE (Wang et al., 2021a), to extract relation triplets from each input sentence. Separately, we train a span extraction model based on BERT-Tagger (Devlin et al., 2019) that is conditioned on the input sentence and a relation triplet to extract the value entities and corresponding qualifier label. However, as both stages fine-tune a pretrained language model, the pipeline method doubles the number of trainable parameters compared to an end-to-end method which only fine-tunes one pretrained language model. To avoid an unfair comparison as larger models are more sample-efficient (Kaplan et al., 2020), we use DistilBERT (Sanh et al., 2019) in both stages of the pipeline.
Generative Baseline Inspired by the flexibility of language models for complex tasks such as information extraction and controllable structure generation (Shen et al., 2022), we propose a generative method for hyper-relational extraction. Compared to a pipeline method, a generative method can perform hyper-relational extraction in an end-to-end fashion without task-specific modules (Paolini et al., 2021). Similar to existing generative methods for relation extraction (Huguet Cabot and Navigli, 2021; Chia et al., 2022), we use BART (Lewis et al., 2020) which takes the sentence as input and outputs a structured text sequence that is then decoded to form the extracted facts. For instance, given the sentence "Parker received his PhD from Harvard.", the sequence-to-sequence model is trained to generate "Head Entity: Parker, Relation: educated at, Tail Entity: Harvard, Qualifier: academic degree, Value: PhD." The generated text is then decoded through simple text processing to form the hyper-relational fact (Parker, Educated At, Harvard, Academic Degree, PhD).
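The sketch below shows the linearization format quoted above together with a simple regex-based decoder. The exact separators and the decoding regex are assumptions; the paper only specifies the general "Head Entity: ..., Relation: ..." style and mentions regex-based post-processing.

```python
# Sketch of the structured text format for the generative baseline and its decoder.
import re

def linearize(fact):
    head, rel, tail, qual, value = fact
    return (f"Head Entity: {head}, Relation: {rel}, Tail Entity: {tail}, "
            f"Qualifier: {qual}, Value: {value}.")

PATTERN = re.compile(
    r"Head Entity: (.*?), Relation: (.*?), Tail Entity: (.*?), "
    r"Qualifier: (.*?), Value: (.*?)\.")

def decode(generated_text):
    # multiple facts can appear concatenated in one output sequence
    return [tuple(m.groups()) for m in PATTERN.finditer(generated_text)]

fact = ("Parker", "educated at", "Harvard", "academic degree", "PhD")
text = linearize(fact)
assert decode(text) == [fact]
print(text)
```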

Main Results
We compare CubeRE with the baseline models and report the precision, recall, and F1 scores with standard deviation in Table 3. The results demonstrate the general effectiveness of our model, as CubeRE has consistently higher F1 scores in both the base and large model settings. While the pipeline baseline relies on a two-stage approach that is prone to error propagation, CubeRE can perform hyper-relational extraction in an end-to-end fashion. Hence, CubeRE is able to detect more valid hyper-relational facts, which is demonstrated by the higher recall and F1 scores. Compared to the generative baseline, our cube-filling approach is able to explicitly consider the interaction between relation triplets and qualifiers to better extract hyper-relational facts. Furthermore, we argue that CubeRE is more interpretable than the generative baseline as it can compute the score for each possible relation triplet and qualifier. Hence, CubeRE can also be more controllable as it is possible to control the number of predicted facts by applying a threshold to the triplet and qualifier scores.

Triplet-Based Evaluation
To further investigate the differences in model performance, we also report the results when considering only the triplet component of hyper-relational facts in Table 4.

Analysis
In this section, we study the effect of cube-pruning and identify directions for future research. Further analysis is shown in Appendix F.

Effect of Pruning
In addition to improving the computational efficiency of CubeRE as discussed in Section 3.2.3, our cube-pruning method may also improve the extraction performance of the model. During training, the cube-filling approach faces the issue of having mostly null entries, thus biasing the learning process with negative class imbalance (Li et al., 2020). By pruning the cube to consider only the entries associated with higher entity scores, the proportion of null entries is reduced, hence alleviating the class imbalance issue. This is supported by the trend in Figure 3, as relaxing the pruning threshold m leads to reduced F1 scores. On the other hand, overly strict pruning will reduce the recall, negatively affecting the overall performance.

Model Performance Breakdown
To identify directions for future research in hyper-relational extraction, we analyze the model performance separately for each general qualifier category. As shown in Table 4, there is a variance in model performance across qualifier categories that cannot be fully explained by their proportion in the dataset. For instance, although the "Time" category comprises a majority of the qualifiers, it does not have the highest performance. This suggests that future research may focus on areas such as temporal reasoning, which is an open challenge for language models (Vashishtha et al., 2020; Dhingra et al., 2022). In addition, CubeRE demonstrates strong performance across all categories, which suggests that it can serve as a general extraction model for different qualifiers.

Related Work
Knowledge Graph Construction In addition to extraction from natural language text, the underlying facts for knowledge graphs can also be extracted from semi-structured websites (Lockard et al., 2018), tables (Dong et al., 2020) or link prediction (Wang et al., 2017). However, textual extraction may be a more pressing challenge due to the vast amount of unstructured textual data on the web (Lockard et al., 2020). Hence, this work focuses on extracting facts from unstructured text (Bing et al., 2015). A possible future direction is to adapt CubeRE for extracting other types of information such as attributes (Bing et al., 2013), events (Wang et al., 2021b), arguments (Cheng et al., 2020, 2022), aspect-based sentiment (Xu et al., 2021; Yu Bai Jian et al., 2021), commonsense knowledge (Ghosal et al., 2021), or visual scene relations (Andrews et al., 2019). Additionally, as HyperRED relies on distant supervision for dataset construction, it is necessary to further explore how to mitigate the noise in distantly supervised datasets for information extraction tasks (Nayak et al., 2021).

Table-Filling
Data Limitations Regarding the HyperRED dataset, the distant supervision method of data collection may not align all valid facts present in the text articles. This is due to the possible incompleteness of the knowledge graph, which is an open research challenge (Nickel et al., 2016). On the other hand, it is not feasible to manually annotate all possible facts due to constraints in annotation time and cost. Furthermore, there are a large number of relation and qualifier labels to consider, resulting in a challenging task for human annotators. A promising and practical method to address the challenges in distant supervision is to adopt a human-in-the-loop annotation scheme for RE (Tan et al., 2022b).
The annotation scheme can increase the number of facts in a dataset by training an RE model to predict more candidate facts for each text article, which are then reviewed and filtered by humans. However, this model-assisted annotation approach is not applicable to the construction of HyperRED as it relies on existing strong RE models, and no suitable models for hyper-relational extraction existed prior to this work.

Ethics Statement
Model Ethics Regarding model generalization, we expect that the models introduced should perform similarly for factual text articles such as news articles from various domains, similar to the proposed dataset. However, they may not perform well for more casual text formats such as chat discussions or opinion pieces. On the other hand, we note that the models extract hyper-relational facts from the input sentences and do not guarantee the factual correctness of the extracted facts. This is an ethical consideration of RE models in general, and further fact verification (Nie et al., 2019) modules are necessary before the facts can be integrated into knowledge graphs or downstream applications.
Data Ethics For the dataset construction, we collect texts and facts from Wikipedia and Wikidata respectively, which is a common practice for distantly supervised datasets. Wikidata facts are under the public domain while Wikipedia texts are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Hence, we are free to adapt the texts to construct our dataset, which will also be released under the same license.
For the human data annotation stage, we employ two professional data annotators, and they have been fairly compensated. The compensation is negotiated based on the task complexity and assessment of the reasonable annotation speed. Based on the agreed annotation scheme, each annotation batch is required to undergo quality checking where a portion of samples are manually checked. If any batch does not meet the acceptance criteria of 95% accuracy, the annotators are required to fix the errors before the batch can be accepted.

A Annotation Guide
This section explains the guideline for human annotators. The task is a classification of whether each hyper-relational fact can be reasonably extracted from a piece of text. Each annotation sample contains one sentence and one corresponding fact for judgment. The annotator should classify each sample as "Correct", "Invalid Triplet" or "Invalid Qualifier". Each hyper-relational fact has five components with the format (head entity, relation label, tail entity, qualifier label, value entity). The head entity is the main subject entity of the relationship. The relation label is the category of relationship that is expressed between the head and tail entity. The tail entity is the object entity of the relationship that is paired with the head entity. The qualifier label is the category of the qualifier information. The value entity is the corresponding value of the qualifier that is applied to the relation triplet (head, relation, tail). The value entity can contain a date, quantity, or short piece of text which is the mentioned name of the entity. For the annotation objective, we want to know whether this piece of information is clearly expressed by the given text. All the entities, relations, and qualifiers exist in the Wikidata database, so annotators can refer to the relation or qualifier definition at https://www.wikidata.org for clarification. The annotation steps are as follows:

1. Read and understand the text sample, which is a continuous sequence of words. Then, consider the corresponding hyper-relational fact.

2. If the relation triplet (head entity, relation label, tail entity) is not clearly expressed in the text, then the fact should be marked as "Invalid Triplet".

3. If the qualifier (qualifier label, value entity) is not clearly expressed in the text or is not related to the triplet, then the fact should be marked as "Invalid Qualifier".
4. If there is no error in the fact, then it can be marked as "Correct".
For example, given the sentence "The film's story earned Leonard Spigelgass a nomination as Best Story for the 23rd Academy Awards.", the fact (Leonard Spigelgass, nominated for, Best Story, statement is subject of, 23rd Academy Awards) is correct as Leonard was nominated and the main topic is the Academy Awards. However, given the sentence "Prince Koreyasu was the son of Prince Munetaka who was the sixth shogun.", the fact (Prince Koreyasu, occupation, shogun, replaces, Prince Munetaka) has an invalid triplet as we don't know if Koreyasu became a shogun. On the other hand, given the sentence "Robin Johns left Northamptonshire at the end of the 1971 season.", the fact (Robin Johns, member of sports team, Northamptonshire, Start Time, 1971) has an invalid qualifier as the qualifier label should be "End Time" instead of "Start Time".

B Experiment Details
Hyperparameters Table 8 shows the details of our experimental setup and model hyperparameters. For the analysis experiments in Section 5, we use the BERT-Base version of CubeRE and report the F1 metric score on the development set of HyperRED unless otherwise stated in the specific subsection.
Pipeline Baseline Details For the pipeline baseline, we use DistilBERT as the language model encoder for both the triplet extraction and conditional qualifier extraction stages. Both stages of the pipeline are fine-tuned separately on the gold labels. At inference time, the triplet extraction stage takes the sentence as input and outputs the predicted relation triplets. For each predicted relation triplet, the conditional qualifier extractor takes the sentence and the relation triplet as input to predict the possible qualifiers, where each qualifier consists of the qualifier label and value entity. The input of the qualifier extraction model is the concatenated sentence and relation triplet. For example, the sentence "Leonard Parker received his PhD from Harvard University in 1967." and relation triplet (Leonard Parker, Educated At, Harvard University) will be concatenated to become "Leonard Parker received his PhD from Harvard University in 1967. Leonard Parker | Educated At | Harvard University". The outputs of both stages are then merged to form the predicted hyper-relational facts. Following BERT-Tagger, the conditional qualifier extraction model is trained using the cross-entropy loss for sequence labeling. To encode the qualifier information as sequence labels, we use the BIO tagging scheme where the sequence label corresponds to the possible qualifier label for each entity word. For both stages, which are trained separately, we use the same epochs, learning rate and batch size as the CubeRE model for fairness.
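A minimal sketch of the second-stage input construction and the BIO label encoding described above is shown below. The separator string and tag format follow the example in the text, but the exact implementation details are assumptions.

```python
# Sketch of the pipeline baseline's conditional qualifier extraction inputs/labels.
def build_qualifier_input(sentence, triplet):
    """Concatenate the sentence with one predicted relation triplet."""
    head, relation, tail = triplet
    return f"{sentence} {head} | {relation} | {tail}"

def bio_tags(words, value_span, qualifier_label):
    """Encode one gold qualifier as BIO sequence labels over the sentence words."""
    tags = ["O"] * len(words)
    start, end = value_span
    tags[start] = f"B-{qualifier_label}"
    for k in range(start + 1, end + 1):
        tags[k] = f"I-{qualifier_label}"
    return tags

sentence = "Leonard Parker received his PhD from Harvard University in 1967."
triplet = ("Leonard Parker", "Educated At", "Harvard University")
print(build_qualifier_input(sentence, triplet))
print(bio_tags(sentence.split(), (9, 9), "end time"))
```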

Generative Baseline Details
The generative baseline model can predict hyper-relational facts by learning to generate a text sequence with a special structured format, as demonstrated in Section 4.2. Note that if the sentence contains multiple hyper-relational facts, the desired output sequence is simply the concatenated text sequence of the structured text for each fact. The multiple facts can be easily decoded from the structured text format with simple text processing such as regex. As the input and output of the model are text sequences which do not violate the model vocabulary, the generative baseline can be trained using a standard sequence-to-sequence modeling objective. For training, we use the same epochs, learning rate and batch size as the CubeRE model for fairness.

C Dataset Details
Dataset Statistics Table 5 shows the detailed statistics of HyperRED, such as the number of unique facts and entities, as well as the average number of words in each sentence. Table 9 and Table 10 show the set of relation and qualifier labels respectively.

Annotation Challenges The human annotation of the dataset may be imperfect due to the complexity of the hyper-relational fact structure, the diversity of relation and qualifier labels, and possible ambiguous facts. The hyper-relational facts require annotators to jointly consider the relation triplet and qualifier, which is more challenging compared to previous datasets which commonly consider the relation between two entities. On the other hand, the annotators are also required to consider the definitions of a large set of relation and qualifier labels. This may be difficult when some relations or qualifiers are similar in meaning. Lastly, there may be ambiguous cases where multiple entities are mentioned in relation to a topic and it is not clear which entity is the main subject.
Relation-Specific Qualifiers To investigate the link between relation triplets and qualifiers, we plot a histogram distribution in Figure 5. A majority (32) of the qualifier labels are each linked to a small number of relation labels (1-5), which suggests that most qualifiers are highly relation-specific. For example, the "electoral district" qualifier label is only linked to the "candidacy in election" and "position held" relation labels. On the other hand, a few (3) qualifier labels are each linked to a large number (16+) of relation labels, and are not specific to any particular relation. For example, the "end time" qualifier is linked to 35 relation labels. Hence, it is generally important to consider the interaction between relation triplets and qualifiers in extracting hyper-relational facts. However, it is not trivial to predict the qualifier based only on the relation, as some qualifier labels are relation-agnostic and it also requires the model to consider the value entity.

D Decoding Algorithm
We include the pseudocode of the proposed decoding method in Algorithm 1 (pseudocode of our decoding algorithm in a PyTorch-like style). Note that we can use the nonzero operation to find and merge adjacent non-null entries as it returns the entries sorted in lexicographic order. This ensures that entries are seen in consecutive order if they correspond to the same hyper-relational fact.
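As a hedged illustration of the decoding idea (not the released Algorithm 1), the sketch below finds non-null qualifier entries with nonzero(), groups lexicographically adjacent entries into one fact, and reads the relation and qualifier labels from the averaged scores. The adjacency check and the omission of a null-relation filter are simplifications.

```python
# Simplified sketch of cube decoding for a single sentence.
import torch

def decode(table_probs, cube_probs, null_id=0):
    """table_probs: (n, n, |Y^t|) and cube_probs: (n, n, n, |Y^q|) for one sentence."""
    entries = (cube_probs.argmax(-1) != null_id).nonzero()  # lexicographically sorted (i, j, k)
    facts, group = [], []
    for idx in entries.tolist():
        # start a new fact when any coordinate jumps by more than one word
        if group and any(abs(a - b) > 1 for a, b in zip(group[-1], idx)):
            facts.append(merge(group, table_probs, cube_probs))
            group = []
        group.append(idx)
    if group:
        facts.append(merge(group, table_probs, cube_probs))
    return facts

def merge(group, table_probs, cube_probs):
    """Average the scores of adjacent entries and take the span bounds of each entity."""
    heads, tails, values = zip(*group)
    relation = table_probs[list(heads), list(tails)].mean(0).argmax().item()
    qualifier = cube_probs[list(heads), list(tails), list(values)].mean(0).argmax().item()
    span = lambda xs: (min(xs), max(xs))
    return span(heads), relation, span(tails), qualifier, span(values)

table_probs = torch.rand(6, 6, 4)
cube_probs = torch.zeros(6, 6, 6, 3)
cube_probs[0, 3, 5, 2] = 1.0   # two adjacent head words form one multi-word head entity
cube_probs[1, 3, 5, 2] = 1.0
print(decode(table_probs, cube_probs))
```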

E Model Costs
Table 7 shows a comparison of total training time, inference speed in samples per second and GPU memory usage for the different models. We observe that CubeRE has a comparable computational cost with the generative and pipeline models. This result shows that our cube-pruning method is effective in ensuring that the model is computationally efficient and practical in real applications. Note that we compute the statistics for the two-stage pipeline model by summing the time taken and memory used by both stages.

F Further Analysis
Additional Pipeline Results For a fair comparison of main results in Section 4.3, we do not include the pipeline baseline in the large model setting as it would have 680M parameters, which is much more than the other models. On the other hand, we also do not include a BERT-Base version of the pipeline baseline in the main results, as it would have 221M parameters which is not comparable to either the base or large model setting. Hence, we only include the pipeline baseline using DistilBERT in the main result discussion as it has a comparable parameter count to the base model setting. However, we include the pipeline baseline with BERT-Base in Table 6 for reference.

Effect of Pruning
The main effect of cube-pruning is to reduce the sparsity of the cube entries by retaining the entries which are most likely to be valid entities. To quantify the effect on sparsity, we measure that the cube without pruning consists of 99.9900% null entries on average. When using pruning threshold m = 20, the cube consists of 99.9098% null entries on average. Hence, there is a roughly tenfold increase in the proportion of non-null entries when using pruning.

Effect of Training Data Size
The HyperRED training set consists of distantly supervised data which enables large-scale and diverse model training. However, there may be noisy samples that affect the model performance. Hence, we aim to study whether the quantity of data can overcome noise in the training set. As shown in Figure 6, we observe a strictly increasing trend when the size of the training set is increased from 20% of the original size to 100% of the original size. Thus, the results suggest that the quantity of data is still a beneficial factor for model performance despite some noise in the distantly supervised training set.

Figure 1: A sample from our HyperRED dataset for the proposed task of hyper-relational extraction.

Figure 2: An example of cube-filling for hyper-relational extraction. The front-most plane is a two-dimensional table that contains entity and relation information. It extends to the third dimension where each plane represents a possible qualifier label and value entity word that corresponds to the relation triplet entry.

Figure 3: The effect of pruning threshold m on Dev F1. The model without pruning is indicated as m = ∞.

Figure 6: The effect of training data size on Dev F1. The training set of HyperRED is distantly supervised, while the development and test set are human-annotated.

Table 1: General typology and distribution of frequent qualifier labels for the HyperRED dataset, shown with example sentences and the corresponding hyper-relational facts. For example, the sentence "The Ohio Senate is the upper house of the Ohio General Assembly, the Ohio state legislature." corresponds to the fact (Ohio, legislative body, Ohio General Assembly, has part, Ohio Senate).

Table 3: Evaluation results on HyperRED.

Table 4: Evaluation results on HyperRED considering only the triplet component of hyper-relational facts.

2 Please refer to Appendix C for the qualifier analysis.

Table 5: Detailed statistics for the HyperRED dataset.

Table 7: Comparison of the computational cost for the Generative, Pipeline and CubeRE models.

Table 8: List of experimental details.

… department, canton or other administrative division of which the municipality is the governmental seat
P1411 nominated for: award nomination received by a person, organisation or creative work (inspired from "award received" (P166))
P1441 present in work: this (fictional or fictionalized) entity or person appears in that work as part of the narration
P1535 used by: item or concept that makes use of the subject (use sub-properties when appropriate)
P1923 participating team: like "participant" (P710) but for teams. For an event like a cycle race or a football match you can use this property to list the teams
P3450 sports season of: property that shows the competition of which the item is a season. Use P5138 for "season of club or team".

Table 9: List of relation labels in HyperRED.

… of which a person is or has been a member or otherwise affiliated
P131 located in the administrative territorial entity: the item is located on the territory of the following administrative entity
P155 follows: immediately prior item in a series of which the subject is a part, preferably use as qualifier of P179
P175 performer: actor, musician, band or other performer associated with this role or musical work
P197 adjacent station: the stations next to this station, sharing the same line(s)
P249 ticker symbol: identifier for a publicly traded share of a particular stock on a particular stock market or that of a cryptocurrency
P276 location: location of the object, structure or event. In the case of an administrative entity as containing item use P131.
P413 position played on team / speciality: position or specialism of a player on a team
P453 character role: specific role played or filled by subject; use only as qualifier of "cast member" (P161) or "voice actor" (P725)
P512 academic degree: academic degree that the person holds
P518 applies to part: part, aspect, or form of the item to which the claim applies
P527 has part: part of this subject; inverse property of "part of" (P361). See also "has parts of the class" (P2670).
P577 publication date: date or point in time when a work was first published or released
P580 start time: time an event starts, an item begins to exist, or a statement becomes valid
P582 end time: time an item ceases to exist or a statement stops being valid
P585 point in time: time and date something took place, existed or a statement was true
P642 of: qualifier stating that a statement applies within the scope of a particular item
P670 street number: number in the street address. To be used as a qualifier of P669 "located on street"
series ordinal: position of an item in its parent series (most frequently a 1-based index), generally to be used as a qualifier
P1686 for work: qualifier of award received (P166) to specify the work that an award was given to the creator for
P1706 together with: qualifier to specify the item that this property is shared with
P2453 nominee: qualifier used with "nominated for" to specify which person or organization was nominated
P2868 subject has role: role or generic identity of the item ("subject"), also in the context of a statement
P3831 object has role: (qualifier) role or generic identity of the value of a statement ("object") in the context of that statement
P3983 sports league level: the level of the sport league in the sport league system
P5051 towards: qualifier for "adjacent station" (P197) to indicate the terminal station(s) of a transportation line or service in that direction

Table 10: List of qualifier labels in HyperRED.