UniEX: An Effective and Efficient Framework for Unified Information Extraction via a Span-extractive Perspective

We propose a new paradigm for universal information extraction (IE) that is compatible with any schema format and applicable to a list of IE tasks, such as named entity recognition, relation extraction, event extraction and sentiment analysis. Our approach casts text-based IE tasks as a token-pair problem, uniformly decomposing all extraction targets into joint span detection, classification and association with a unified extractive framework, namely UniEX. UniEX can synchronously encode schema-based prompts and textual information, and collaboratively learn generalized knowledge from pre-defined information using auto-encoder language models. We develop a triaffine attention mechanism to integrate heterogeneous factors including tasks, labels and inside tokens, and obtain the extraction targets via a scoring matrix. Experimental results show that UniEX outperforms generative universal IE models in terms of both performance and inference speed on 14 benchmark IE datasets under the supervised setting. State-of-the-art performance in low-resource scenarios also verifies the transferability and effectiveness of UniEX.


Introduction
Information extraction (IE) aims at automatically extracting structured information from unstructured textual sources, covering a wide range of subtasks such as named entity recognition, relation extraction, semantic role labeling, and sentiment analysis (Muslea et al., 1999; Grishman, 2019). However, the variety of subtasks builds isolation zones between one another, with each subtask forming its own dedicated models. Figure 1(a) shows that popular IE approaches handle structured extraction by adding task-specific layers on top of pretrained language models (LMs) and subsequently fine-tuning the conjoined model (Lample et al., 2016; Luo et al., 2020; Ye et al., 2022). These isolated architectures prevent enhancements in one task from being applied to another, which hinders the effective sharing of latent semantics such as label names and suffers from inductive bias in transfer learning (Paolini et al., 2020). With powerful capabilities in knowledge sharing and semantic generalization, large-scale LMs bring the opportunity to handle multiple IE tasks within a single framework. As shown in Figure 1(b), by developing sophisticated schema-based prompts and structural generation specifications, IE tasks can be transformed into text-to-text and text-to-structure formats via large-scale generative LMs (Dong et al., 2019; Paolini et al., 2020; Lu et al., 2022) such as T5 (Raffel et al., 2020a). Moreover, universal IE frameworks can learn general knowledge from multi-source prompts, which is beneficial for perceiving unseen content in low-resource scenarios. Despite their success, these generative frameworks suffer from inherent problems that limit their potential and performance in universal modeling.

* Equal Contribution. † Corresponding Author.
Firstly, the schema-based prompt and contextual information are synthetically encoded for generating the target structure, which is not conducive to directly leveraging the position information among different tokens. Secondly, the generative architecture relies on a token-wise decoder to produce the target structure, which is extremely time-consuming.
The aforementioned issues prompt us to rethink the foundation of IE tasks. Fundamentally, we observe that the extraction targets of different IE tasks involve the determination of semantic roles and semantic types, both of which can be converted into span formats by correlating the inside tokens of the passage. For instance, an entity type corresponds to the boundary detection and label classification of a semantic role, while a relation type can be regarded as the semantic association between specific semantic roles. From this perspective, the IE tasks can be decoded using a span-extractive framework and uniformly decomposed into several atomic operations: i) Span Detection, which locates the boundaries of the mentioned semantic roles; ii) Span Classification, which recognizes the semantic types of the semantic roles; iii) Span Association, which establishes and measures the correlation between semantic roles to determine semantic types. Based on the above observation, we propose a new paradigm for universal IE, called the Unified Extraction model (UniEX), as shown in Figure 1(c). Specifically, we first introduce a rule-based transformation to bridge various extraction targets and unified input formats, which leverages task-specific labels with identifiers as the schema-based prompt to learn general IE knowledge. Then, since recent works (Liu et al., 2019a; Yang et al., 2022) state that auto-encoder LMs with bidirectional context representations are more suitable for natural language understanding, we employ BERT-like LMs to construct an extractive architecture for underlying semantic encoding.
Finally, inspired by the successful application of the span decoder and the biaffine network to decoding entities and relations with a scoring matrix (Yu et al., 2020b; Li et al., 2020; Yuan et al., 2022), we introduce a triaffine attention mechanism for structural decoding, which jointly considers high-order interactions among multiple factors, including tasks, labels and inside tokens. Each triaffine scoring matrix is assigned to a demand-specific prompt for obtaining the span-extractive objectives.
Through extensive experiments on several challenging benchmarks of 4 main IE tasks (entity/relation/event/sentiment extraction), we demonstrate that compared with the state-of-the-art universal IE models and task-specific low-resource approaches, our UniEX achieves a substantial improvement in performance and efficiency with supervised, few-shot and zero-shot settings.
Our main contributions are summarized as follows:
• We develop an efficient and effective universal IE paradigm by converting all IE tasks into a joint span detection, classification and association problem.
• We introduce UniEX, a new unified extractive framework that utilizes extractive structures to encode the underlying information and controls schema-based span decoding via the triaffine attention mechanism.
• We apply our approach in low-resource scenarios, and significant performance improvements suggest that our approach has potential for attaching label information to generalized objects and for transfer learning. Our code will be made publicly available.

Related Work
Unified NLP Task Formats
Since prompt-tuning can improve the ability of language models to learn common knowledge and close the gap across different NLP tasks, recent studies show the necessity of unifying all NLP tasks in the format of a natural language response to natural language input (Raffel et al., 2020b; Sanh et al., 2022; Wei et al., 2021). Previous unified frameworks usually cast parts of text problems as question answering (McCann et al., 2018).

Label Information
Label semantics is an important information source, which carries the related meaning induced from the data (Hou et al., 2020; Ma et al., 2022a; Mueller et al., 2022). L-TapNet (Hou et al., 2020) introduces the collapsed dependency transfer mechanism to leverage the semantics of label names for few-shot tagging tasks. LSAP (Mueller et al., 2022) improves the generalization and data efficiency of few-shot text classification by incorporating label semantics into the pre-training and fine-tuning phases of generative LMs. Together, these successful employments of label knowledge in low-resource settings motivate us to introduce label semantics into our unified inputs to handle few-shot and zero-shot scenarios.

Approaches
Generally, there are two main challenges in universally modeling different IE tasks via an extractive architecture. Firstly, IE tasks are usually demand-driven, indicating that each pre-defined schema should correspond to the extraction of specific structural information. Secondly, due to the diversity of IE tasks, we need to resolve appropriate structural formats from the output to accommodate different target structures, such as entity, relation and event. In this section, we outline how UniEX exploits a shared underlying semantic encoder to learn prompt and text knowledge jointly, and conducts various IE tasks in a unified text-to-structure architecture via the triaffine attention mechanism.

Unified Input
Formally, given the task-specific pre-defined schema and texts, a universal IE model needs to adaptively capture the corresponding structural information from the text as indicated by the task-relevant information. To achieve this, we formulate a unified input format consisting of the task-relevant schema and the text, as shown in Figure 2. To promote the sharing of generalized knowledge across different IE tasks, we simply use task-based and label-based schemas as the prompt, rather than elaborate questions, fill-in blanks or structural indicators, and prepend a special token to each schema, using its token representation to symbolize the connotation of the subsequent schema. Consider an input set denoted as (s, x), which includes the following: i) a task-based schema s_d for span detection, ii) label-based schemas s_c for span classification and s_a for span association, and iii) one passage x = {x_1, . . . , x_{N_x}}. The input sequence with N_s = N_sd + N_sc + N_sa schemas and N_x inside tokens can be denoted as:

x_inp = [CLS] s_1 [CLS] s_2 . . . [CLS] s_{N_s} [SEP] x_1 . . . x_{N_x}  (1)
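As a rough illustration of this unified input, the schemas and passage can be assembled into a single token sequence. This is a minimal sketch: the function name, the exact special-token layout and the schema wording below are our own assumptions for exposition, not the paper's verbatim format.

```python
def build_unified_input(task_schema, class_schemas, assoc_schemas, passage_tokens):
    """Assemble schemas and passage into one input sequence.

    Each schema is prefixed with a [CLS] special token whose hidden state
    later serves as that schema's representation (hypothetical layout).
    """
    tokens = []
    for schema in [task_schema] + class_schemas + assoc_schemas:
        tokens.append("[CLS]")
        tokens.extend(schema.split())
    tokens.append("[SEP]")
    tokens.extend(passage_tokens)
    return tokens

# A relation-extraction instance: one task schema, two entity labels
# for span classification, one relation label for span association.
inp = build_unified_input(
    "relation extraction",
    ["person", "location"],
    ["live in"],
    "Betsy Ross lived in Philadelphia .".split(),
)
```

With this layout, the hidden states at the four `[CLS]` positions would later be gathered as the schema representations H_s, and the tokens after `[SEP]` as the text representations H_x.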

Backbone Network
In our UniEX framework, we employ BERT-like LMs as the extractive backbone, such as RoBERTa (Liu et al., 2019b) and ALBERT (Lan et al., 2020), to integrate the bidirectionally modeled input x_inp. Note that the unified input contains multiple labels, resulting in undesired mutual influence across different labels and in a misunderstanding of the correspondence between each label and its structural format during the decoding phase. Meanwhile, in some tasks, the large number of labels allows the schemas to occupy excessive positions, squeezing the space left for the text. Referring to the embedding methods in UniMC (Yang et al., 2022), we address these issues from several perspectives, including the position id and the attention mask. Firstly, to avoid the information interference caused by mutual interaction within label-based schemas, we constantly restart the position id pos to keep intra-label information apart. In this way, the position information of label-relevant tokens is treated coequally based on their position embeddings, and the refreshed position id at the first token of each label-based schema avoids the natural increase of position ids. Then, as shown in Figure 3, owing to the detailed correlation among schema-based prompts in IE tasks, we further introduce a schema-based attention mask matrix M_mask into the self-attention calculation to control the flow of labels, ensuring that unrelated labels are invisible to each other. In particular, different entity, relation and event types are invisible to each other, while relation and event types can attend to their bound entity types.

Figure 2: The overall architecture of UniEX. The sample text comes from CoNLL04 (Roth and Yih, 2004).
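The schema-based attention mask can be sketched at schema granularity as a boolean matrix. In practice the mask is defined over individual tokens; the helper below, including its name and the `bindings` argument, is a hypothetical simplification of the visibility rules just described.

```python
def schema_attention_mask(num_schemas, bindings, num_text):
    """Schema-level sketch of the attention mask M_mask.

    Slot 0 holds the task-based schema, slots 1..num_schemas-1 hold
    label-based schemas, and the remaining slots hold text tokens.
    Label schemas are mutually invisible except for `bindings`, i.e.
    (relation_or_event_idx, bound_entity_idx) pairs.  True = visible.
    """
    n = num_schemas + num_text
    mask = [[True] * n for _ in range(n)]
    # Make the label-based schemas invisible to one another.
    for i in range(1, num_schemas):
        for j in range(1, num_schemas):
            if i != j:
                mask[i][j] = False
    # Re-open attention between relation/event types and bound entity types.
    for r, e in bindings:
        mask[r][e] = mask[e][r] = True
    return mask

# Slots: 0 = task, 1-2 = entity labels, 3 = relation label, 4-8 = text;
# the relation label (slot 3) is bound to entity label 1.
m = schema_attention_mask(num_schemas=4, bindings=[(3, 1)], num_text=5)
```

Text tokens and the task-based schema remain visible to everything; only label-to-label attention is pruned.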
Furthermore, we take the encoded hidden vectors from the last Transformer layer, combining the special-token part into the schema representations H_s ∈ R^{N_s×d} and the passage-token part into the text representations H_x ∈ R^{N_x×d}, with hidden size d.

Triaffine Attention for Span Representation
After obtaining the schema representations and text representations from the auto-encoder LM, the next challenge is to construct a unified decoding format that is compatible with different IE structures, with the goal of adaptively exploiting the schemas to control various extraction targets. Take the example in Figure 4: for the event extraction system, we locate the start and end indices of the word boundaries of "Dariues", "Ferguson" and "injure" as the semantic roles, categorized as the Agent, Victim and Trigger semantic types (entity/trigger) respectively, and collectively attached to the Injure semantic type (event). For the relation extraction system, we associate the semantic roles "Betsy Ross" and "Philadelphia" by attaching their intersecting information to the Live in semantic type (relation).
In conjunction with the discussion in the Introduction, we consider two elements for universally modeling IE tasks as joint span detection, classification and association: I) different extraction targets are presented in the form of spans, relying on unified information carriers to accommodate various semantic roles and semantic types; II) a span-extractive architecture is necessary for establishing schema-to-text information interaction, which can adaptively extract schema-related semantic information from the text. For the first proposal, we introduce two information carriers for decoding heterogeneous IE structures in a unified span format: 1. the Structural Table, a rank-2 scoring matrix over the start/end positions of the inside tokens, which serves as the unified carrier for span-extractive parsing; 2. the Spotting Designator, which indicates the locations of spans in the preceding structural table that represent the extraction targets corresponding to a particular schema.
For the second proposal, we explore the internal interaction of the inside tokens by converting the text representation into span representations. We apply two separate FFNs to create different representations (H^s_x / H^e_x) for the start/end positions of the inside tokens. To let such multiple heterogeneous factors interact simultaneously, we define a deep triaffine transformation with a weight tensor W ∈ R^{d×d×d}, which applies triaffine attention to aggregate the schema-wise span representations by treating the schemas as queries and the start/end representations of the inside tokens as keys and values. In this process, the triaffine transformation injects each schema's information into the span representations and resolves the corresponding extraction targets. It creates an N_s × N_x × N_x scoring tensor S by successive matrix multiplications:

S = σ(H_s ×_1 W ×_2 H^s_x ×_3 H^e_x),  (2)

where ×_k is the matrix multiplication between the input tensor and dimension k of W, and σ(*) denotes the Sigmoid activation function. At this point, the tensor S provides a mapping score from each schema to the internal spans of the text, where each rank-2 scoring matrix corresponding to a specific schema is a structural table. For the r-th structural table, the affine score of each span (p, q) that starts at p and ends at q is denoted S_{r,p,q} ∈ [0, 1], and the affine score of a valid span in the structural table is a spotting designator. We divide all N_s structural tables into three parts according to the distribution of the schemas: N_sd for span detection, N_sc for span classification, and N_sa for span association. For the different schemas, we develop their spotting designators with the following strategies:

Span Detection: We use the structural table derived from the task-based schema representation for span detection, which can be obtained from the hidden state of the special token [CLS].
Since the [CLS] token is mutually visible to the other schemas, the task-based schema representation can capture the span-related semantic information of the semantic roles from the task and label names. The spotting designators identify the start and end indices of the i-th semantic role as (s_i, e_i) along the axes.

Span Classification: The label-based schema representations for entity/argument/trigger/event types are used for span classification. The spotting designators are identical to the span positions of the semantic roles, indicating that the semantic type of the i-th span can be identified by attaching to the (s_i, e_i) position in the corresponding structural table.

Span Association: The label-based schema representations for relation/sentiment types are used for span association. In this process, we model the potentially related semantic roles and correlate them with the corresponding semantic types. The spotting designators are located at two interleaved positions associated with the semantic roles of the semantic type; that is, for the i-th and j-th spans, the extraction target is transformed into the identification of the (s_i, s_j) and (e_i, e_j) positions in the corresponding structural table.
Note that all span values in the structural tables for label-based schemas are masked except for the spotting designators, because we only need to observe the semantic types and semantic associations among the detected spans. Specifically, the spotting designators for span detection are the spans with q ≥ p, and the spotting designators for span classification and association are defined by the position consistency and interleaving of valid spans with S_{r,p,q} = 1 in span detection.
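The triaffine scoring above can be sketched with plain Python loops. This is a didactic version under the stated shapes; a practical implementation would use a batched tensor contraction (e.g. einsum) on GPU, and the function name is ours.

```python
import math

def triaffine_scores(H_s, H_start, H_end, W):
    """S[r][p][q] = sigmoid(sum_{i,j,k} H_s[r][i] * W[i][j][k]
                            * H_start[p][j] * H_end[q][k]).

    H_s: N_s x d schema representations; H_start / H_end: N_x x d
    start/end token representations; W: d x d x d triaffine weights.
    """
    Ns, Nx, d = len(H_s), len(H_start), len(W)
    S = [[[0.0] * Nx for _ in range(Nx)] for _ in range(Ns)]
    for r in range(Ns):
        for p in range(Nx):
            for q in range(Nx):
                z = sum(H_s[r][i] * W[i][j][k] * H_start[p][j] * H_end[q][k]
                        for i in range(d) for j in range(d) for k in range(d))
                S[r][p][q] = 1.0 / (1.0 + math.exp(-z))  # Sigmoid
    return S

# Tiny example: one schema, two tokens, hidden size 2, and a single
# non-zero weight W[0][0][0] = 0.5.
W = [[[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
S = triaffine_scores([[1.0, 0.0]],
                     [[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.0, 1.0]], W)
```

Each rank-2 slice `S[r]` is the structural table for schema r, with `S[r][p][q]` scoring the span that starts at p and ends at q.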

EX Training Procedure
Given the input sequence x_inp, we uniformly reformat the different output targets as a rank-3 matrix Y that shares the same spotting designators as the triaffine scoring tensor. Similarly, we denote the value of each valid span as Y_{r,p,q} ∈ {0, 1}, with Y_{r,p,q} = 1 denoting a desirable span in the ground truth and Y_{r,p,q} = 0 denoting a meaningless span for a semantic role or semantic type. Hence it is a binary classification problem, and we optimize our models with binary cross-entropy:

L = Σ_{r=1}^{N_s} Σ_{p=1}^{N_x} Σ_{q=1}^{N_x} BCE(Y_{r,p,q}, S_{r,p,q}).  (3)
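The training objective can be sketched as a sum of binary cross-entropy terms over all cells of the rank-3 target matrix. Masking of invalid spans is omitted for brevity, and the function name is ours.

```python
import math

def ex_training_loss(Y, S, eps=1e-12):
    """Sum of binary cross-entropy terms BCE(Y[r][p][q], S[r][p][q])
    over the whole rank-3 target matrix (span masking omitted)."""
    total = 0.0
    for Yr, Sr in zip(Y, S):
        for Yp, Sp in zip(Yr, Sr):
            for y, s in zip(Yp, Sp):
                total -= y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
    return total

# One schema, two tokens: the ground truth marks only span (0, 0).
Y = [[[1, 0], [0, 0]]]
S = [[[0.9, 0.1], [0.2, 0.1]]]
loss = ex_training_loss(Y, S)
```

The `eps` term guards against `log(0)`; a production implementation would instead use a numerically stable logits-based BCE.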

Experiments
To verify the effectiveness of our UniEX, we conduct extensive experiments on different IE tasks under supervised (high-resource), few-shot and zero-shot (low-resource) scenarios.

Experimental Setup
For the supervised setting, we follow the preparation in TANL (Paolini et al., 2020) and UIE (Lu et al., 2022) to collect 14 publicly available IE benchmark datasets and cluster the well-representative IE tasks into 4 groups: entity, relation, event and structured sentiment extraction. In particular, for each group, we design a corresponding conversion regulation to translate the raw data into the unified EX format. Then, for the few-shot setting, we adopt the popular datasets FewNERD (Ding et al., 2021) and Cross-Dataset (Hou et al., 2020) for few-shot entity extraction and use the same domain partition as Ma et al. (2022b). For the zero-shot setting, we use the common zero-shot relation extraction datasets Wiki-ZSL (Chen and Li, 2021) and FewRel (Han et al., 2018) and follow the same data and label splitting process as Chia et al. (2022). Following the same evaluation metrics as all previous methods, we use span-based offset Micro-F1 with strict match criteria as the primary metric for performance comparison. Please refer to Appendix A for more details on dataset descriptions, unified EX input formats, metrics and training implementation.
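As a hedged example of such a conversion regulation, a raw relation-extraction instance might be translated into the unified EX format of schemas plus target cells as follows. The label wording, schema ordering and index conventions here are illustrative assumptions, not the paper's exact rules.

```python
def to_ex_format(entities, relations):
    """Translate a raw relation-extraction instance into the unified EX
    format: a list of schemas and the target cells (table, row, col).

    entities: dicts with token-offset "start"/"end" and "type";
    relations: dicts with entity indices "head"/"tail" and "type".
    """
    entity_labels = sorted({e["type"] for e in entities})
    relation_labels = sorted({r["type"] for r in relations})
    schemas = ["relation extraction"] + entity_labels + relation_labels
    targets = []
    for e in entities:
        # Span detection under the task-based schema (table 0) ...
        targets.append((0, e["start"], e["end"]))
        # ... and span classification under the matching entity label.
        targets.append((1 + entity_labels.index(e["type"]), e["start"], e["end"]))
    for r in relations:
        # Span association: interleaved (start, start) / (end, end) cells.
        h, t = entities[r["head"]], entities[r["tail"]]
        table = 1 + len(entity_labels) + relation_labels.index(r["type"])
        targets.append((table, h["start"], t["start"]))
        targets.append((table, h["end"], t["end"]))
    return schemas, targets

# "Betsy Ross lived in Philadelphia ." with a live-in relation.
schemas, targets = to_ex_format(
    [{"type": "person", "start": 0, "end": 1},
     {"type": "location", "start": 4, "end": 4}],
    [{"type": "live in", "head": 0, "tail": 1}],
)
```

Each target cell (r, p, q) corresponds to a 1-entry Y_{r,p,q} in the rank-3 training matrix described in the EX training procedure.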

Experiments on Supervised Settings
In our experiments under the high-resource scenario, we compare our approach with the state-of-the-art generative universal IE architectures that provide a universal backbone for IE tasks based on T5 (Raffel et al., 2020a), including TANL (Paolini et al., 2020) and UIE (Lu et al., 2022). For a fair comparison, we only consider results without exploiting large-scale contexts and external knowledge beyond the dataset-specific information, and present the average outcome if a baseline was conducted in multiple runs. The main results are reported in Table 1. We can observe that: 1) By modeling IE as joint span detection, classification and association, and encoding the schema-based prompt and input texts with the triaffine attention mechanism, UniEX provides an effective universal extractive backbone for all IE tasks. UniEX outperforms the universal IE models with approximate backbone sizes, achieving new state-of-the-art performance on almost all tasks and datasets.
2) The introduction of label-based schema facilitates the model learning task-relevant knowledge, while the triaffine scoring matrix establishes the correspondence between each schema and extraction targets. Obviously, the UniEX can better capture and share label semantics than using generative structures to encode underlying information. Meanwhile, triaffine transformation is a unified and cross-task adaptive operation, precisely controlling where to detect and which to associate in all IE tasks. Compared with the TANL and UIE, our approach achieves significant performance improvement on most datasets, with nearly 1.36% and 1.52% F1 on average, respectively.

Experiments on Low-resource Scenarios
To verify the generalization and transferability of UniEX in low-resource scenarios, we evaluate models under few-shot and zero-shot settings, respectively. To reduce the influence of noise caused by random sampling on the experimental results, we repeat the data/label selection processes for five different random seeds and report the averaged results, as in previous works. Tables 2 and 3 illustrate the main results on FewNERD and Cross-Dataset of our approach alongside those reported by previous methods. It can be seen that UniEX achieves the best performance under different type granularities and domain divisions, and outperforms the prior methods by a large margin. Compared with DecomMeta on Cross-Dataset, UniEX achieves a performance improvement of up to 6.94% and 5.63% F1 on average in 1-shot and 5-shot respectively, which demonstrates the effectiveness of our approach in learning general IE knowledge. It indicates that even without pre-training on a large-scale corpus, our approach can still sufficiently excavate the semantic information related to the target entities from label names, which enhances the understanding of task-specific information when data is extremely scarce. Secondly, we compare UniEX with the latest baselines TableSequence (Wang and Lu, 2020) and RelationPrompt (Chia et al., 2022) on the zero-shot relation triplet extraction task for the Wiki-ZSL and FewRel datasets in Table 4. In both single-triplet and multi-triplet evaluation, UniEX consistently outperforms the baseline models in terms of accuracy and overall F1 score respectively, which demonstrates the ability of our approach to handle unseen labels. Although we observe a lack of advantage in recall for multi-triplet evaluation, the significant improvement in precision allows our approach to achieve a balanced precision-recall ratio.
The reason for such a difference is probably that the directional matching in the triaffine transformation tends to guide the model to predict more credible targets.

Ablation Study
In this section, we verify the necessity of the key components of UniEX, including the flow controlling and the triaffine transformation, and report ablation results on four downstream tasks.
W/O SAM: removing the schema-based attention mask matrix that controls the flow of labels. We find that model performance drops to almost zero on many tasks, which demonstrates the importance of eliminating the intra-information of labels. The attention mask matrix makes unrelated labels unreachable to each other, effectively avoiding the mutual interference of label semantics.
W/O TriA: replacing the triaffine transformation with a multi-head selection network, which multiplies the schema with the head and tail span representations of the text respectively, then replicates and adds them to obtain the scoring matrix. The significant performance decline demonstrates the important role of the triaffine attention mechanism in establishing a dense correspondence between schemas and text spans.
W/O Label: replacing the label names with the special tokens [unused n], which eliminates label semantics while still allowing the model to distinguish between different labels. We find a slight degradation of model performance on the small datasets CoNLL03 and 16-res, indicating that the prior knowledge provided by label names can effectively compensate for the deficiency of training data. As the correspondence between schemas and extraction targets is not affected, model performance on large datasets remains stable.

Efficiency Analysis
To verify the computational efficiency of our approach on universal IE, we compare inference speed with UIE (Lu et al., 2022) on the four standard datasets mentioned in Section 4.4. As shown in Table 6, since generating the target structure is a token-wise process, the inference speed of UIE is slow and limited by the length of the target structure. On the contrary, UniEX can decode all target structures at once from the scoring matrices obtained by the triaffine transformation, with an average speedup ratio of 13.3 over UIE.
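The one-pass decoding can be sketched as follows. This is a simplified version for a single relation-extraction instance; thresholding at 0.5 and the table layout are assumptions consistent with the spotting-designator strategies described earlier.

```python
def decode_all(S, num_entity_labels, threshold=0.5):
    """Decode every target structure from the scoring tensor in a single
    pass.  Table 0 detects spans, tables 1..num_entity_labels classify
    them, and the remaining tables associate span pairs through the
    interleaved (start, start) / (end, end) spotting designators."""
    Nx = len(S[0])
    spans = [(p, q) for p in range(Nx) for q in range(p, Nx)
             if S[0][p][q] > threshold]
    entities = [(p, q, c) for (p, q) in spans
                for c in range(1, 1 + num_entity_labels)
                if S[c][p][q] > threshold]
    relations = []
    for r in range(1 + num_entity_labels, len(S)):
        for (p1, q1) in spans:
            for (p2, q2) in spans:
                if S[r][p1][p2] > threshold and S[r][q1][q2] > threshold:
                    relations.append(((p1, q1), r, (p2, q2)))
    return entities, relations

# Tables: 0 = detection, 1-2 = entity labels, 3 = relation label.
S = [[[0.0] * 5 for _ in range(5)] for _ in range(4)]
S[0][0][1] = 0.9; S[0][4][4] = 0.8   # detected spans (0, 1) and (4, 4)
S[2][0][1] = 0.9                     # (0, 1) classified as label 2
S[1][4][4] = 0.7                     # (4, 4) classified as label 1
S[3][0][4] = 0.9; S[3][1][4] = 0.9   # (0, 1) -[table 3]-> (4, 4)
entities, relations = decode_all(S, num_entity_labels=2)
```

Unlike token-wise autoregressive generation, nothing here depends on the length of a generated output sequence, which is the source of the speedup discussed above.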
Conclusion
In this paper, we introduce a new paradigm for universal IE by converting all IE tasks into joint span detection, classification and association problems with a unified extractive framework. UniEX collaboratively learns generalized knowledge from schema-based prompts and controls the correspondence between schemas and extraction targets via the triaffine attention mechanism. Experiments under both supervised settings and low-resource scenarios verify the transferability and effectiveness of our approach.

Limitations
In this paper, our main contribution is an effective and efficient framework for universal IE. We aim to introduce a new unified IE paradigm with extractive structures and a triaffine attention mechanism, which can achieve better performance on a variety of tasks and scenarios with faster inference speed. However, it is non-trivial to decide whether a sophisticated, hand-crafted prompt is required for complex datasets and large label sets. In addition, we only compare with limited baselines under specific dataset configurations when analyzing the performance of UniEX in supervised, few-shot and zero-shot settings. In our experiments, we implement only a few comparative experiments between BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b) due to the limit of computational resources.

Ethical Considerations
As an important domain of natural language processing, information extraction is a common technology in our society. It is necessary to discuss the ethical influence of using extraction models (Leidner and Plachouras, 2017). In this work, we develop a new universal IE framework, which enhances generalization ability in various scenarios. As discussed previously (Schramowski et al., 2019, 2022; Blodgett et al., 2020), pre-trained LMs might contain human-made biases, which might be embedded in both the parameters and outputs of the open-source models. In addition, we note the potential abuse of universal IE models, as these models achieve excellent performance in various domains and settings after pre-training on large-scale IE datasets, which allows them to be integrated into applications often without justification. We encourage open debate on their utilization, such as task selection and deployment, hoping to reduce the chance of any misconduct.

A.1.3 Zero-shot Setting
For the zero-shot setting, we conduct downstream tasks on 2 zero-shot relation extraction datasets: FewRel (Han et al., 2018) is hand-annotated for few-shot relation extraction; we further adapt it to the zero-shot setting by splitting the data into disjoint relation label sets for training, validation and testing, as in Chia et al. (2022).
Wiki-ZSL (Chen and Li, 2021) is constructed through distant supervision over Wikipedia articles and the Wikidata knowledge base.
To partition the data into seen and unseen label sets, we follow the same process as Chia et al. (2022): a number of labels are randomly selected as unseen labels, while the remaining labels are treated as seen labels during training. The unseen label set size is set to m = 5 in our experiments. To reduce the effect of experimental noise, the label selection process is repeated for five different random seeds to produce different data folds. For each data fold, the test set consists of the sentences containing unseen labels. Five validation labels from the seen labels are used to select sentences for early stopping and hyperparameter tuning. The remaining sentences are treated as the train set. Hence, the zero-shot setting ensures that the train, validation and test sentences belong to disjoint label sets.

A.2 Evaluation Metric
We use span-based offset Micro-F1 as the primary metric to evaluate the model, following Lu et al. (2022):
• Entity: an entity mention is correct if its offsets and type match a reference entity.
• Relation Strict: relation with strict match; a relation is correct if its relation type is correct and the offsets and entity types of the related entity mentions are correct.
• Relation Triplet: relation with boundary match; a relation is correct if its relation type is correct and the strings of the subject/object are correct.
• Event Trigger: an event trigger is correct if its offsets and event type match a reference trigger.
• Event Argument: an event argument is correct if its offsets, role type, and event type match a reference argument mention.
• Sentiment Triplet: a triplet is correct if the offset boundary of the target, the offset boundary of the opinion span, and the target sentiment polarity are all correct at the same time.
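A strict-match Micro-F1 can be sketched generically over hashable tuples; the exact tuple contents per task follow the bullet definitions above (e.g. (start, end, type) for entities), and the function name is ours.

```python
def micro_f1(gold, pred):
    """Span-based Micro-F1 with strict matching: a predicted item counts
    as a true positive only if it exactly equals a gold item (offsets,
    types, etc.).  gold/pred are per-sentence lists of hashable tuples."""
    tp = fp = fn = 0
    for g_items, p_items in zip(gold, pred):
        g, p = set(g_items), set(p_items)
        tp += len(g & p)   # exact matches
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed references
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Entity evaluation: (start, end, type) must all match exactly.
gold = [[(0, 1, "person"), (4, 4, "location")]]
pred = [[(0, 1, "person"), (4, 4, "organization")]]
f1 = micro_f1(gold, pred)  # one hit, one miss: P = R = F1 = 0.5
```

Because counts are pooled across sentences before computing precision and recall, this is micro- rather than macro-averaging.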

A.3 Training Implementation
To make a fair comparison, we initialize UniEX-base and UniEX-large with the RoBERTa-base and RoBERTa-large checkpoints (Liu et al., 2019b) for the supervised setting, and use the BERT-base checkpoint (Devlin et al., 2019) as the backbone for the few-shot and zero-shot settings. The model architectures are shown in Table 9. We employ the Adam optimizer (Kingma and Ba, 2015) with a weight decay of 1e-8.

Figure 5: The decoding process of UniEX. Each schema corresponds to a structural table, and each rectangle in a structural table represents an internal span; the gray spans are invalid spans that do not participate in model training. The other spans are spotting designators: water-blue spans for span detection, viridis spans for span classification and atrovirens spans for span association.