Mirror: A Universal Framework for Various Information Extraction Tasks

Sharing knowledge between information extraction (IE) tasks has always been a challenge due to diverse data formats and task variations. This divergence wastes information and increases the difficulty of building complex applications in real scenarios. Recent studies often formulate IE tasks as a triplet extraction problem. However, such a paradigm does not support multi-span and n-ary extraction, leading to weak versatility. To this end, we reorganize IE problems into unified multi-slot tuples and propose a universal framework for various IE tasks, namely Mirror. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm to extract all spans in a single step. Notably, this graph structure is incredibly versatile: it supports not only complex IE tasks but also machine reading comprehension and classification tasks. We manually construct a corpus containing 57 datasets for model pretraining and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results demonstrate that our model has decent compatibility and outperforms or reaches competitive performance with SOTA systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at https://github.com/Spico197/Mirror.


Introduction
Information Extraction (IE) is a fundamental field in Natural Language Processing (NLP) that aims to extract structured information from unstructured text (Grishman, 2019), covering tasks such as Named Entity Recognition (NER) (Qu et al., 2023b; Gu et al., 2022; Qu et al., 2023a), Relation Extraction (RE) (Cheng et al., 2021), and Event Extraction (EE). However, each IE task usually comes with its own data structures and delicately designed models, which makes it difficult to share knowledge across tasks (Lu et al., 2022; Josifoski et al., 2022).

In order to unify the data formats and take advantage of common features between different tasks, recent studies follow two main routes. The first is to utilize generative pretrained language models (PLMs) to generate the structured information directly. Lu et al. (2022) and Paolini et al. (2021) formulate IE tasks as a sequence-to-sequence generation problem and use generative models to predict the structured information autoregressively. However, such methods cannot provide the exact positions of the structured information, which is essential to the NER task and to fair evaluation (Hao et al., 2023). Besides, generation-based methods are usually slow and consume huge resources when trained on large-scale datasets (Wang et al., 2022). The second route is to apply extractive PLMs, which are faster to train and run inference with.
USM (Lou et al., 2023) regards IE tasks as a triplet prediction problem via semantic matching. However, this method is limited to a small range of triplet-based tasks, and it is unable to address multi-span and n-ary extraction problems. To overcome the above challenges, we propose Mirror, a novel framework that can handle complex multi-span extraction, n-ary extraction, machine reading comprehension (MRC), and even classification tasks, which are not supported by previous universal IE systems. As exemplified in Figure 1, we formulate IE tasks as a unified multi-slot tuple extraction problem and transform those tuples into multi-span cyclic graphs. This graph structure is rather flexible and scalable: it can be applied not only to complex IE tasks but also to MRC and classification tasks. Mirror takes schemas as part of the model inputs, which naturally benefits few-shot and zero-shot tasks.
Compared with other models in Table 1, Mirror supports efficient non-autoregressive decoding with position indexing and shows good compatibility across different tasks and datasets. We conduct extensive experiments on 30 datasets from 8 tasks, including NER, RE, EE, Aspect-based Sentiment Analysis (ABSA), multi-span discontinuous NER, n-ary hyper RE, MRC, and classification. To enhance the few-shot and zero-shot abilities, we manually collect 57 datasets across 5 tasks into a whole corpus for model pretraining. The experimental results demonstrate that Mirror achieves competitive results under few-shot and zero-shot settings.
Our contributions are summarized as follows:
• We propose a unified schema-guided multi-slot extraction paradigm, which is capable of complex information extraction, machine reading comprehension, and even classification tasks.
• We propose Mirror, a universal non-autoregressive framework that transforms multiple tasks into a multi-span cyclic graph.
• We conduct extensive experiments on 30 datasets from 8 tasks, and the results show that our model achieves competitive results under few-shot and zero-shot settings.
Related Work

Multi-task Information Extraction

Multi-task IE has been a popular research topic in recent years. The main idea is to use a single model to perform multiple IE tasks, and IE tasks can be formulated as different graph structures. Li et al. (2022) formulate flat, nested, and discontinuous NER tasks as a graph with next-neighboring-word and tail-head-word connections.

In addition to explicit graph-based multi-task IE systems, generative language models are widely used. Yan et al. (2021b) and Yan et al. (2021a) add special index tokens into the BART (Lewis et al., 2020) vocabulary to help perform various NER and ABSA tasks and obtain explicit span positions. TANL (Paolini et al., 2021) applies T5 (Raffel et al., 2020) to generate texts with special enclosures as the predicted information. GenIE (Josifoski et al., 2022) and DeepStruct (Wang et al., 2022) share a similar idea of generating subject-relation-object triplets, and DeepStruct extends the model size to 10B with GLM (Du et al., 2022).

Schema-guided Information Extraction
In schema-guided IE systems, schemas are provided as a guidance signal to help the model extract target information. UIE (Lu et al., 2022) categorizes IE tasks into span spotting and associating elementary tasks and devises a linearized query language. Fei et al. (2022) introduce the hyper relation extraction task to represent complex IE tasks like EE, and utilize external parsing tools to enhance the text representations. InstructUIE (Wang et al., 2023) formulates schemas into instructions and uses FlanT5-11B (Chung et al., 2022) to perform multi-task instruction tuning.
While the above methods utilize generative language models, they cannot predict exact positions, which brings ambiguity during evaluation (Hao et al., 2023). Besides, large generative language models are usually slow to train and infer and require tons of computing resources. USM (Lou et al., 2023) applies BERT-family models to extract triplets non-autoregressively: it regards IE as a unified schema matching task and uses a label-text matching model to extract triplets. However, these methods cannot extend to complex IE tasks, such as multi-span discontinuous NER and n-ary information extraction.

Mirror Framework
In this section, we introduce the Mirror framework. We first describe the unified data input format, then introduce the unified task formulation and the model structure.

Unified Data Interface
To enable the model to handle different IE tasks, we propose a unified data interface for the model input. As shown in Figure 2, there are three parts: instruction, schema labels, and text. The instruction is composed of a leading token [I] and a natural language sentence; the [I] token marks the beginning of the instruction part, and the schema-label and text parts are likewise prefixed by their own leading tokens (e.g., [LM], [LR], [LC], [TL]). With the above three parts, we can formulate extractive MRC, classification, and IE tasks into a unified data interface, and the model can be trained in a unified way even though it is not based on generative language models. For robust model training, we manually collect 57 datasets from 5 tasks to build a corpus for model pretraining. The data statistics for each IE task are listed in Table 2. To balance the number of examples in each task, we set a different maximum number of samples N_max for each task dataset. If the number of instances in a dataset is less than N_max, we keep the original dataset unchanged and do not perform oversampling. For the NER, RE, and EE tasks, we manually design a set of instructions and randomly pick one of them for each sample. MRC datasets and some classification datasets come with inherent questions, so their numbers of distinct instructions are much higher than those of the other tasks. For detailed statistics on each dataset, please refer to Appendix C.
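As a rough illustration of this interface, the sketch below assembles an RE input. The helper function and instruction text are hypothetical (not the released Mirror code); the token markers follow Figure 2.

```python
# A minimal sketch of assembling a unified model input (hypothetical helper,
# not the released Mirror implementation). Token markers follow Figure 2:
# [I] instruction, [LR] relation label, [TL] text connected with schema labels.

def build_input(instruction: str, schema_labels: list[str],
                label_token: str, text: str) -> str:
    """Concatenate the instruction, schema-label, and text parts."""
    parts = ["[I]", instruction]
    for label in schema_labels:
        parts += [label_token, label]          # e.g., "[LR] friend of"
    parts += ["[TL]", text]
    return " ".join(parts)

# Relation extraction example in the spirit of Figure 2.
print(build_input(
    instruction="Please extract relations between entities.",
    schema_labels=["friend of"],
    label_token="[LR]",
    text="Jerry Smith is a friend of Tom.",
))
# [I] Please extract relations between entities. [LR] friend of [TL] Jerry Smith is a friend of Tom.
```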

Multi-slot Tuple and Multi-span Cyclic Graph
We formulate IE tasks as a unified multi-slot tuple extraction problem. As exemplified in Figure 2, in the RE task, the model is expected to extract a three-slot tuple: (relation, head entity, tail entity); here, the tuple is ([LR] friend of, Jerry Smith, Tom). The number of tuple slots can vary across tasks, so Mirror is able to solve n-ary extraction problems.
As shown in Figure 1 and the top right of Figure 3, we formulate multi-slot tuples into a unified multi-span cyclic graph and regard labels as the leading tokens in schema labels. There are three types of connections in the graph: the consecutive connection, the jump connection, and the tail-to-head connection. The consecutive connection is applied to spans within the same entity: for an entity with multiple tokens, it connects the tokens from the first to the last. As shown in Figure 3, "Jerry" connects to "Smith". If there is only one token in an entity, the consecutive connection is not used. The jump connection connects different slots in a tuple. Schema labels and spans from texts are in different slots, so they are linked by jump connections; for instance, the head and tail entities of a relation triplet lie in different slots and are therefore linked by a jump connection. The tail-to-head connection helps locate the graph boundaries: it connects the last token of the last slot to the first token of the first slot in a tuple.
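To make the edge types concrete, the snippet below (our own illustration, not part of the released code) encodes the tuple ([LR] friend of, Jerry Smith, Tom) with the token positions used in the paper's running example: the label tag at position 9, "Jerry"/"Smith" at 16/17, and "Tom" at 22.

```python
# Typed edges of the multi-span cyclic graph for the tuple
# ([LR] friend of, Jerry Smith, Tom); positions follow the paper's example.
CONSECUTIVE, JUMP, TAIL_TO_HEAD = "consecutive", "jump", "tail-to-head"

edges = [
    (9, 16, JUMP),           # label slot -> first token of the head entity
    (16, 17, CONSECUTIVE),   # "Jerry" -> "Smith": one multi-token entity
    (17, 22, JUMP),          # head-entity slot -> tail-entity slot
    (22, 9, TAIL_TO_HEAD),   # last token of last slot -> first token of first slot
]
```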
In practice, we convert the answer of each slot into text positions. For schema labels, we use the position of the leading tags instead of literal strings. For text spans, the position is a single number if the span has only one token; otherwise, the start and end positions are listed. For example, the 3-slot relation tuple ([LR] friend of, Jerry Smith, Tom) is converted into the position sequence (9, 16, 17, 22). During inference, we first find the forward chain (9, 16, 17, 22) and then verify the chain with the tail-to-head connection (22→9). After that, the multi-slot tuple is obtained by splitting the chain at the jump connections (9→16) and (17→22).
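A hedged Python sketch of this decoding idea follows (our reconstruction of the procedure described above, not the released Algorithm 1): walk forward along consecutive and jump edges to form candidate chains, keep only chains that a tail-to-head edge closes, then split each verified chain into slots at the jump edges.

```python
# Sketch of non-autoregressive graph decoding (our reconstruction,
# not the released implementation).

def decode_tuples(adj: dict) -> list:
    """adj maps edge type -> set of (i, j) token-index pairs.
    Returns tuples as lists of (start, end) spans, one span per slot."""
    nxt = {}
    for kind in ("consecutive", "jump"):
        for i, j in adj[kind]:
            nxt.setdefault(i, []).append((j, kind))

    tuples = []
    for tail, head in adj["tail-to-head"]:      # the cycle-closing edge
        chain, node = [(head, None)], head
        while node != tail and node in nxt:
            node, kind = nxt[node][0]           # simplification: single branch
            chain.append((node, kind))
        if node != tail:                        # chain not verified -> discard
            continue
        slots, start, prev = [], head, head
        for node, kind in chain[1:]:
            if kind == "jump":                  # jump edges separate slots
                slots.append((start, prev))
                start = node
            prev = node
        slots.append((start, prev))
        tuples.append(slots)
    return tuples

adj = {"consecutive": {(16, 17)}, "jump": {(9, 16), (17, 22)},
       "tail-to-head": {(22, 9)}}
print(decode_tuples(adj))   # [[(9, 9), (16, 17), (22, 22)]]
```

On the running example, the decoded spans (9, 9), (16, 17), and (22, 22) correspond to the [LR] friend of tag, "Jerry Smith", and "Tom".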

Model Structure
With the unified data interface and the multi-span cyclic graph, we propose a unified model structure for IE tasks. For each token $x_i$ from the inputs, Mirror transforms it into a vector $h_i \in \mathbb{R}^{d_h}$ via a BERT-style extractive pretrained language model (PLM). We use biaffine attention (Dozat and Manning, 2017) to obtain the adjacency matrix $A$ of the multi-span cyclic graph.
Mirror calculates the linking probability $p^{k}_{ij}$, $k \in \{\text{consecutive}, \text{jump}, \text{tail-to-head}\}$, between $x_i$ and $x_j$ as Equation 1 shows:

$$p^{k}_{ij} = \sigma\left(\mathrm{FFNN}_s(h_i)^{\top}\, U^{k}\, \mathrm{FFNN}_e(h_j) + b^{k}\right) \quad (1)$$

The final $A$ is obtained by thresholding $p^{k}_{ij}$, where $U$ and $b$ are trainable parameters and the three link types $k$ denote the consecutive, jump, and tail-to-head connections. FFNN is a feed-forward neural network with rotary positional embedding as introduced in Su et al. (2021); it comprises a linear transformation, a GELU activation function (Hendrycks and Gimpel, 2023), and dropout (Srivastava et al., 2014).
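A minimal PyTorch sketch of this biaffine scorer is given below, assuming hypothetical dimensions and omitting the rotary positional embedding for brevity; it is an illustration of Equation 1, not the released Mirror code.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Sketch of the biaffine link scorer (Equation 1) with three link types:
    consecutive, jump, and tail-to-head. Rotary positional embeddings
    (Su et al., 2021) are omitted here for brevity."""

    def __init__(self, d_h: int = 768, d: int = 512, n_types: int = 3):
        super().__init__()
        # FFNN = linear transformation + GELU + dropout, as in the paper.
        self.ffnn_s = nn.Sequential(nn.Linear(d_h, d), nn.GELU(), nn.Dropout(0.1))
        self.ffnn_e = nn.Sequential(nn.Linear(d_h, d), nn.GELU(), nn.Dropout(0.1))
        self.U = nn.Parameter(torch.randn(n_types, d, d) * 0.02)  # biaffine weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_h) token vectors from the extractive PLM.
        s, e = self.ffnn_s(h), self.ffnn_e(h)
        # logits[b, k, i, j]: score that token i links to token j with type k.
        logits = torch.einsum("bid,kde,bje->bkij", s, self.U, e)
        return torch.sigmoid(logits)

scorer = BiaffineScorer()
p = scorer(torch.randn(2, 30, 768))   # (2, 3, 30, 30) link probabilities
A = (p > 0.5).long()                  # thresholded adjacency matrix
```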
During training, we adopt the imbalanced-class multi-label categorical cross-entropy (Su et al., 2022) as the loss function:

$$\mathcal{L} = \log\left(1 + \sum_{(i,j) \in \Omega_{\mathrm{neg}}} e^{s^{k}_{ij}}\right) + \log\left(1 + \sum_{(i,j) \in \Omega_{\mathrm{pos}}} e^{-s^{k}_{ij}}\right)$$

where $\Omega_{\mathrm{neg}}$ stands for negative samples ($A^{k}_{ij} = 0$), $\Omega_{\mathrm{pos}}$ denotes positive samples ($A^{k}_{ij} = 1$), and $s^{k}_{ij}$ is the pre-sigmoid link score from Equation 1.
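A PyTorch sketch of this loss under the common formulation from Su et al. (2022) is shown below; the masking details are our own and may differ from Mirror's actual implementation.

```python
import torch

def multilabel_categorical_ce(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """scores: raw link scores s^k_ij (pre-sigmoid); targets: gold adjacency A^k_ij in {0, 1}.
    Computes log(1 + sum_neg e^{s}) + log(1 + sum_pos e^{-s}) over the last dimension."""
    scores = (1 - 2 * targets) * scores                              # negate positives
    neg = scores.masked_fill(targets.bool(), float("-inf"))          # keep negatives
    pos = scores.masked_fill(~targets.bool(), float("-inf"))         # keep positives
    zeros = torch.zeros_like(scores[..., :1])                        # the "1 +" term as e^0
    neg_loss = torch.logsumexp(torch.cat([neg, zeros], dim=-1), dim=-1)
    pos_loss = torch.logsumexp(torch.cat([pos, zeros], dim=-1), dim=-1)
    return (neg_loss + pos_loss).mean()

logits = torch.randn(2, 3, 30, 30)                 # (batch, link type, i, j) scores
gold = (torch.rand(2, 3, 30, 30) > 0.95).float()   # sparse gold adjacency
loss = multilabel_categorical_ce(logits.flatten(1), gold.flatten(1))
```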

Main Results
Mirror's performance on 13 IE benchmarks is presented in Table 3. Mirror surpasses the baseline models on some datasets in the NER (ACE04), RE (ACE05, NYT), and EE (CASIE-Trigger) tasks. Compared to the extraction-based USM, Mirror achieves competitive results on most tasks, while lagging in NER (ACE05), RE (CoNLL04), and EE (CASIE-Arg). Compared to generation-based methods, Mirror outperforms TANL across all datasets and surpasses UIE on most datasets. When the model size scales to 10B, DeepStruct outperforms Mirror on CoNLL04 in the RE task, while Mirror reaches very close results or outperforms DeepStruct on the other datasets. InstructUIE (11B) demonstrates similar performance on the NER datasets while achieving high scores in RE (SciERC) and EE (ACE05-Tgg & Arg), surpassing the other models by a significant margin. Apart from these datasets, InstructUIE performs about the same as UIE, USM, and Mirror.
We provide ablation studies on Mirror with different pretraining and fine-tuning strategies. Performance degrades if either pretraining or instruction fine-tuning is omitted. Mirror benefits from pretraining when utilizing instructions (w/ Inst.), gaining 0.66 points on average. However, when instructions are discarded (w/o Inst.), pretraining (w/ PT) does not bring a performance gain. Pretraining has been confirmed to enhance model performance for UIE and USM, and it is crucial for enabling zero-shot inference. However, based on the results in Table 3, we find that if Mirror is applied to one specific task with sufficient training resources, the pretraining step may be unnecessary (e.g., on the NYT dataset).
Besides the traditional IE tasks in Table 3, Mirror also supports multi-span discontinuous NER and n-ary hyper relation extraction, as shown in Table 4. We report results of Mirror (w/ PT, w/ Inst.) and Mirror (w/o PT, w/o Inst.) on CADEC based on their good performance on the IE tasks in Table 3; however, Mirror is less powerful than task-specific SOTA models there. On the n-ary hyper relation extraction task, Mirror outperforms the task-specific model CubeRE and achieves new SOTA results. Table 4 indicates Mirror's compatibility with complex multi-span and n-ary extraction problems.
The above facts indicate that Mirror has good compatibility across different IE problems, and we extend the universal IE system to complex multi-span and n-ary extraction tasks, which are not supported by previous universal IE systems.

Few-shot Results
Following Lu et al. (2022), we conduct few-shot experiments and report the average scores in Table 5. Among the four tasks, NER may be relatively easier for the model to deal with: the 10-shot NER score of Mirror is 84.69, while Mirror fine-tuned on the full dataset reaches an F1 score of 92.73. The gaps between 10-shot and fully fine-tuned results are larger on the other datasets, indicating those tasks' difficulty.

Zero-shot Results
Table 6 shows the zero-shot performance on 7 NER datasets. These datasets are not included in pretraining, and we use the pretrained Mirror to make predictions directly without further fine-tuning (noted as Mirror direct in Table 6). ChatGPT is very powerful in the zero-shot NER task and achieves absolute SOTA performance; beyond simple model scaling, we may need to collect a more diverse pretraining corpus for better results.

Results on MRC and Classification
To show the model's compatibility with extractive MRC and classification tasks, we conduct experiments on SQuAD v2 and the GLUE language understanding benchmarks. The experimental results are demonstrated in Table 7.

Analysis on Label Span Types

To analyze the effect of different label span types, we conduct experiments that change the leading token into a literal content string: in a NER task that extracts person entities, for example, we compare the effect of the [LM] token and the literal "person" string as the label span. The results are demonstrated in Table 8. We find that the label type does not make much difference. For Mirror w/ Inst., the literal content string is slightly better than bare tags, with only a 0.19 F1 advantage, while for Mirror w/o Inst., the tag-based method surpasses the content-based method by 0.72 F1. Similar to Baldini Soares et al. (2019), these results show that although the label tag is a simple token without pretraining, it does not affect the model's ability to incorporate features from global and local contexts.

Analysis on Pretraining Datasets
Traditionally, the classification task differs from the extraction task, as they optimize different objectives. Since Mirror unifies the two tasks into one framework, it is interesting to find out how they affect each other in the pretraining phase. We provide an ablation study on different types of pretraining data in Table 9. It is surprising that pretraining on classification datasets helps improve the extraction tasks, with relation extraction being the most affected. This may be due to the similarity between relation labels and semantic class labels. It is also interesting that span-based datasets (e.g., MRC datasets) are beneficial to the classification task (87.50→89.22). Overall, all kinds of pretraining datasets bring mutual benefits and improve the model performance.

Analysis on Inference Speed
We conduct speed tests on the CoNLL03 validation set with one NVIDIA V100 GPU under the same environment. The results are presented in Table 10. Compared to the popular generative T5-large UIE model (Lu et al., 2022), our model is up to 32.61 times faster at inference, and the advantage grows when increasing the batch size from 1 to 2.

Conclusion
We propose Mirror, a schema-guided framework for universal information extraction. Mirror transforms IE tasks into a unified multi-slot tuple extraction problem and introduces the multi-span cyclic graph to represent such structures. Due to this flexible design, Mirror is capable of multi-span and n-ary extraction tasks. Compared to previous systems, Mirror supports not only complex information extraction but also MRC and classification tasks. We manually collect 57 datasets for pretraining and conduct experiments on 30 datasets across 8 tasks. The experimental results show good compatibility, and Mirror achieves performance competitive with state-of-the-art systems.

Limitations
Content input length: Due to the constraint of the DeBERTa backbone model, the maximal sequence length is 512 and can hardly extend to longer texts. This limits the exploration of tasks with many schema labels and of document-level IE.
Multi-turn result modification: Mirror predicts the multi-span cyclic graph in a parallel non-autoregressive style. Although this is efficient for training and inference, it may lack global history knowledge from previous answers.
Data format unification: There are many IE tasks, and the formats may vary a lot. Although the current unified data interface supports most common tasks, it may not be practical for some tasks.
Lack of large-scale event datasets for pretraining: There are many NER and RE datasets. However, there are few large-scale event extraction corpora with high diversity in domains and schemas, which may limit the model performance on event-relevant information extraction tasks.

A Detailed Comparisons

TANL (Paolini et al., 2021) can provide exact positions in NER since it generates the enclosure tags. However, it still faces the ambiguity problem when two entities share the same string in joint entity relation extraction, because the tail entity is a generated text corresponding to an enclosed head entity (refer to Section 3 in the TANL paper). We also calculate the upper-bound F1 scores of relation extraction in a TANL manner, and the results show that it does not ideally generate perfect positions.

B Hyper-parameter Settings

C Dataset Statistics
This section contains detailed statistics for the pretraining and fine-tuning datasets. Pretraining data statistics are listed in Tables 14, 16, 17, 18, and 15. For the sampling number N_max of each kind of dataset, please refer to Table 2. When collecting pretraining data, we refer to the datasets mentioned in Therasa and Mathivanan (2022) and Yang et al. (2022). Downstream data statistics are listed in Table 19. We also provide direct inference results with the pretrained Mirror model in Table 19.

D Case Study
We provide some interesting cases across different tasks with the pretrained Mirror w/ Inst. to manually evaluate its versatility on various tasks under zero-shot settings. The model inputs & outputs are presented in Table 20.

The name of our proposed Mirror is borrowed from the magic mirror in Snow White and the Seven Dwarfs. We hope to build a universal model that can help more people solve more problems.

Figure 1 :
Figure 1: Multi-span cyclic graph for discontinuous NER and RE tasks (best viewed in color). The spans are connected by three types of edges, including consecutive connections, dotted jump connections, and tail-to-head connections. ADR in discontinuous NER refers to the entity label of Adverse Drug Reaction.

Figure 3 :
Figure 3: Model framework (best viewed in color). Mirror first constructs inputs for each task, then utilizes a pretrained language model to predict the adjacency matrix via biaffine attention. After that, the final results are decoded from the adjacency matrix accordingly.

Table 1 :
Comparisons among systems. Circle •: the model may support the task theoretically, but the current implementation is not available. AR: auto-regressive decoding, while NAR is non-autoregressive. Indexing: whether the model could provide exact position information. TANL partly supports indexing because the generated tail entity in relation extraction is text-based without position information. Triplet: "(head, relation, tail)" triplet extraction. Single-span NER: flat and nested NER tasks with consecutive spans. Multi-span: multi-span extraction, e.g., discontinuous NER. N-ary tuple: the ability of n-ary tuple extraction, e.g., quadruple extraction. Classification: the classification tasks. MRC: extractive machine reading comprehension tasks. It is worth noting that generative models (TANL, UIE, DeepStruct, and InstructUIE) may be capable of all the tasks if their current paradigms or patterns are changed; however, since the original papers do not contain relevant experiments, we mark them as ✗ or • here. Please refer to Appendix A for more detailed comparisons.

Figure 2 :
Figure 2: Unified data interface. We design a list of tokens to separate different parts: [I]: instruction. [LM]: mentions. [LR]: relations. [LC]: classifications. [TL]: text that connects with schema labels. [TP]: extractive MRC and QA texts without schema labels. [B]: the background text in the classification task.

Table 2 :
Pretraining dataset statistics. ♣ Classification tasks contain multi-choice MRC datasets. ♡ MRC stands for both extractive QA and extractive MRC datasets.

Table 3 :
Results on 13 IE benchmarks (ACE-Tgg and ACE-Arg are from the same dataset with different evaluation metrics). PT is the abbreviation of pretraining, and Inst. denotes the task instruction. Tgg and Arg in Event Extraction refer to Trigger (Event Detection) and Argument (Event Argument Extraction), respectively.

Table 4 :
Results on multi-span and n-ary information extraction tasks.

Table 5 :
Few-shot results on IE tasks. These datasets are not included in the pretraining phase of Mirror.

Table 6 :
Zero-shot results on 7 NER datasets. Results of Davinci and ChatGPT are derived from Wang et al. (2023). Mirror direct is the pretrained Mirror w/ Inst., while these datasets are not included in the pretraining phase.

Table 7 :
Results on MRC and classification tasks. We list Mirror performance on the SQuAD 2.0 development set and the GLUE development sets. Baseline results are derived from He et al. (2021). Because the SQuAD v2 and GLUE datasets are included in Mirror pretraining for 3 epochs, we directly make inferences with the pretrained model (noted as Mirror direct, the same model used in zero-shot NER) and do not perform further fine-tuning, while the other baselines are fine-tuned on the full dataset of every single task.


Table 8 :
Results on different label span types. This experiment is conducted on the CoNLL03 dataset w/o pretraining.

Table 9 :
Ablation study on the pretraining data. We evaluate the pretrained Mirror direct without further fine-tuning.

Table 10 :
Inference speed (instances per second) test on CoNLL03 validation set.

Table 11 :
Upper bound of different string matching strategies on NER.

Table 12 :
Upper bound of relation extraction with Mirror and TANL position indexing strategies.

Table 14 :
Pretraining data statistics on classification.

Table 15 :
Pretraining data statistics on EE. Due to the scarcity of EE datasets, we sample all the instances (N_max = ∞).

Table 16 :
Pretraining data statistics on NER. The maximal sampling number N_max for each dataset is 20,000.

Table 17 :
Pretraining data statistics on RE. The maximal sampling number N_max for each dataset is 20,000.

Table 18 :
Pretraining data statistics on MRC. The maximal sampling number N_max for each dataset is 20,000.

Table 19 :
Data statistics on downstream tasks. Included in PT stands for whether the dataset is included in the pretraining corpus. Mirror direct is the model trained on the pretraining corpus.

Table 20 :
Case results obtained by the pretrained Mirror w/ Inst.
Input: Mirror Mirror on the wall, who's the fairest of them all? [LC] Evil Queen [LC] Snow White → Output: [LC] Snow White
Input: Mirror Mirror on the wall, who's the fairest of them all? [TP] Evil Queen is jealous of Snow White's beauty.
Input: Mirror Mirror, please help me extract all the model names. [LM] model name [TL] LLaMA and OPT are open-sourced large language models. → Output: [LM] LLaMA, [LM] OPT
Input: Mirror Mirror, please help me extract the entity relationship triplet. [LR] break up [TL] The drama surrounding the high-profile divorce between Hollywood actors Johnny Depp and Amber Heard appears to be over as the couple reportedly reached an amicable settlement. → Output: ([LR] break up, Amber Heard, Johnny Depp)