Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction

Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.

Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. First, an event record contains event type, trigger, and arguments, which
[Figure 1: The framework of TEXT2EVENT. Here, TEXT2EVENT takes the raw text "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters." as input and generates a Transport event and an Arrest-Jail event.]

form a table-like structure, and different event types have different structures. For example, in Figure 1, Transport and Arrest-Jail have entirely different structures. Second, an event can be expressed using very different utterances, such as diversified trigger words and heterogeneous syntactic structures. For example, both "the dismission of the man" and "the man departed his job" express the same event record {Type: End-Position, Arg1 Role: PERSON, Arg1: the man}. Currently, most event extraction methods employ a decomposition strategy (Chen et al., 2015; Nguyen and Nguyen, 2019; Wadden et al., 2019; Zhang et al., 2019b; Du and Cardie, 2020; Paolini et al., 2021), i.e., decomposing the prediction of complex event structures into multiple separate subtasks (mostly entity recognition, trigger detection, and argument classification), and then composing the outputs of the different subtasks to predict the whole event structure (e.g., via pipeline modeling, joint modeling, or joint inference). The main drawbacks of these decomposition-based methods are: (1) They need massive, fine-grained annotations for the different subtasks, which often results in a data inefficiency problem. For example, they need separate fine-grained annotations for Transport trigger detection, for Person entity recognition, for Transport.Artifact argument classification, etc. (2) It is very challenging to manually design the optimal composition architecture of the different subtasks. For instance, pipeline models often suffer from error propagation, while joint models need to heuristically predefine the information sharing and decision dependencies between trigger detection, argument classification, and entity recognition, often resulting in suboptimal and inflexible architectures.
In this paper, we propose a sequence-to-structure generation paradigm for event extraction, TEXT2EVENT, which can directly extract events from text in an end-to-end manner. Specifically, instead of decomposing event structure prediction into different subtasks and predicting labels, we uniformly model the whole event extraction process in a neural network-based sequence-to-structure architecture, and all triggers, arguments, and their labels are universally generated as natural language words. For example, we generate a subsequence "Attack fire" for trigger extraction, where both "Attack" and "fire" are treated as natural language words. Compared with previous methods, our method is more data-efficient: it can be learned using only coarse parallel text-record annotations, i.e., ⟨sentence, event records⟩ pairs, rather than fine-grained token-level annotations. Besides, the uniform architecture makes it easy to model, learn, and exploit the interactions between different underlying predictions, and knowledge can be seamlessly shared and transferred between different components.
Furthermore, we design two algorithms for effective sequence-to-structure event extraction. First, we propose a constrained decoding algorithm, which guides the generation process using event schemas. In this way, event knowledge can be injected and exploited on the fly during inference. Second, we design a curriculum learning algorithm, which starts with current pre-trained language models (PLMs), then trains them on simple event substructure generation tasks such as trigger generation and independent argument generation, and finally trains the model on the full event structure generation task.
We conducted experiments on the ACE and ERE datasets, and the results verify the effectiveness of TEXT2EVENT in both supervised learning and transfer learning settings. In summary, our contributions are as follows: 1. We propose a new paradigm for event extraction, sequence-to-structure generation, which can directly extract events from text in an end-to-end manner. By uniformly modeling all tasks in a single model and universally predicting different labels, our method is effective, data-efficient, and easy to implement.
2. We design an effective sequence-to-structure architecture, which is enhanced with a constrained decoding algorithm for event knowledge injection during inference and a curriculum learning algorithm for efficient model learning.
3. Many information extraction tasks can be formulated as structure prediction tasks. Our sequence-to-structure method can motivate the learning of other information extraction models.

TEXT2EVENT: End-to-end Event Extraction as Controllable Generation
Given the token sequence x = x_1, ..., x_{|x|} of the input text, TEXT2EVENT directly generates the event structures E = e_1, ..., e_{|E|} via an encoder-decoder architecture. For example, in Figure 1, TEXT2EVENT takes the raw text as input and outputs two event records: {Type: Transport, Trigger: returned, Arg1 Role: Artifact, Arg1: The man, ...} and {Type: Arrest-Jail, Trigger: capture, ..., Arg2 Role: Agent, Arg2: bounty hunters, ...}. For end-to-end event extraction, TEXT2EVENT first encodes the input text, then generates the linearized structure using the constrained decoding algorithm. In the following, we first introduce how to reformulate event extraction as structure generation via structure linearization, then describe the sequence-to-structure model and the constrained decoding algorithm.
[Figure 2: Examples of three event representations. The panels show the example sentence "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters." converted into event records (a Transport event with Trigger "returned", Artifact "The man", and Origin "Mexico", and an Arrest-Jail event with Trigger "capture" and Person "The man"), a labeled tree, and a linearized form. The red solid line indicates the event-role relation; the blue dotted line indicates the label-span relation, where the head is a label and the tail is a text span. For example, "Transport - returned" is a label-span relation edge whose head is "Transport" and whose tail is "returned".]

Event Extraction as Structure Generation
This section describes how to linearize the event structure so that events can be generated in an end-to-end manner. Specifically, the linearized event representations should: (1) be able to express multiple event records in a text as one expression; (2) be easy to reversibly convert to event records in a deterministic way; (3) be similar to the token sequences of general text generation tasks so that text generation models can be leveraged and transferred easily.
Concretely, the process of converting from the record format to the linearized format is shown in Figure 2. We first convert event records (Figure 2a) into a labeled tree (Figure 2b) by: 1) labeling the root of the tree with the event type (Root - Transport, Root - Arrest-Jail), 2) connecting the event argument role types to the event types (Transport - Artifact, Transport - Origin, etc.), and 3) linking text spans from the raw text to the corresponding nodes as leaves (Transport - returned, Transport - Origin - Mexico, Transport - Artifact - The man, etc.). Given the converted event tree, we linearize it into a token sequence (Figure 2c) via depth-first traversal (Vinyals et al., 2015), where "(" and ")" are structure indicators used to represent the semantic structure of the linear expression. The traversal order within the same depth is the order in which the text spans appear in the text, e.g., first "returned" then "capture" in Figure 2b. Note that each linearized form has a virtual root, Root. For a sentence that contains multiple event records, each event links to Root directly. For a sentence that doesn't express any event, its tree format is linearized as "()".
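The linearization above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code: the dict-based record format and the function names are our own assumptions, and the real linearizer works on tokenized trees.

```python
# Hypothetical sketch of the linearization in Figure 2, assuming a simple
# dict-based event record format (field names are illustrative).

def linearize_event(record):
    """Convert one event record into '(Type Trigger (Role Arg) ...)'."""
    parts = [record["type"], record["trigger"]]
    for role, span in record["args"]:
        parts.append("(" + role + " " + span + ")")
    return "(" + " ".join(parts) + ")"

def linearize(records):
    """Linearize all events of a sentence under a virtual Root node."""
    if not records:  # a sentence with no event becomes "()"
        return "()"
    return "(" + " ".join(linearize_event(r) for r in records) + ")"

transport = {"type": "Transport", "trigger": "returned",
             "args": [("Artifact", "The man"), ("Origin", "Mexico")]}
arrest = {"type": "Arrest-Jail", "trigger": "capture",
          "args": [("Person", "The man"), ("Agent", "bounty hunters")]}
print(linearize([transport, arrest]))
# ((Transport returned (Artifact The man) (Origin Mexico)) (Arrest-Jail capture (Person The man) (Agent bounty hunters)))
```

Because the mapping is deterministic and parenthesis-balanced, it can be inverted back to event records, which is requirement (2) above.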

Sequence-to-Structure Network
Based on the above linearization strategy, TEXT2EVENT generates the event structure via a transformer-based encoder-decoder architecture (Vaswani et al., 2017). Given the token sequence x = x_1, ..., x_{|x|} as input, TEXT2EVENT outputs the linearized event representation y = y_1, ..., y_{|y|}. To this end, TEXT2EVENT first computes the hidden vector representation H = h_1, ..., h_{|x|} of the input via a multi-layer transformer encoder:

H = Encoder(x_1, ..., x_{|x|}),    (1)

where each layer of Encoder(·) is a transformer block with the multi-head attention mechanism. After the input token sequence is encoded, the decoder predicts the output structure token by token using the input tokens' hidden vectors. At step i of the generation, the self-attention decoder predicts the i-th token y_i of the linearized form and the decoder state h^d_i as:

y_i, h^d_i = Decoder([H; h^d_1, ..., h^d_{i-1}], y_{i-1}),    (2)

where each layer of Decoder(·) is a transformer block that contains self-attention over the decoder states and cross-attention over the encoder states H. The generated output sequence starts with the start token ⟨bos⟩ and ends with the end token ⟨eos⟩. The conditional probability of the whole output sequence p(y|x) is the product of the per-step probabilities p(y_i | y_{<i}, x):

p(y|x) = ∏_{i=1}^{|y|} p(y_i | y_{<i}, x),    (3)

where y_{<i} = y_1 ... y_{i-1}, and p(y_i | y_{<i}, x) is the probability over the target vocabulary V normalized by softmax(·). Because all tokens in the linearized event representations are natural language words, we adopt the pre-trained language model T5 (Raffel et al., 2020) as our transformer-based encoder-decoder architecture. In this way, general text generation knowledge can be directly reused.
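As a numeric sanity check of this factorization, the sketch below scores a toy output sequence step by step. It is a plain-Python stand-in of our own: in the real model each step's logits come from the transformer decoder, not from a hand-written list.

```python
import math

def softmax(logits):
    """Normalize a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sequence_log_prob(step_logits, target_ids):
    """log p(y|x) = sum_i log p(y_i | y_<i, x), where step_logits[i]
    stands in for the decoder's logits over the vocabulary at step i."""
    total = 0.0
    for logits, y_i in zip(step_logits, target_ids):
        total += math.log(softmax(logits)[y_i])
    return total

# Two decoding steps over a 2-token vocabulary with uniform logits:
# each step contributes log(0.5), so log p(y|x) = 2 * log(0.5).
lp = sequence_log_prob([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```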

Constrained Decoding
Given the hidden sequence H, the sequence-to-structure network needs to generate the linearized event representation token by token. One straightforward solution is a greedy decoding algorithm, which selects the token with the highest predicted probability p(y_i | y_{<i}, x) at each decoding step i. Unfortunately, greedy decoding cannot guarantee the generation of valid event structures: it may end up with invalid event types, argument-type mismatches, or incomplete structures. Furthermore, greedy decoding ignores useful event schema knowledge, which can effectively guide the decoding. For example, we can constrain the model to only generate event type tokens at the type position.
To exploit the event schema knowledge, we propose to employ a trie-based constrained decoding algorithm (Chen et al., 2020a; Cao et al., 2021) for event generation. During constrained decoding, the event schema knowledge is injected as a prompt to the decoder, ensuring the generation of valid event structures.
Concretely, unlike the greedy decoding algorithm that selects a token from the whole target vocabulary V at each step, our trie-based constrained decoding method dynamically chooses and prunes a candidate vocabulary V based on the currently generated state. A complete decoding of a linearized form can be represented as a search over a trie, as shown in Figure 3a. Specifically, each generation step of TEXT2EVENT has three kinds of candidate vocabulary V:
• Event schema: label names of event types T and argument roles R;
• Mention strings: event trigger words and argument mentions S, which are text spans in the raw input;
• Structure indicators: "(" and ")", which are used to combine event schemas and mention strings.
The decoding starts from the root ⟨bos⟩ and ends at the terminator ⟨eos⟩. At generation step i, the candidate vocabulary V consists of the child nodes of the last generated node. For instance, at the generation step with the generated string "⟨bos⟩ (", the candidate vocabulary V is {"(", ")"} in Figure 3a. When generating an event type name in T, an argument role name in R, or a text span in S, the decoding process can be considered as a search over a subtree of the trie. For example, in Figure 3b, the candidate vocabulary V for "( Transfer" is {"Ownership", "Money"}. Finally, the decoder's output is transformed into event records and used as the final extraction result.
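A minimal sketch of the trie lookup follows. This is our own illustration, not the authors' implementation: the trie maps each generated prefix to the tokens that may legally follow it, so the decoder only scores those candidates at each step (token granularity in the real system follows the T5 subword vocabulary).

```python
# Build a trie over legal token sequences and query the legal continuations
# of a prefix (illustrative; names are hypothetical).

def build_trie(sequences):
    """Nested-dict trie: each key is a token, each value the subtrie."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def candidates(trie, prefix):
    """Tokens allowed to follow the given prefix, per the trie."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []          # prefix is not legal at all
        node = node[tok]
    return sorted(node)

# Toy schema: event type names resembling Figure 3's Transfer-* example.
type_names = [["Transfer", "Ownership"], ["Transfer", "Money"], ["Attack"]]
trie = build_trie(type_names)
print(candidates(trie, ["Transfer"]))  # ['Money', 'Ownership']
```

In the decoder loop, the probabilities of all tokens outside `candidates(trie, prefix)` would be masked to zero before the argmax, which is what prevents invalid event types from ever being emitted.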

Learning
This section describes how to learn the TEXT2EVENT neural network in an end-to-end manner. Our method can be learned using only coarse parallel text-record annotations, i.e., ⟨sentence, event records⟩ pairs, with no need for the fine-grained token-level annotations used in traditional methods. Given a training dataset D = {(x_1, y_1), ..., (x_{|D|}, y_{|D|})}, where each instance is a ⟨sentence, event records⟩ pair, the learning objective is the negative log-likelihood:

L(θ) = - ∑_{(x,y) ∈ D} log p(y | x; θ),    (4)

where θ denotes the model parameters.
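Given the per-step probabilities p(y_i | y_{<i}, x), the negative log-likelihood objective above reduces to summing the negative log-probabilities of the gold tokens over the training set. A toy sketch of our own, with hand-picked probabilities standing in for model outputs:

```python
import math

def negative_log_likelihood(gold_token_probs):
    """L(theta) = - sum over instances of sum_i log p(y_i | y_<i, x).
    Each inner list holds the model's probability of the gold token at
    one decoding step of one training instance."""
    loss = 0.0
    for step_probs in gold_token_probs:
        for p in step_probs:
            loss -= math.log(p)
    return loss

# One training instance, two steps, both gold tokens at probability 0.5:
# the loss is -2 * log(0.5) = 2 * log(2).
loss = negative_log_likelihood([[0.5, 0.5]])
```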
Unfortunately, unlike general text-to-text generation models, the learning of sequence-to-structure generation models is more challenging: 1) There is an output gap between the event generation model and the text-to-text generation model. Compared with natural word sequences, the linearized event structure contains many non-semantic indicators such as "(" and ")", and they don't follow the syntax constraints of natural language sentences. 2) The non-semantic indicators "(" and ")" appear very frequently but contain little semantic information, which will mislead the learning process.
To address the above challenges, we employ a curriculum learning strategy (Bengio et al., 2009). Specifically, we first train PLMs on simple event substructure generation tasks so that they do not overfit to the non-semantic indicators; then we train the model on the full event structure generation task.
Substructure Learning. Because event representations often have complex structures and their token sequences differ from natural language word sequences, it is challenging to train the model directly on the full sequence generation task. Therefore, we first train TEXT2EVENT on simple event substructures.
Specifically, we learn our model by starting from generating only "(label, span)" substructures, including "(type, trigger words)" and "(role, argument words)" substructures. For example, the substructures extracted from Figure 2c at this stage include (Transport returned), (Artifact The man), (Arrest-Jail capture), etc. We construct a ⟨sentence, substructures⟩ pair for each set of extracted substructures, then train our model using the loss in Equation 4.
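The substructure extraction for this first curriculum stage can be sketched as below. This is our own illustration under the assumption of whitespace-separated tokens; it collects every (label, span) pair that directly follows an opening parenthesis, covering both (type, trigger) and (role, argument) substructures.

```python
import re

def extract_substructures(linearized):
    """Collect every (label, span) pair that directly follows '(',
    e.g. (Transport returned) and (Artifact The man)."""
    tokens = re.findall(r"\(|\)|[^()\s]+", linearized)
    pairs = []
    for i, tok in enumerate(tokens):
        if tok == "(" and i + 1 < len(tokens) and tokens[i + 1] not in "()":
            label = tokens[i + 1]
            span = []
            j = i + 2
            while j < len(tokens) and tokens[j] not in "()":
                span.append(tokens[j])
                j += 1
            if span:  # keep only label-span pairs, skip bare labels
                pairs.append((label, " ".join(span)))
    return pairs

form = "((Transport returned (Artifact The man) (Origin Mexico)))"
print(extract_substructures(form))
# [('Transport', 'returned'), ('Artifact', 'The man'), ('Origin', 'Mexico')]
```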
Full Structure Learning. After the substructure learning stage, we further train our model on the full structure generation task using the loss in Equation 4. We found that this curriculum learning strategy uses the data annotation more efficiently and makes the learning process smoother.

Experiments
This section evaluates the proposed TEXT2EVENT model by conducting experiments in both supervised learning and transfer learning settings.

Experimental Settings
Datasets. We conducted experiments on the event extraction benchmark ACE2005 (Walker et al., 2006), which has 599 annotated English documents and 33 event types. We used the same split and preprocessing steps as previous work (Zhang et al., 2019b; Wadden et al., 2019; Du and Cardie, 2020), and we denote it as ACE05-EN. In addition to ACE05-EN, we also conducted experiments on two other benchmarks, ACE05-EN+ and ERE-EN, using the same split and preprocessing steps as previous work (Lin et al., 2020). Compared to ACE05-EN, ACE05-EN+ and ERE-EN further consider pronoun roles and multi-token event triggers. ERE-EN contains 38 event categories and 458 documents.
Statistics of all datasets are shown in Table 1. For evaluation, we used the same criteria as previous work (Zhang et al., 2019b; Wadden et al., 2019; Lin et al., 2020). Since TEXT2EVENT is a text generation model, we reconstructed the offsets of predicted trigger mentions by finding the matched utterances in the input sequence one by one. For argument mentions, we used the matched utterance nearest to the predicted trigger mention as the predicted offset.
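This offset-reconstruction step can be sketched as follows. The helper names are hypothetical and the matching is over character offsets for simplicity; the actual evaluation uses token offsets, but the "nearest occurrence to the trigger" heuristic is the same.

```python
# Illustrative sketch of recovering offsets for generated mentions
# (our own simplification; names are hypothetical).

def all_offsets(text, mention):
    """Every start offset where the mention occurs in the text."""
    offsets, start = [], text.find(mention)
    while start != -1:
        offsets.append(start)
        start = text.find(mention, start + 1)
    return offsets

def argument_offset(text, mention, trigger_offset):
    """Pick the occurrence of an argument mention nearest to its trigger."""
    occurrences = all_offsets(text, mention)
    if not occurrences:
        return None  # generated mention does not appear in the input
    return min(occurrences, key=lambda o: abs(o - trigger_offset))

text = "The man returned to Los Angeles from Mexico following his capture."
trig = all_offsets(text, "capture")[0]
print(argument_offset(text, "The man", trig))  # 0 (the only occurrence)
```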
Baselines. Currently, event extraction supervision can be conducted at two different levels: 1) token-level annotation, which labels each token in a sentence with event labels, e.g., "The/O dismission/B-End-Position of/O ..."; 2) parallel text-record annotation, which only gives ⟨sentence, event⟩ pairs without expensive token-level annotations, e.g., ⟨The dismission of ..., {Type: End-Position, Trigger: dismission, ...}⟩. Furthermore, some previous works also leverage golden entity annotation for model training, which marks all entity mentions with their golden types, to facilitate event extraction. Introducing more supervision benefits event extraction but is more label-intensive. The proposed TEXT2EVENT only uses parallel text-record annotation, which makes it more practical for real-world applications.
To verify TEXT2EVENT, we compare our method with the following groups of baselines: 1. Baselines using token annotation: TANL is the SOTA sequence generation-based method, which models event extraction in a trigger-argument pipeline manner (Paolini et al., 2021); Multi-task TANL extends TANL by transferring structure knowledge from other tasks; EEQA (Du and Cardie, 2020) and MQAEE are QA-based models which use machine reading comprehension models for trigger detection and argument extraction. 2. Baselines using both token annotation and entity annotation: Joint3EE is a joint entity, trigger, and argument extraction model based on shared hidden representations (Nguyen and Nguyen, 2019); DYGIE++ is a BERT-based model which captures both within-sentence and cross-sentence context (Wadden et al., 2019); GAIL is an inverse reinforcement learning-based joint entity and event extraction model (Zhang et al., 2019b); OneIE is an end-to-end IE system which employs global features and beam search to extract globally optimal event structures (Lin et al., 2020).
Implementations. We optimized our model using label smoothing (Szegedy et al., 2016; Müller et al., 2019) and AdamW (Loshchilov and Hutter, 2019) with a learning rate of 5e-5 for T5-large and 1e-4 for T5-base. For curriculum learning, we ran 5 epochs of substructure learning and 30 epochs of full structure learning. We conducted each experiment on a single NVIDIA GeForce RTX 3090 24GB. Due to GPU memory limitations, we used different batch sizes for different models: 8 for T5-large and 16 for T5-base; and we truncated the max length of the raw text to 256 and of the linearized form to 128 during training. We added the task name as a prefix, following the T5 default setup.

Results in Supervised Learning Setting

Table 2 presents the performance of all baselines and TEXT2EVENT on ACE05-EN, and Table 3 shows the performance of the SOTA and TEXT2EVENT on ACE05-EN+ and ERE-EN. We can see that:
1) By uniformly modeling all tasks in a single model and predicting labels universally, TEXT2EVENT can achieve competitive performance with weaker supervision and a simpler architecture. Our method, using only the weak parallel text-record annotations, surpasses most of the baselines that use token and entity annotations and achieves competitive performance with the SOTA. Furthermore, using a simple encoder-decoder architecture, TEXT2EVENT outperforms most of its counterparts with complicated architectures. 2) By directly generating event structures from text, TEXT2EVENT significantly outperforms sequence generation-based methods. Our method improves Arg-C F1 by 4.6% and 2.7% over the SOTA generation baseline TANL and its extended multi-task version, respectively. Compared with sequence generation, structure generation can be effectively guided by event schema knowledge during inference, and there is no need to generate irrelevant information.
3) By uniformly modeling and sharing information between different tasks and labels, the sequence-to-structure framework achieves robust performance. From Table 2 and Table 3, we can see that the performance of OneIE decreases on the harder dataset ACE05-EN+, which has more pronoun roles and multi-token triggers. By contrast, the performance of TEXT2EVENT on ACE05-EN+ remains nearly the same as on ACE05-EN. We believe this may be because the proposed sequence-to-structure model is a universal model that doesn't specialize in particular labels and can better share information between different labels.

Results in Transfer Learning Setting
TEXT2EVENT is a universal model and can therefore facilitate knowledge transfer between different labels. To verify the transfer ability of TEXT2EVENT, we conducted experiments in the transfer learning setting; the results are shown in Table 4. Specifically, we first randomly split the sentences whose length is larger than 8 in ACE05-EN+ into two equal-sized subsets, src and tgt: src only retains the annotations of the top 10 most frequent event types, and tgt only retains the annotations of the remaining 23 event types. For both src and tgt, we use 80% of the dataset for model training and the rest for evaluation. From Table 4, we can see that: 1) The data-efficient TEXT2EVENT makes better use of supervision signals. Even when trained on tgt from scratch, the proposed method outperforms strong baselines. We believe that this may be because baselines using token and entity annotations require massive fine-grained data for model learning. Different from these baselines, TEXT2EVENT uniformly models all subtasks, so knowledge can be seamlessly transferred, which is more data-efficient.
2) TEXT2EVENT can effectively transfer knowledge between different labels. Compared with the non-transfer setting, which trains directly on the tgt training set, the transfer setting of TEXT2EVENT achieves significant F1 improvements of 3.7 and 3.2 on Trig-C and Arg-C, respectively. By contrast, the other two baselines cannot obtain significant F1 improvements on both Trig-C and Arg-C via transfer learning. Note that the information in the entity annotations is shared across src and tgt. As a result, OneIE can leverage this information to achieve better argument prediction even with worse trigger prediction. However, even without using entity annotations, the proposed method still achieves a similar improvement in the transfer learning setting. This is because labels are provided universally in TEXT2EVENT, so the parameters are not label-specific.

Detailed Analysis
This section analyzes the effects of event schema knowledge, constrained decoding, and the curriculum learning algorithm in TEXT2EVENT. We designed four ablated variants based on T5-base:
• "TEXT2EVENT" is the base model, directly trained with full structure learning.
• "+ CL" indicates training TEXT2EVENT with the proposed curriculum learning algorithm.
• "w/o CD" discards the constrained decoding during inference and generates event structures as an unconstrained generation model.
• "w/o ES" replaces the names of event types and roles with meaningless symbols, which is used to verify the effect of event schema knowledge.

Table 5 shows the results on the development set of ACE05-EN using different training data sizes. We can see that: 1) Constrained decoding can effectively guide the generation with event schemas, especially in low-resource settings. Compared to "w/o CD", constrained decoding improves the performance of TEXT2EVENT, especially in low-resource scenarios, e.g., when using 1% or 5% of the training set. 2) Curriculum learning is useful for model learning. Substructure learning improves Trig-C F1 by 4.7% and Arg-C F1 by 5.8% on average. 3) It is crucial to encode and generate event labels as words rather than meaningless symbols, because by encoding labels as natural language words, our method can effectively transfer knowledge from pre-trained language models.

Related Work
Our work is a synthesis of two research directions: event extraction and structure prediction via neural generation models. Event extraction has received widespread attention in recent years, and mainstream methods usually use different strategies to obtain a complete event structure. These methods can be divided into: 1) pipeline classification (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011, 2018; Huang and Riloff, 2012; Chen et al., 2015; Sha et al., 2016; Lin et al., 2018; Yang et al., 2019; Ma et al., 2020; Zhang et al., 2020c), 2) multi-task joint models (McClosky et al., 2011; Li et al., 2013, 2014; Yang and Mitchell, 2016; Nguyen et al., 2016; Zhang et al., 2019a; Zheng et al., 2019), 3) semantic structure grounding (Huang et al., 2016; Zhang et al., 2020a), and 4) question answering (Chen et al., 2020b; Du and Cardie, 2020). Compared with previous methods, we model all subtasks of event extraction in a uniform sequence-to-structure framework, which leads to better decision interactions and information sharing. The neural encoder-decoder generation architecture (Sutskever et al., 2014; Bahdanau et al., 2015) has shown strong structure prediction ability and has been widely used in many NLP tasks, such as machine translation (Kalchbrenner and Blunsom, 2013), semantic parsing (Dong and Lapata, 2016; Song et al., 2020), entity extraction (Straková et al., 2019), relation extraction (Zeng et al., 2018; Zhang et al., 2020b), and aspect term extraction (Ma et al., 2019). Like TEXT2EVENT in this paper, TANL (Paolini et al., 2021) and GRIT (Du et al., 2021) also employ neural generation models for event extraction, but they focus on sequence generation rather than structure generation.
Different from previous works that extract text spans via labeling (Straková et al., 2019) or a copy/pointer mechanism (Zeng et al., 2018; Du et al., 2021), TEXT2EVENT directly generates event schemas and text spans to form event records via constrained decoding (Cao et al., 2021; Chen et al., 2020a), which allows TEXT2EVENT to handle various event types and transfer to new types easily.

Conclusions
In this paper, we propose TEXT2EVENT, a sequence-to-structure generation paradigm for event extraction. TEXT2EVENT directly learns from parallel text-record annotation and uniformly models all subtasks of event extraction in a sequence-to-structure framework. Concretely, we propose an effective sequence-to-structure network for event extraction, which is further enhanced by a constrained decoding algorithm for event knowledge injection during inference and a curriculum learning algorithm for efficient model learning. Experimental results in supervised learning and transfer learning settings show that TEXT2EVENT can achieve competitive performance with the previous SOTA using only coarse text-record annotation.
For future work, we plan to adapt our method to other information extraction tasks, such as N-ary relation extraction.