Learning Constraints and Descriptive Segmentation for Subevent Detection

Event mentions in text correspond to real-world events of varying degrees of granularity. The task of subevent detection aims to resolve this granularity issue, recognizing the membership of multi-granular events in event complexes. Since knowing the span of descriptive contexts of event complexes helps infer the membership of events, we propose the task of event-based text segmentation (EventSeg) as an auxiliary task to improve the learning for subevent detection. To bridge the two tasks together, we propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction, as well as guiding the model to make globally consistent inference. Specifically, we adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model. Experimental results show that the proposed method outperforms baseline methods by 2.3% and 2.5% on benchmark datasets for subevent detection, HiEve and IC, respectively, while achieving a decent performance on EventSeg prediction.


Introduction
Since real-world events are frequently conveyed in human languages, understanding their linguistic counterparts, i.e. event mentions in text, is of vital importance to natural language understanding (NLU). One key challenge to understanding event mentions is that they refer to real-world events with varied granularity and form event complexes. For example, when speaking of a coarse-grained event "publishing a paper," it can involve a complex of more fine-grained events such as "writing the paper," "passing the peer review," and "presenting at the conference." Naturally, understanding events requires resolving the granularity of events and inferring their memberships, which corresponds to the task of subevent detection (a.k.a. event hierarchy extraction). Practically, subevent detection is a key component of event-centric NLU (Chen et al., 2021), and is beneficial to various applications, such as schema induction (Li et al., 2020a), task-oriented dialogue agents (Andreas et al., 2020), summarization (Zhao et al., 2020), and risk detection (Pohl et al., 2012).

Figure 1: An example of PARENT-CHILD relations and EVENTSEGs from the HiEve dataset. The blue and yellow segments denote the textual spans of event complexes "posted" and "scandal" respectively. Curved arrows denote PARENT-CHILD relations within a text segment, whereas the dotted arrows denote cross-segment PARENT-CHILD relations.
As a significant step towards inducing event complexes (graphs that recognize the relationships of multi-granular events) in documents, subevent detection has started to receive attention recently (Han et al., 2021). It is natural to perceive that a document may contain several different event complexes, which often span different descriptive contexts that form relatively independent text segments. Consider the example in Fig. 1, where the two membership relations in the event complex (the graph consisting of "scandal (e7)," "charges (e6)," "ousting (e8)," and their relations) are both within the segment marked in yellow that describes the event complex. As can be seen in the paragraph, though we cannot deny the existence of cross-segment subevent relations (dotted arrows), events belonging to the same membership are much more likely to co-occur in a text segment. This correlation has been overlooked by existing data-driven methods (Zhou et al., 2020; Yao et al., 2020), which formulate subevent detection as pairwise relation extraction. On the other hand, while prior studies have demonstrated the benefits of incorporating logical constraints among event memberships and other relations (such as coreference), the constraints between the memberships and event co-occurrences in text segments remain uncertain. Hence, how to effectively learn and enforce hard-to-articulate constraints, as in the case of subevent detection and segmentation of text, is another challenge.
Our first contribution is to improve subevent detection based on an auxiliary task of EVENTSEG prediction. By EVENTSEG prediction, we seek to segment a document into descriptive contexts of different event complexes. Evidently, with EVENTSEG information, it would be relatively easy to infer the memberships of events in the same descriptive context. Using annotations for subevent detection and EVENTSEG prediction, we aim to adopt a neural model to jointly learn these two tasks along with the (soft) logical constraints that bridge their labels together. In this way, we incorporate linear discourse structure of segments into membership relation extraction, avoiding complicated feature engineering in the previous work (Aldawsari and Finlayson, 2019). From the learning perspective, adding EVENTSEG prediction as an auxiliary task seeks to provide effective incidental supervision signals (Roth, 2017) to the subevent detection task. This is especially important in the current scenario where annotated learning resources for subevents are typically limited (Hovy et al., 2013;O'Gorman et al., 2016).
To capture the logical dependency between subevent structures and EVENTSEG, our second contribution is an approach to automatically learning and enforcing logical constraints. Motivated by Pan et al. (2020), we use Rectifier Networks to learn constraints in the form of linear inequalities, and then convert the constraints to a regularization term that can be incorporated into the loss function of the neural model. This allows any hard-to-articulate constraints to be automatically captured for interrelated tasks, and efficiently guides the model to make globally consistent inference. By learning and enforcing task-specific constraints for subevent relations, the proposed method achieves comparable results with SOTA subevent detection methods on the HiEve and IC datasets. Moreover, by jointly learning with EVENTSEG prediction, the proposed method surpasses previous methods on subevent detection by 2.3% and 2.5% relative in F1 on HiEve and IC, while achieving decent results on EVENTSEG prediction.

Related Work
We discuss three lines of relevant research.

Subevent Detection. Several approaches to extracting membership relations have been proposed, which mainly fall into two categories: statistical learning methods and data-driven methods. Statistical learning methods (Araki et al., 2014; Aldawsari and Finlayson, 2019) collect a variety of features before feeding them into classifiers for pairwise decisions. Nevertheless, the features often require costly human effort to obtain, and are often dataset-specific. Data-driven methods, on the other hand, automatically characterize events with neural language models like BERT (Devlin et al., 2019), and can simultaneously incorporate various signals such as event time duration (Zhou et al., 2020), joint constraints with event temporal relations, and subevent knowledge (Yao et al., 2020). Among recent methods, only Aldawsari and Finlayson (2019) utilize discourse features like discourse relations between elementary discourse units, but document-level segmentation signals are still not incorporated into the task of subevent detection. In fact, research on event-centric NLU (Chen et al., 2021) has witnessed the usage of document-level discourse relations: different functional discourse structures around the main event in news articles have been studied in Choubey et al. (2020). Hence, we attempt to capture the interdependencies between subevent detection and segmentation of text, in order to enhance the model performance for event hierarchy extraction.
Text Segmentation. Early studies in this line concentrated on unsupervised text segmentation, quantifying lexical cohesion within small text segments (Choi, 2000), and unsupervised Bayesian approaches have also been successful in this task (Eisenstein and Barzilay, 2008; Eisenstein, 2009; Newman et al., 2012; Mota et al., 2019). Given that unsupervised algorithms are difficult to specialize for a particular domain, Koshorek et al. (2018) formulate the problem as a supervised learning task. Lukasik et al. (2020) follow this idea by using transformer-based architectures with cross-segment attention to achieve state-of-the-art performance. Focusing on creating logically coherent sub-document units, these prior works do not cover segmentation of text regarding descriptive contexts of event complexes, which is the focus of the auxiliary task in this work.
Learning with Constraints. In terms of enforcing declarative constraints in neural models, early efforts (Roth and Yih, 2004) formulate the inference process as Integer Linear Programming (ILP) problems. Pan et al. (2020) also employ ILP to enforce constraints learned automatically from Rectifier Networks with strong expressiveness (Pan and Srikumar, 2016). Yet the main drawback of solving an ILP problem is its inefficiency in a large feasible solution space. Recent work on integrating neural networks with structured outputs has emphasized the importance of the interaction between constraints and representations (Rocktäschel and Riedel, 2017; Niculae et al., 2018; Li et al., 2020b). However, there has been no automatic and efficient way to learn and enforce constraints that are not limited to first-order logic, e.g., linear inequalities learned via Rectifier Networks. This is the research focus of our paper.

Preliminaries
A document D consists of a collection of m sentences D = [s_1, s_2, · · · , s_m], and each sentence, say s_k, contains a sequence of tokens s_k = [w_1, w_2, · · · , w_n]. Some tokens in sentences belong to the set of annotated event triggers, i.e., E_D = {e_1, e_2, · · · , e_l}. Following the notation of Koshorek et al. (2018), a segmentation of document D is represented as a sequence of binary values Q_D = {q_1, q_2, · · · , q_{m−1}}, where q_i indicates whether sentence s_i is the end of a segment.
Subevent Detection is to identify membership relations between events, given event mentions in documents. In particular, R denotes the set of relation labels as defined in Hovy et al. (2013) (i.e., PARENT-CHILD, CHILD-PARENT, COREF, and NOREL). For a relation r ∈ R, we use a binary indicator Y^r_{i,j} to denote whether an event pair (e_i, e_j) has relation r, and use y^r_{i,j} to denote the model-predicted probability of the event pair (e_i, e_j) having relation r.
EventSeg prediction aims at finding an optimal segmentation of text that breaks the document into several groups of consecutive sentences, where each group is a descriptive context of an event complex. Differing from the traditional definition of text segmentation, EVENTSEG focuses on the change of event complex (which is not necessarily a change of topic). For a pair of events (e_i, e_j), we use a binary indicator Z_{i,j} to denote whether the two events are within the same descriptive context of an event complex, and z_{i,j} to denote the model-predicted probability of the two events belonging to the same segment. Details on how to obtain EVENTSEG annotations are described in §5.1.
Connections between Two Tasks. Statistically, through an analysis of the HiEve and IC corpora, PARENT-CHILD and CHILD-PARENT relations appear within the same descriptive context of an event complex with a probability of 65.13% (see Tab. 1). On the other hand, the probability for each of the two other non-membership relations (i.e., COREF and NOREL) to appear within the same segment approximately equals that of its appearance across segments. This demonstrates that subevent relations tend to appear within the same EVENTSEG. Since this is not an absolute logical constraint, we adopt an automatic way of modeling such constraints instead of manually inducing them, which is described in the next section.

Methods
We now present the framework for learning and enforcing constraints for the main task of subevent detection and the auxiliary EVENTSEG prediction. We start with learning the hard-to-articulate constraints ( §4.1), followed by details of joint learning ( §4.2) and inference ( §4.3) for the two tasks.

Learning Constraints
From the example shown in Fig. 1 we can construct an event graph G with all the events, membership relations, and EVENTSEG information. Fig. 2 shows a three-event subgraph of G. The goal of constraint learning is as follows: given membership relations Y^r_{i,j}, Y^r_{j,k} and segmentation information Z_{i,j}, Z_{j,k} about event pairs (e_i, e_j) and (e_j, e_k), we would like to determine whether a certain assignment of Y^r_{i,k} and Z_{i,k} is legitimate.

Feature Space for Constraints. We now define the feature space for constraint learning. Let X_p = {Y^r_p, r ∈ R} ∪ {Z_p} denote the set of features for an event pair p. Given features X_{i,j} and X_{j,k}, we would like to determine the value of X_{i,k}, yet the mapping from the labels of (e_i, e_j), (e_j, e_k) to the labels of (e_i, e_k) is a one-to-many relationship. For instance, if r = PARENT-CHILD, Y^r_{i,j} = Y^r_{j,k} = 1, and Z_{i,j} = Z_{j,k} = 0, then due to the transitivity of PARENT-CHILD, we should enforce Y^r_{i,k} = 1. Yet we cannot tell whether e_i and e_k are in the same EVENTSEG, i.e., both Z_{i,k} = 1 and Z_{i,k} = 0 could be legitimate. In other words, we actually want to determine the set of possible values of X_{i,k}, and thus we need to expand the constraint features to better capture the legitimacy of relationships. We employ the power set of X_{i,k}, P(X_{i,k}), as the new features for the event pair (e_i, e_k). A subgraph with three events e_i, e_j, and e_k can now be featurized as

X = X_{i,j} ∪ X_{j,k} ∪ P(X_{i,k}). (1)
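The featurization above can be sketched in code. This is a simplified reading, not the paper's exact encoding: each event pair contributes four relation indicators plus one same-segment bit, and the power-set features for (e_i, e_k) are rendered as one indicator per candidate (relation, same-segment) assignment that is in the legitimate set; the precise indexing and dimensionality in the paper may differ.

```python
from itertools import product

RELATIONS = ["PARENT-CHILD", "CHILD-PARENT", "COREF", "NOREL"]

def pair_features(relation, same_segment):
    """Features X_p for one event pair: 4 relation indicators + 1 segment bit."""
    feats = [1.0 if relation == r else 0.0 for r in RELATIONS]
    feats.append(1.0 if same_segment else 0.0)
    return feats

def subgraph_features(rel_ij, seg_ij, rel_jk, seg_jk, legal_ik):
    """
    Featurize a three-event subgraph (e_i, e_j, e_k): concrete labels for
    (e_i, e_j) and (e_j, e_k), plus one indicator per candidate
    (relation, same-segment) assignment of (e_i, e_k), marking whether that
    assignment belongs to the legitimate set `legal_ik`.
    """
    feats = pair_features(rel_ij, seg_ij) + pair_features(rel_jk, seg_jk)
    for r, z in product(RELATIONS, (False, True)):
        feats.append(1.0 if (r, z) in legal_ik else 0.0)
    return feats
```

For the transitivity example in the text, `legal_ik` would contain both ("PARENT-CHILD", True) and ("PARENT-CHILD", False), reflecting that the relation is forced while the segment membership is not.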

Constraint Learning with Rectifier Network.
When we construct three-event subgraphs from documents, a binary label t for structure legitimacy is created for each subgraph. Inspired by how constraints are learned for several structured prediction tasks (Pan et al., 2020), we represent the constraints for a given subgraph-label pair (X, t) as K linear inequalities.² Formally, t = 1 if X satisfies the constraints c_k for all k = 1, · · · , K, and the k-th constraint c_k is expressed by a linear inequality whose weights w_k and bias b_k are learned. Since a system of linear inequalities is proved to be equivalent to the Rectifier Network proposed in Pan et al. (2020), we adopt a two-layer rectifier network for learning constraints:

p = σ(1 − Σ_{k=1}^{K} ReLU(−(w_k · X + b_k))), (2)

where p denotes the probability of t = 1 and σ(·) denotes the sigmoid function. We train the parameters w_k and b_k of the rectifier network in a supervised setting. The positive examples are induced from subgraph structures that appear in the training corpus, while the negative examples are randomly sampled from the remaining possibilities that do not exist in the training corpus.

² Here we assume K constraints is an upper bound on the number of rules to be learned.

Figure 2: A legitimate structure for a three-event subgraph obtained from the example shown in Fig. 1, with three PARENT-CHILD edges among e_7, e_6, and e_2, one Same-Segment edge and two Different-Segment edges. The constraint features for the subgraph can be expressed by X = X_{7,6} ∪ X_{6,2} ∪ P(X_{7,2}), and the label t for this structure is 1.
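The constraint scorer can be sketched as a small rectifier network in NumPy. The ReLU-penalty form below follows the construction of Pan and Srikumar (2016) and Pan et al. (2020), though the exact output layer is an assumption here: each violated inequality contributes a positive penalty, so p approaches 1 only when all K learned inequalities w_k · X + b_k ≥ 0 hold.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class RectifierNet:
    """Two-layer rectifier network scoring whether a feature vector X
    satisfies K learned linear inequalities w_k . X + b_k >= 0."""

    def __init__(self, W, b):
        self.W = np.asarray(W, dtype=float)  # shape (K, d)
        self.b = np.asarray(b, dtype=float)  # shape (K,)

    def prob_legitimate(self, x):
        margins = self.W @ np.asarray(x, dtype=float) + self.b  # (K,)
        penalty = relu(-margins).sum()  # zero iff all constraints hold
        return sigmoid(1.0 - penalty)   # probability p of label t = 1
```

In training, the weights W and biases b would be learned from the positive and negative subgraph examples; here they are taken as given.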

Joint Task Learning
After learning the constraints using Rectifier Networks, we introduce how to jointly model membership relations and EVENTSEG with neural networks and how to integrate the learned constraints into the model. The model architecture is shown in Fig. 3.
Local Classifier. To characterize event pairs in documents, we employ a neural encoder, which obtains contextualized representations for event triggers from the pre-trained transformer-based language model RoBERTa (Liu et al., 2019). As the context of event pairs, the sentences where two event mentions appear are concatenated using [CLS] and [SEP]. We then calculate the elementwise average of subword-level contextual representations as the representation for each event trigger.
To obtain the event pair representation for (e_i, e_j), we concatenate the two contextual representations, together with their element-wise Hadamard product and subtraction. The event pair representation is then sent to a multi-layer perceptron (MLP) with |R| outputs for estimating the confidence score y^r_{i,j} for each relation r. To make EVENTSEG prediction an auxiliary task, the model also predicts whether two events belong to the same segment using a separate MLP with a single-value output z_{i,j}. In accordance with the learned constraints in §4.1, the model takes three pairs of events at a time. The annotation loss in Fig. 3 is a linear combination of a four-class cross-entropy loss L_{A,sub} for subevent detection and a binary cross-entropy loss L_{A,seg} for EVENTSEG.
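The pair representation and the two classification heads can be sketched as follows. The concatenation scheme matches the text; the MLP depth, hidden size, and activation are assumptions (a single hidden layer with ReLU), and random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_representation(h_i, h_j):
    """Event pair representation: the two trigger vectors, their
    element-wise (Hadamard) product, and their difference, concatenated."""
    return np.concatenate([h_i, h_j, h_i * h_j, h_i - h_j])

def mlp(x, W1, W2):
    """Minimal one-hidden-layer MLP with ReLU (depth is an assumption)."""
    return np.maximum(0.0, x @ W1) @ W2

d = 8  # toy trigger-embedding size; RoBERTa-large would give 1024
h_i, h_j = rng.standard_normal(d), rng.standard_normal(d)
x = pair_representation(h_i, h_j)                    # dimension 4d
W1 = rng.standard_normal((4 * d, 16))
scores = mlp(x, W1, rng.standard_normal((16, 4)))    # one score per relation in R
z_logit = mlp(x, W1, rng.standard_normal((16, 1)))   # same-segment score z_{i,j}
```

A softmax over `scores` would give the per-relation probabilities y^r_{i,j}, and a sigmoid over `z_logit` the same-segment probability z_{i,j}.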
Incorporating Subgraph Constraints. The K constraints learned in §4.1 are encoded into the weights w_k and biases b_k, k = 1, · · · , K. Since the input X is considered valid if it satisfies all K constraints, we obtain the predicted probability p of X being valid from Eq. 2. To add the constraints as a regularization term in the loss function of the neural model, we convert p into the negative log space, which is the same as the cross-entropy loss. Thus the loss corresponding to the learned constraints is

L_C = −log p,

and the loss function of the neural model is

L = λ_1 L_{A,sub} + λ_2 L_{A,seg} + λ_3 L_C, (3)

where the λ's are non-negative coefficients to control the influence of each loss term. With the loss function in Eq. 3, we train the model in a supervised way to fine-tune RoBERTa.
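The combined objective can be sketched in a few lines. The negative-log conversion is as stated in the text; the per-term placement of the lambdas is one plausible reading of "a linear combination with non-negative coefficients," not necessarily the paper's exact form.

```python
import numpy as np

def constraint_loss(p, eps=1e-12):
    """Regularization term from the learned constraints: negative log of
    the rectifier network's probability p that the predicted subgraph
    structure is legitimate (cross-entropy against the 'valid' label)."""
    return -np.log(p + eps)

def total_loss(l_sub, l_seg, l_c, lambdas=(1.0, 1.0, 1.0)):
    """Combined objective: subevent annotation loss, EVENTSEG annotation
    loss, and the constraint loss, each scaled by a non-negative
    coefficient."""
    return lambdas[0] * l_sub + lambdas[1] * l_seg + lambdas[2] * l_c
```

Setting a lambda to zero recovers the ablations discussed later (e.g., training without the constraint term).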

Inference
At inference time, to extract relations for the subevent detection task, we input a pair of events into the model and compare the predicted probabilities of the relations, leaving the other two input pairs blank. For EVENTSEG prediction, we let the model predict z_{i,i+1} for each pair of adjacent events (e_i, e_{i+1}) that appear in different sentences. If the model predicts that the two events do not belong to the same segment, there is a segment break between e_i and e_{i+1}. When there are intermediate sentences between the two adjacent event mentions, we treat the sentence that contains e_i as the end of the previous segment. In this way, we provide an approach to solving the two tasks together by automatically learning and enforcing constraints in the neural model. We provide in-depth experimentation for the proposed method in the next section.
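The decoding of pairwise predictions into sentence-level boundaries can be sketched as follows. This is an illustrative reading of the procedure described above (z interpreted as a same-segment indicator, as defined in §3), not the authors' implementation.

```python
def decode_segmentation(event_sentences, z_same):
    """
    Convert same-segment predictions for adjacent event mentions into
    sentence-level segment boundaries. event_sentences[i] is the sentence
    index of event e_i; z_same[i] is the binary prediction of whether e_i
    and e_{i+1} share a descriptive context. When they do not, the sentence
    containing e_i is treated as the end of the previous segment.
    """
    boundaries = set()
    for i, same in enumerate(z_same):
        if not same and event_sentences[i] != event_sentences[i + 1]:
            boundaries.add(event_sentences[i])
    return sorted(boundaries)
```

The returned indices correspond to the positions where q_i = 1 in the segmentation Q_D of §3.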

Experiments
Here we describe the experiments on subevent detection with EVENTSEG prediction as an auxiliary task. We first introduce the corpora used ( §5.1), followed by evaluation for subevent detection and an ablation study for illustrating the importance of each model component ( §5.2- §5.4). We also provide a case study on EVENTSEG prediction ( §5.5) and an analysis of the constraints learned in the model ( §5.6).

Datasets
HiEve The HiEve corpus (Glavaš et al., 2014) contains 100 news articles annotated with membership relations between events; the HiEve dataset has an IAA of 0.69 F1.
Intelligence Community (IC) The IC corpus (Hovy et al., 2013) also contains 100 news articles annotated with membership relations. The articles report violent events such as attacks, wars, etc. We discard the relations involving implicit events annotated in IC, and calculate the transitive closure of both subevent relations and coreference to obtain annotations for all event pairs in text order, as is done for HiEve.
Labeling EVENTSEG We explain how to segment a document using annotations for subevent relations. First, we use the annotated subevent relations (PARENT-CHILD and CHILD-PARENT only) to construct a directed acyclic event graph for each document. Due to the property of subevent relations, each connected component in the graph is actually a tree with one root node, which forms an event complex. If the graph constructed from a document has only one connected component, we remove the root node to separate the event graph into more than one event complex. Since each event complex has a textual span in the document, we obtain several descriptive contexts that may or may not overlap with each other. For documents with non-overlapping descriptive contexts, their segmentations are thereby obtained. In cases where two descriptive contexts of event complexes overlap with each other, if there exists an event whose removal results in non-overlapping contexts, then we segment the contexts as if this event were absent; otherwise, we merge the contexts into one segment. Through this event-based text segmentation, we obtain on average 3.99 and 4.29 EVENTSEGs per document in the HiEve and IC corpora, respectively.
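The first step of this labeling procedure, grouping events into complexes as connected components of the PARENT-CHILD graph, can be sketched as follows. The overlap-resolution and root-removal steps are omitted; events are toy integer ids.

```python
from collections import defaultdict

def event_complexes(num_events, parent_child_pairs):
    """
    Group events into event complexes: connected components of the
    undirected view of the graph built from PARENT-CHILD edges (each
    component is a tree rooted at its top-level event). Events with no
    membership relations are skipped.
    """
    adj = defaultdict(set)
    for parent, child in parent_child_pairs:
        adj[parent].add(child)
        adj[child].add(parent)
    seen, complexes = set(), []
    for e in range(num_events):
        if e in seen or e not in adj:
            continue
        stack, comp = [e], []
        while stack:  # iterative depth-first traversal
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u] - seen)
        complexes.append(sorted(comp))
    return complexes
```

Each returned component would then be mapped to the textual span of its events to obtain one descriptive context.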
We summarize the data statistics in Tab. 1.

Baselines and Evaluation Protocols
On the IC dataset, we compare with two baseline approaches. Araki et al. (2014) propose a logistic regression model along with a voting algorithm for parent event detection. The second baseline uses a data-driven model that incorporates handcrafted constraints with event temporal attributes to extract event-event relations. On HiEve,³ we compare with TACOLM (Zhou et al., 2020), a transformer-based language model fine-tuned on a temporal common sense corpus, and the aforementioned data-driven method, which also serves as the second baseline for IC. We use the same evaluation metric on HiEve as previous methods (Zhou et al., 2020), leaving 20% of the documents out for testing.⁴ The F1 scores of PARENT-CHILD and CHILD-PARENT and their micro-average are reported. In accordance with HiEve, the IC dataset is also evaluated with F1 scores of membership relations instead of BLANC (Araki et al., 2014), while the other settings remain the same as in previous works.

Experimental Setup
We fine-tune the pre-trained 1024-dimensional RoBERTa (Liu et al., 2019) to obtain contextual representations of event triggers in a supervised way, given labels for membership relations and EVENTSEG. Additionally, we employ 18-dimensional one-hot vectors for the part-of-speech tags of tokens in documents to include explicit syntactic features in the model. For each MLP we set the hidden dimension to the average of the numbers of input and output neurons, following Chen et al. (2018). The parameters of the model are optimized using AMSGrad (Reddi et al., 2018), with the learning rate set to 10^−6. The training process is limited to 40 epochs, which is sufficient for convergence.

Results
We report the results for subevent detection on the two benchmark datasets, HiEve and IC, in Tab. 2. Among the baseline methods, the strongest achieves the best F1 on both datasets by integrating event temporal relation extraction, common sense knowledge, and handcrafted logical constraints into its approach. In contrast, our proposed method does not require constraints induced by domain experts, but still outperforms its F1 score by 2.3-2.5%. We attribute this superiority to the use of the connections between subevent relations and the linear discourse structure of segments. Thanks to the strong expressiveness of Rectifier Networks, we utilize these connections via the learning of linear constraints, thus incorporating incidental supervision signals from EVENTSEG. Furthermore, the event pair representation in our model is obtained from broader contexts than the local sentence-level contexts for events used by the baselines. The new representation not only contains more information on events but also naturally provides the necessary clues for determining whether there is a break for EVENTSEG.

³ Despite carefully following the details described in Aldawsari and Finlayson (2019) and communicating with the authors, we were not able to reproduce their results. Therefore, we choose to compare with other methods.

⁴ To make predictions on event complexes, we keep all negative NOREL instances in our experiments instead of strictly following Zhou et al. (2020), where negative instances are down-sampled with a probability of 0.4.
We further perform an ablation analysis to aid the understanding of the model components and report our findings in Tab. 3. Without any constraints, integrating EVENTSEG prediction as an auxiliary task brings an absolute gain of 0.2% and 0.6% in F1 on HiEve and IC respectively over the vanilla single-task model with RoBERTa fine-tuning. This indicates that EVENTSEG information is beneficial to the extraction of membership relations. When membership constraints are added via the regularization term in the loss function, the model's performance on subevent detection is significantly improved, by 2.1% in F1 on the HiEve dataset. Incorporating constraints involving the two tasks further enhances the model performance by 0.5%-1.1%. This indicates that the global consistency ensured within and across EVENTSEGs is important for enhancing the comprehension of subevent memberships.

Case Study for EVENTSEG Prediction
Here we provide an analysis of model performance on the task of EVENTSEG prediction. Though EVENTSEG prediction differs conceptually from text segmentation, methods for text segmentation can serve as baselines for EVENTSEG prediction. We train a recent BERT-based text segmentation model (Lukasik et al., 2020) on the annotations for EVENTSEG in the HiEve and IC corpora and compare our method with this baseline. In Tab. 4 we show the performance of the baseline model and ours for EVENTSEG prediction in terms of F1 on HiEve and IC. Since our solution for EVENTSEG prediction is essentially similar to the cross-segment BERT model in terms of segment representations, our performance is on par with the baseline model.

Analysis on Constraint Learning
We further provide an in-depth qualitative analysis on different types of logical constraints captured by the constraint learning.

Types of Learned Constraints
We expect that both the task-specific constraints (involving membership relations only) considered in previous works and cross-task constraints can be automatically captured in our framework. Accordingly, we analyze these two kinds of constraints separately.
Task-specific Constraints. Since we use three-event subgraphs for constraint learning, transitivity constraints for membership relations, e.g., Y^r_{i,j} = 1 ∧ Y^r_{j,k} = 1 ⟹ Y^r_{i,k} = 1 for r = PARENT-CHILD, can apparently be learned; constraints that typically involve only two events, e.g., symmetry constraints for membership relations like Y^r_{i,j} = Y^{r̄}_{j,i}, r ∈ {PARENT-CHILD, CHILD-PARENT} (where r̄ denotes the reverse relation), can also be learned by assigning the third event e_k to the same event as e_i and treating the relation of (e_i, e_k) as COREF.
Cross-task Constraints. Here we provide an analysis of the cross-task constraints involving both membership relations and EVENTSEG information learned in our framework. One learned cross-task constraint, for example, is

· · · + 0.09x_5 + 0.13x_6 + 0.25x_7 + 0.04x_8 − 0.18x_9 + · · · + 0.02x_18 + 0.07x_19 + · · · + 0.05 ≥ 0,

where x_1 and x_6 denote the variables for Y^r_{i,j} = 1 and Y^r_{j,k} = 1 (r = CHILD-PARENT) respectively, and they both have positive coefficients. If we look at the expected labels for P(X_{i,k}), we can see that x_18 and x_19, which denote the variables for Y^r_{i,k} = 1, Z_{i,k} = 0 and Y^r_{i,k} = 1, Z_{i,k} = 1, have coefficients of 0.02 and 0.07, respectively. The two positive coefficients for x_18 and x_19 indicate that (a) (e_i, e_k) may have a CHILD-PARENT relation, and (b) (e_i, e_k) is more likely to be in the same EVENTSEG than in different EVENTSEGs.

Table 3: Ablation study results for subevent detection. The results on both datasets are the micro-average of PARENT-CHILD and CHILD-PARENT in terms of precision, recall, and F1. "+ Membership Constraints" denotes adding automatically learned constraints for membership relations upon the joint training model. The row of "+ Membership + EVENTSEG" shows the results of the complete model.

Qualitative Analysis
We set K to 10, since we observe that a smaller number of constraints decreases the learning accuracy, while increasing K does not have a noticeable influence. We optimize the parameters using Adam with a learning rate of 0.001, and the training process is limited to 1,000 epochs. We show the performance of constraint learning in Tab. 5. Since the constraints for membership relations are declarative hard constraints like the symmetry and transitivity constraints in §5.6.1, the accuracy of constraint learning on them is equal or close to 100%. However, the hard-to-articulate constraints that incorporate EVENTSEG information are more difficult to learn, and thus the Rectifier Network achieves a lower accuracy on them.

Conclusion
In this work we propose an automatic and efficient way of learning and enforcing constraints for subevent detection. Noticing the connections between subevent detection and EVENTSEG, we adopt EVENTSEG prediction as an auxiliary task that provides effective incidental supervision signals. Through learning and enforcing constraints that can express hard-to-articulate rules, the logical dependencies of both tasks are captured to regularize the model towards consistent inference. The proposed approach outperforms SOTA data-driven methods on benchmark datasets and provides comparable results with recent text segmentation methods on EVENTSEG prediction. This demonstrates the effectiveness of the framework on subevent detection and its potential for solving other structured prediction tasks in NLP.

Ethical Considerations
This work does not present any direct societal consequence. The proposed method aims at supporting high-quality extraction of event complexes from documents with awareness of discourse structures and automated constraint learning. We believe this study leads to intellectual merits of developing robust event-centric information extraction technologies. It also has broad impacts, since constraints and dependencies can be broadly investigated for label structures in various natural language classification tasks. The acquired event knowledge, on the other hand, can potentially benefit various downstream NLU and NLG tasks. As with any information extraction method, the real-world open-source articles from which information is extracted may include societal biases. Extracting event complexes from articles with such biases may propagate the bias into the acquired knowledge representation. While not specifically addressed in this work, the ability to incorporate logical constraints and discourse consistency can be a way to mitigate societal biases.