Probing Representations for Document-level Event Extraction

The probing classifiers framework has been employed for interpreting deep neural network models in a variety of natural language processing (NLP) applications, but studies have largely focused on sentence-level NLP tasks. This work is the first to apply the probing paradigm to representations learned for document-level information extraction (IE). We designed eight embedding probes to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction, and applied them to the representations acquired by trained models from three different LLM-based document-level IE approaches on a standard dataset. We found that trained encoders from these models yield embeddings that can modestly improve argument detection and labeling but only slightly enhance event-level tasks, at the cost of trade-offs in information helpful for coreference and event-type prediction. We further found that encoder models struggle with document length and cross-sentence discourse.


Introduction
Relation and event extraction (REE) focuses on identifying clusters of entities participating in a shared relation or event from unstructured text, which frequently contains a fluctuating number of such instances. While the field of information extraction (IE) started out building training and evaluation REE datasets primarily concerned with documents, researchers have been overwhelmingly focusing on sentence-level datasets (Li et al., 2013; Du and Cardie, 2020). Nevertheless, many IE tasks require a more comprehensive understanding that often extends to the entire input document, leading to challenges such as length and multiple events when embedding full documents. Consequently, document-level datasets continue to pose challenges for even the most advanced models today (Das et al., 2022).
REE is considered an essential and popular task, encompassing various variations. One particularly general approach is template filling1, which can subsume certain other IE tasks through reformatting. In this regard, our focus lies on template-extraction methods with an end-to-end training scheme, where texts serve as the sole input.
Multi-task NLP models often support and are evaluated on the task. As a result, we have seen frameworks with diverse underlying assumptions and architectures for the task. Nevertheless, high-performing modern models all leverage and fine-tune pre-trained neural contextual embedding models, such as variants of BERT (Devlin et al., 2019), owing to the generalized performance leap introduced by transformers and pre-training.
It is crucial to understand these representations of the IE frameworks, as doing so reveals model strengths and weaknesses. However, unlike lookup-style embeddings such as GloVe (Pennington et al., 2014), these neural contextualized representations are inherently difficult to interpret, leading to ongoing research efforts focused on analyzing their encoded information (Tenney et al., 2019b; Zhou and Srikumar, 2021; Belinkov, 2022). This work is inspired by various sentence-level embedding interpretability works, including Conneau et al. (2018) and Alt et al. (2020).
Yet, to the best of our knowledge, no prior work has examined the embedding of features that exist only at the document-level scale. Hence, our work aims to fill this gap by investigating the factors contributing to model performance. Specifically, we analyze the impact of three key elements: contextualization of encoding, fine-tuning, and encoder and post-encoding architectures. Our contributions can be summarized as follows:

• We identified the necessary document-level IE understanding capabilities and created a suite of probing tasks2 corresponding to each.
• We present a fine-grained analysis of how these capabilities relate to encoder layers, full-text contextualization, and fine-tuning.
• We compare IE frameworks of different input and training schemes and discuss how architectural choices affect model performance.

Probing and Probing Tasks
The ideal learned embedding for spans should include features and patterns (or generally, "information") that independently indicate similarity to other span embeddings of the same entity mention, of the same event, and so on; we set out to test whether this happens for trained encoders.
Probing uses simplified tasks and classifiers to understand what information is encoded in the embeddings of input texts. We train a given document-level IE model (which fine-tunes its encoder during training), and at test time capture the output of its encoder (the document representations) before it is further used in the model-specific extraction process. We then train and run our probing tasks, each assessing an encoding capability of the encoder.
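The freeze-and-probe protocol above can be sketched as follows. This is a minimal illustration: synthetic vectors stand in for captured encoder outputs, one dimension carries the probed property, and a hand-rolled logistic regression stands in for the actual probing classifier. None of the names or dimensions here are from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen encoder output: 200 synthetic 16-dim "document
# embeddings". Dimension 3 carries the probed property, a hypothetical
# stand-in for, e.g., an event-count bucket. The encoder is never
# updated during probing; only the probe below is trained.
X = rng.normal(size=(200, 16))
y = (X[:, 3] > 0).astype(float)  # probed label, linearly decodable

# Minimal logistic-regression probe trained by gradient descent.
w = np.zeros(16)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= 0.5 * (X.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == (y > 0.5))
print(f"probe accuracy: {acc:.2f}")  # high accuracy => info is encoded
```

High probe accuracy is read as evidence that the probed information is (linearly) recoverable from the frozen representations; chance-level accuracy suggests it is not.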
Drawing inspiration from many sentence-level probing works, we adopt some established setups and tasks, but with an emphasis on probing tasks pertaining to document and event understanding.
We use the MUC document-level IE dataset (muc, 1991) as our base dataset (details in Section 3.3) to develop evaluation probing tasks. We present our probing tasks in Figure 2. This section outlines the probing tasks used to assess the effectiveness of the learned document representations.
When designing these probing tasks, our goal is to ensure that each task accurately measures a specific and narrow capability, which was often a subtask in traditional pipelined models. Additionally, we want to ensure fairness and generalizability in these tasks. Therefore, we avoid using event triggers in our probing tasks, especially considering that not all models use them during training.
We divide our probing tasks into three categories: surface information, generic semantic understanding, and event understanding.
Surface information These tasks assess whether text embeddings encode the basic surface characteristics of the document they represent. Similar to the sentence length task proposed by Adi et al. (2017), we employ a word count (WordCt) task and a sentence count (SentCt) task, which predict the number of words and sentences in the text, respectively. Labels are grouped into 10 count-based buckets constructed to ensure a near-uniform distribution.
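A quantile-based bucketing like the following yields the near-uniform label distribution described above. This is an illustrative sketch; the paper does not specify its exact bucketing scheme, and `bucketize` is a hypothetical helper.

```python
import numpy as np

def bucketize(counts, n_buckets=10):
    """Map raw word/sentence counts to near-uniform class labels
    using quantile boundaries."""
    counts = np.asarray(counts)
    # Interior quantile edges; np.digitize then assigns ids 0..n-1.
    edges = np.quantile(counts, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return np.digitize(counts, edges)

labels = bucketize(np.arange(100, 600))  # toy document lengths
print(np.bincount(labels))               # [50 50 50 50 50 50 50 50 50 50]
```

Quantile edges adapt to the empirical length distribution, so every bucket receives roughly the same number of training examples regardless of how skewed document lengths are.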
Semantic information These tasks go beyond surface and syntax, capturing the meaning conveyed across sentences for higher-level understanding. Coreference (Coref) is a binary classification task that determines whether the embeddings of two spans of tokens ("mentions") refer to the same entity. Due to MUC's annotation, all embeddings used are known role-fillers; resolving their coreference is strictly necessary for downstream document-level IE to avoid duplicates. To handle varying mention span lengths in MUC, we use the first token's embedding, which allows effective probing classifier training and avoids insufficient training signal at later positions. A similar setup applies to role-filler detection (IsArg), which predicts whether a span embedding is an argument of any template; this task parallels argument detection in classical models. Furthermore, the role classification task (ArgTyp) involves predicting the argument type of a role-filler span embedding, corresponding to the argument extraction step in classical pipelines.
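The first-token representation and probe input for Coref can be sketched as below. The concatenation of the two mention vectors is a hypothetical pairing scheme, common in probing work but not spelled out in this paper; function names and the 768-dim size are illustrative assumptions.

```python
import numpy as np

def span_first_token(doc_embeddings, span):
    """Represent a mention span by its first token's embedding,
    sidestepping variable span lengths (as done for Coref/IsArg)."""
    start, end = span  # token offsets, end exclusive
    return doc_embeddings[start]

def coref_pair_features(doc_embeddings, span_a, span_b):
    """Hypothetical pairing scheme: concatenate the two first-token
    embeddings as input to the binary coreference probe."""
    return np.concatenate([span_first_token(doc_embeddings, span_a),
                           span_first_token(doc_embeddings, span_b)])

emb = np.random.default_rng(0).normal(size=(50, 768))  # 50 tokens
x = coref_pair_features(emb, (3, 6), (20, 22))
print(x.shape)  # (1536,)
```

The IsArg and ArgTyp probes take a single `span_first_token` vector as input, with a binary or multi-class head respectively.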
Event understanding The highest level of document-level understanding is event understanding.
To test the model's capability in detecting events, we use an event count task (EvntCt), where the probing classifier is given the full-text embedding and asked to predict the number of events that occur in the text. We split all count labels into three buckets for class balancing. To understand how span embeddings help with event deduplication and argument linking, our co-event task (CoEvnt) takes two argument span embeddings and predicts whether they are arguments of the same event or of different ones. Additionally, the event type task (EvntTyp n) involves predicting the type of the event template from the embeddings of the first tokens of n role fillers. This task is similar to the classical event typing subtask, which often uses triggers as inputs; performing it lets us assess whether fine-tuning makes event-type information explicit.
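CoEvnt training pairs can be derived from gold templates roughly as follows: argument pairs within one template are positives, pairs across templates are negatives. This construction is a plausible sketch; the paper does not specify its sampling, and the data layout is hypothetical.

```python
from itertools import combinations

def coevnt_pairs(templates):
    """Build (span_a, span_b, label) examples for the CoEvnt probe.
    'templates' maps a template id to its list of argument spans.
    Label 1: both spans argue the same event; 0: different events."""
    pairs = []
    for args in templates.values():                 # within-template positives
        for a, b in combinations(args, 2):
            pairs.append((a, b, 1))
    for ta, tb in combinations(list(templates), 2): # cross-template negatives
        for a in templates[ta]:
            for b in templates[tb]:
                pairs.append((a, b, 0))
    return pairs

toy = {"t1": [(0, 2), (5, 6)], "t2": [(10, 12)]}
print(coevnt_pairs(toy))
# [((0, 2), (5, 6), 1), ((0, 2), (10, 12), 0), ((5, 6), (10, 12), 0)]
```

In practice such pair sets are often class-imbalanced (cross-template pairs dominate), so downsampling negatives may be needed for a fair probe.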
Although syntactic information is commonly used in probing tasks, document-level datasets have limited syntactic annotations due to the challenges of accurately annotating details like tree-depth data at scale. While the absence of these tasks is not ideal, we believe it does not significantly impact our overall analysis.
Experiment Setup

IE Frameworks
We train the following document-level IE frameworks for 5, 10, 15, and 20 epochs on MUC, and observe the lowest validation loss or highest event F1 score at epoch 20 for all of these models.
DyGIE++ (Wadden et al., 2019) is a framework capable of named entity recognition, relation extraction, and event extraction tasks. It achieves all tasks by enumerating and scoring sections (spans) of encoded text, and using the relations between spans to detect triggers and construct event outputs.
GTT (Du et al., 2021) is a sequence-to-sequence event-extraction model that performs the task end-to-end, without the need for labeled triggers. It is trained to decode a serialized template, with tuned decoder constraints.
TANL (Paolini et al., 2021) is a multi-task sequence-to-sequence model that fine-tunes the T5 model (Raffel et al., 2020) to translate text input into augmented natural language, with the in-text augmented parts extracted as triggers and roles. It uses a two-stage approach for event extraction: it first decodes (translates) the input text to detect triggers, then decodes the related arguments for each predicted trigger.

Probing Model
We use a similar setup to SentEval (Conneau et al., 2018) with an extra layer. While sentence-level probing can use all dimensions of embeddings as input, we add an attention-weighted layer right after the input layer, to simulate a response to a trained query and to reduce dimensions. The 768-dimension layer output is then trained using the same structure as SentEval. Specific training details can be found in Appendix D.
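The attention-weighted reduction can be sketched as below: token embeddings are scored against a query, softmax-normalized, and summed into one 768-dim vector for the SentEval-style MLP. In the actual probe the query is a trained parameter; here a fixed random query is used purely for illustration.

```python
import numpy as np

def attention_pool(token_embs, query):
    """Attention-weighted pooling: reduce a (n_tokens, 768) matrix to
    one 768-dim vector by softmax-weighting tokens against a query,
    simulating a response to a trained query."""
    scores = token_embs @ query              # (n_tokens,) relevance scores
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ token_embs              # convex combination, (768,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(120, 768))         # toy document of 120 tokens
pooled = attention_pool(tokens, rng.normal(size=768))
print(pooled.shape)  # (768,)
```

Unlike mean pooling, the learned query lets the probe emphasize the tokens most relevant to the property being probed, while still producing a fixed-size input regardless of document length.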

Dataset
We use MUC-3 and MUC-4 as our document-level data sources to create probing tasks, thanks to their rich coreference information.

Result and Analysis
We present our data in Table 1, with results for more epochs available in Table 7 in Appendix E.

Document-level IE Training and Embeddings
Figure 3 shows that embedded semantic and event information fluctuates during IE training but steadily differs from the untrained BERT-base baseline. For the document representation, trained encoders significantly enhance embeddings for event detection, as suggested by the higher accuracy in event count predictions (EvntCt↑).

Probing Performance of Different Models Table 1 highlights the strengths and weaknesses of encoders trained using different IE frameworks. In addition to the above observations, we see that DyGIE++ and GTT document embeddings capture event information (EvntCt↑) only marginally better than the baseline, whereas the TANL-finetuned encoder often has subpar performance across tasks. This discrepancy may be attributed to TANL's use of T5 instead of BERT, the latter of which may be better suited to the task, and to the fact that TANL employs the encoder only once but the decoder multiple times, resulting in less direct weight updates for the encoder and consequently lower performance in probing tasks (and on the document-level IE task itself). Surface information encoding (Figure 6 in Appendix E) differs significantly across models.

Sentence and Full Text Embedding
As demonstrated in Table 1, embedding sentences individually and then concatenating them can be more effective for IE tasks than using embeddings produced by a fine-tuned encoder applied to entire documents. Notably, contextually encoding the full text often results in diminished performance in argument detection (IsArg↓), labeling (ArgTyp↓), and particularly in event detection (EvntCt↓) for shorter texts, as highlighted in Table 2. These results suggest that encoders like BERT might not effectively utilize cross-sentence discourse information, and a scheme that can do so remains an open problem. However, contextualized embedding with access to the full text does encode more event information in its output representations for spans (CoEvnt↑).
Encoding layers Lastly, we experiment to locate where IE information is encoded across the layers of the encoders, a common topic in previous work (Tenney et al., 2019a). Using GTT with the same hyperparameters as its publication, its fine-tuned encoder shows semantic information encoded mostly (0-indexed) up to layer 7 (IsArg↑, ArgTyp↑); meanwhile, event detection capability increases throughout the encoder (CoEvnt↑, EvntCt↑). Surface information (Figure 5 in Appendix E) generally remains the same.

Conclusion
Our work pioneers the application of probing to the representations used at the document level, specifically in event extraction. We observed that the semantic and event-related information embedded in representations varied throughout IE training. While IE training improves capabilities like event detection and argument labeling, it often compromises embedded coreference and event-typing information. Comparisons of IE frameworks uncovered that current models at best marginally outperform the baseline in capturing event information. Our analysis also suggested a potential shortcoming of encoders like BERT in effectively utilizing cross-sentence discourse information. In summary, our work provides the first insights into document-level representations, suggesting new research directions for optimizing these representations for event extraction tasks.

Limitations
Dataset While other document-level IE datasets exist, none offer rich details like MUC. For example, document-level n-ary relation datasets like SciREX (Jain et al., 2020) can only cover three of the six semantic and event knowledge probing tasks, and the dataset has issues with missing data. Additionally, we focus on template-filling-capable IE frameworks, as they show more generality in applications (and are supported by more available models like GTT), excluding classical relation extraction datasets like DocRED (Yao et al., 2019).
Scoping While we observe ways to improve document-level IE frameworks, creating new frameworks and testing them are beyond the scope of this probing work.
Embedding length and tokenizer All models we investigated use an encoder with an input cap of 512 tokens, leaving many entities inaccessible. In addition, some models use tokenizers that split words into fewer tokens and, as a result, may access more content in full-text embedding probing tasks. Note also that because of tokenizer differences, despite our effort to make all probing tasks fair, some models might not see up to 2.1% of the training data while others do.
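The effect of the 512-token cap can be quantified with a small helper like the following (a hypothetical utility, not from the paper): it reports what fraction of a corpus's tokens, and hence of its potential entity mentions, fall beyond the encoder's reach.

```python
def truncated_fraction(token_lengths, cap=512):
    """Fraction of corpus tokens beyond a hard encoder input cap.
    token_lengths: per-document token counts under some tokenizer."""
    total = sum(token_lengths)
    visible = sum(min(n, cap) for n in token_lengths)
    return 1 - visible / total

# Toy corpus: two documents exceed the cap, two fit (or exactly fill it).
print(truncated_fraction([300, 700, 512, 1024]))  # ≈ 0.276
```

Because different tokenizers yield different `token_lengths` for the same text, this fraction, and thus the content each model actually sees, varies across frameworks, which is the fairness caveat noted above.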

A Definition of Template Filling
Assume a predefined set of event types T_1, ..., T_m, where m represents the total number of template types. Every event template comprises a set of k roles, denoted r_1, ..., r_k. For a document made up of n words x_1, x_2, ..., x_n, the template filling task is to extract zero or more templates. The number of templates is not given as input, and each template can represent an n-ary relation or an event.
Each extracted template contains k + 1 slots: the first slot is dedicated to the event type, which is one of T_1, ..., T_m. Each of the subsequent k slots represents an event role, which is one of r_1, ..., r_k. The system's job is to assign zero or more entities (role-fillers) to the corresponding role in each slot.
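Concretely, one extracted template under this definition can be pictured as the structure below. The role names are hypothetical MUC-style examples, not the dataset's exact inventory, and the dict layout is only an illustration of the k + 1 slots.

```python
# One template: an event-type slot plus k role slots, each holding
# zero or more role-fillers.
template = {
    "event_type": "attack",  # one of T_1, ..., T_m
    "roles": {               # k role slots r_1, ..., r_k
        "perpetrator": ["FMLN detachments"],
        "target": ["President Alfredo Cristiani's residences"],
        "victim": [],         # a role slot may be left empty
    },
}

# A document maps to zero or more such templates; the count is not
# given in advance and must be inferred by the system.
predicted = [template]
print(len(predicted), template["event_type"])  # 1 attack
```

This shape also makes clear why template filling subsumes some other IE tasks: a binary relation is simply a template with two filled role slots.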

B MUC dataset
The MUC-3 dataset (1991) comprises news articles and documents manually annotated for coreference resolution and for resolving ambiguous references in the text. The MUC-4 dataset (1992), on the other hand, expanded the scope to include named entity recognition and template-based information extraction.
We used a portion of the MUC-3 and MUC-4 datasets for template filling and labeled the dataset with triggers based on event types for our probing tasks. The triggers were added to make the dataset compatible with TANL, enabling multi-template prediction.

C IE Framework Parameters
See Table 3.

D Probing Model Details
See Table 6.

E Additional Results
See Figure 5 and Figure 6 for more probing results on MUC. See Table 8 for more results on WikiEvents, a smaller (246-example) dataset.

Figure 1 :
Figure 1: Overview of an ideal event extraction example and probing. Ideally, after contextualization by an encoder fine-tuned on the IE task, the per-token embedding can capture richer semantic and related event information, thereby facilitating an easier model-specific extraction process. Our probing tasks test how different frameworks and conditions (e.g., IE training, coreference information access) affect the information captured by the embeddings.

Figure 2 :
Figure 2: Probing task illustrations. Each • refers to a span embedding (an embedding of a token in our experiments), and each non-gray • is an embedding known to be a role filler. See Section 2 for full descriptions.

Figure 3 :
Figure 3: Probing accuracy on event (left) and semantic (right) information over document-level IE training epochs. Results are averaged over 5 random seeds (with standard deviation error bars) and color-coded by probing task. Trained encoders gain and lose information in their generated embeddings as they are trained for the IE tasks.

Figure 4 :
Figure 4: Probing accuracy on event (upper) and semantic (lower) information over encoder layers, from GTT trained for 18 epochs and BERT-base.

Figure 5 :
Figure 5: Probing accuracy on surface information over encoder layers, from GTT trained for 18 epochs and BERT-base.

Figure 6 :
Figure 6: Probing accuracy on surface information over document-level IE training epochs.


Table 1 :
Probing task test average accuracy. IE frameworks are trained for 20 epochs on MUC, and we run probing tasks on the input representations. We compare the 5-trial averaged test accuracy on full-text embeddings, and on concatenations of sentence embeddings from the same encoder, to the untrained BERT baseline. IE-F1 refers to the model's F1 score on the MUC test set. Underlined numbers are the best within the same embedding method; bold numbers are the best overall. We further report data over more epochs in Table 7, and results on WikiEvents in Table 8 in Appendix E.

Table 4 :
GTT Model Parameters

Table 5 :
TANL Model Training Parameters

Table 6 :
Probing model parameters. Values in parentheses were tested but not used, often due to lower performance.