How Good Is the Model in Model-in-the-loop Event Coreference Resolution Annotation?

Annotating cross-document event coreference links is a time-consuming and cognitively demanding task that can compromise annotation quality and efficiency. To address this, we propose a model-in-the-loop annotation approach for event coreference resolution, where a machine learning model suggests only the likely coreferring event pairs. We evaluate the effectiveness of this approach by first simulating the annotation process and then, using a novel annotator-centric Recall-Annotation effort trade-off metric, comparing the results of various underlying models and datasets. We finally present a method for obtaining 97% recall while substantially reducing the workload required by a fully manual annotation process.


Introduction
Event Coreference Resolution (ECR) is the task of identifying mentions of the same event either within or across documents. Consider the following excerpts from three related documents:

e_1: 55 year old star will replace_{m_1} Matt Smith, who announced in June that he was leaving the sci-fi show.

e_2: Matt Smith, 26, will make his debut in 2010, replacing_{m_2} David Tennant, who leaves at the end of this year.

e_3: Peter Capaldi takes over_{m_3} Doctor Who . . . Peter Capaldi stepped into_{m_4} Matt Smith's soon to be vacant Doctor Who shoes.

e_1, e_2, and e_3 are example sentences from three documents in which the event mentions are highlighted and subscripted with their respective identifiers (m_1 through m_4). The task of ECR is to automatically form the two clusters {m_1, m_3, m_4} and {m_2}. We refer to any pair between the mentions of a cluster, e.g., (m_1, m_3), as an ECR link. Any pair formed across two clusters, e.g., (m_1, m_2), is referred to as a non-ECR link.
Annotating ECR links can be challenging due to the large volume of mention pairs that must be compared. The annotation task becomes increasingly time-consuming as the number of events in the corpus grows. As a result, this task requires substantial mental effort from the annotator and can lead to poor-quality annotations (Song et al., 2018; Wright-Bettner et al., 2019). Indeed, an annotator has to examine multiple documents simultaneously, often relying on memory to identify all the links, which can be an error-prone process.
To reduce the cognitive burden of annotating ECR links, annotation tools can provide an integrated model-in-the-loop component for sampling likely coreferent mention pairs (Pianta et al., 2008; Yimam et al., 2014; Klie et al., 2018). These systems typically store a knowledge base (KB) of annotated documents and then use this KB to suggest relevant candidates. The annotator can then inspect the candidates and choose a coreferent event if one is present.
The model's querying and ranking operations are typically driven by machine learning (ML) systems that are trained either actively (Pianta et al., 2008; Klie et al., 2018; Bornstein et al., 2020; Yuan et al., 2022) or on batches of annotations (Yimam et al., 2014). While there have been advances in suggestion-based annotation, there is little to no work evaluating the effectiveness of these systems, particularly for the use case of ECR. Specifically, both the overall coverage, or recall, of the annotation process and the degree of annotator effort needed depend on the performance of the model. To address this shortcoming, we offer the following contributions:

[Figure 1 panels: Annotated Event Store; Target Mention (m_1): "55 year old star will replace Matt Smith, who announced in June that he was leaving the sci-fi show."; Candidate 1 (m_2): "Matt Smith, 26, will make his debut in 2010, replacing David Tennant, who leaves at the end of this year."; Candidate 2 (m_4): "Peter Capaldi stepped into Matt Smith's soon to be vacant Doctor Who shoes."; … Skipped; Annotator's Decisions.]

Figure 1: For the target mention (m_1), the Annotated Event Cluster store presents three potential coreferent candidates (m_2, m_4, and m_*). The ranking module (an ECR scorer) then ranks them based on their semantic similarity to m_1. The annotator reviews each candidate one at a time and makes decisions on coreference. m_* is skipped after m_4 is found to be coreferent. The cluster store is then updated based on these decisions.

Annotation Methodology
We implement an iterative model-in-the-loop methodology for annotating ECR links in a corpus containing annotated event triggers. This approach has two main components: (1) the storage and retrieval of annotated event clusters, which are then compared with each new target event, and (2) an ML model that ranks and prunes the sampled candidate clusters by evaluating their semantic similarity to the target mention.
As illustrated in Figure 1, our annotation workflow queries the Annotated Event Store for the target event (m_1), retrieving three potential coreferring candidates (m_2, m_*, and m_4). The ranking module then evaluates these candidates based on their lexical and semantic similarities to m_1. The annotator then compares each candidate to the target and determines whether they are coreferent. Upon finding a coreferent candidate, the target is merged into that candidate's cluster, and any remaining candidates (m_*) are skipped.
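The loop below is a minimal sketch of this workflow. The names (EventStore, ranker, ask_annotator) are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of the model-in-the-loop annotation loop (illustrative names).
from collections import defaultdict


class EventStore:
    """Annotated event clusters: cluster id -> list of mentions."""

    def __init__(self):
        self.clusters = defaultdict(list)
        self._next_id = 0

    def candidates(self):
        # one representative mention per annotated cluster
        return [(cid, ms[0]) for cid, ms in self.clusters.items() if ms]

    def add(self, cluster_id, mention):
        if cluster_id is None:  # no coreferent cluster found: start a new one
            cluster_id, self._next_id = self._next_id, self._next_id + 1
        self.clusters[cluster_id].append(mention)


def annotate(targets, store, ranker, ask_annotator, k=5):
    """ranker(target, candidates) sorts (cluster_id, mention) pairs by the
    pairwise coreference score; ask_annotator(target, mention) is the human
    yes/no coreference decision."""
    for target in targets:
        ranked = ranker(target, store.candidates())[:int(k)]
        chosen = None
        for cluster_id, candidate in ranked:
            if ask_annotator(target, candidate):
                chosen = cluster_id
                break  # remaining candidates are skipped
        store.add(chosen, target)
```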

Ranking
We investigate three separate methods to drive the ranking of candidates, distinguished by their computational cost. We use these methods to generate the average pairwise coreference scores between the mentions of the candidate and target events, and then rank the candidates by these scores.

Cross-encoder (CDLM): In this method, we use the fine-tuned cross-encoder ECR system of Caciularu et al. (2021) to generate pairwise mention scores. Their state-of-the-art system uses a modified Longformer (Beltagy et al., 2020) as the underlying LM to generate document-level representations of the mention pairs (detailed in §B.1). More specifically, we generate a unified representation (Eq. 1) of the mention pair (m_i, m_j) by concatenating the pooled output of the transformer (E_CLS), the outputs of the individual event triggers (E_{m_i}, E_{m_j}), and their element-wise product. Thereafter, pairwise scores are generated for each mention pair by passing these representations through a Multi-Layer Perceptron (MLP) (Eq. 2) trained with gold-standard labels for supervision.
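Equations 1 and 2 do not survive in this excerpt; from the description above they can be reconstructed roughly as follows (the exact notation in the original may differ):

e_{(i,j)} = \left[\, E_{CLS} \,;\, E_{m_i} \,;\, E_{m_j} \,;\, E_{m_i} \odot E_{m_j} \,\right] \quad (1)

s_{(i,j)} = \mathrm{MLP}\!\left(e_{(i,j)}\right) \quad (2)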
BERTScore (BERT): To calculate the BERTScore (BS) between two mentions, we first construct a combined sentence S_bert(m) (Shi and Lin, 2019) for a mention m by concatenating the mention text (t_m) and its corresponding sentence (S_m), as depicted in Equation 3. Subsequently, we compute the BS for each mention pair using S_bert(m) and t_m separately, and extract the F1 from each. We then take the weighted average of the two scores, as shown in Equation 4, as our ranking metric. This process, carried out using the distilbert-base-uncased model (Sanh et al., 2019), requires approximately seven seconds to complete on each test set.

Lemma Similarity (Lemma): The lemma similarity method emulates the annotation process carried out by human annotators when determining coreference based on keyword comparisons between two mentions. To estimate this similarity, we compute the token overlap (Jaccard similarity; JS) between the triggers and between the sentences containing the respective mentions, and take a weighted average of the two similarities (as in Eq. 4), as shown in Eq. 5.
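A rough sketch of these two lightweight scorers is given below, assuming hypothetical helper names, the λ = 0.7 weight from Appendix C, and the bert-score and spaCy packages for illustration; the paper's own implementation may differ:

```python
# Sketch of the BERTScore- and lemma-based pairwise scorers (illustrative only).
import spacy                                    # requires: python -m spacy download en_core_web_sm
from bert_score import score as bert_score      # pip install bert-score

nlp = spacy.load("en_core_web_sm")
LAMBDA = 0.7  # trigger vs. sentence similarity weight (Appendix C)


def bert_pair_score(m_i, m_j):
    """BERTScore-based coreference score for a mention pair.
    Each mention is a dict with 'trigger' (t_m) and 'sentence' (S_m);
    S_bert(m) is the trigger concatenated with its sentence."""
    s_i = m_i["trigger"] + " " + m_i["sentence"]
    s_j = m_j["trigger"] + " " + m_j["sentence"]
    _, _, f1_sent = bert_score([s_i], [s_j], model_type="distilbert-base-uncased")
    _, _, f1_trig = bert_score([m_i["trigger"]], [m_j["trigger"]],
                               model_type="distilbert-base-uncased")
    # weighted average of the trigger and combined-sentence F1 scores
    return LAMBDA * f1_trig.item() + (1 - LAMBDA) * f1_sent.item()


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def lemma_pair_score(m_i, m_j):
    """Lemma (Jaccard) similarity between the triggers and between the sentences."""
    def lemmas(text):
        return [tok.lemma_.lower() for tok in nlp(text) if not tok.is_punct]
    trig = jaccard(lemmas(m_i["trigger"]), lemmas(m_j["trigger"]))
    sent = jaccard(lemmas(m_i["sentence"]), lemmas(m_j["sentence"]))
    return LAMBDA * trig + (1 - LAMBDA) * sent
```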
No Ranking (Random): For our baseline, we employ a method that picks candidate-mention pairs directly through random sampling, without ranking, providing a reference point for evaluating the effectiveness of the above three ranking techniques.

Pruning
To control the number of comparisons between candidate and target events, we restrict our selection to the top-k ranked candidates. To refine our analysis, we employ non-integer k values, allowing for the inclusion of an additional candidate with probability equal to the decimal part of k. We vary k from 2 to 20 in increments of 0.5 and then investigate its relation to recall and effort in §4; a minimal sketch of this pruning step follows below.
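The sketch below illustrates the fractional top-k pruning just described (the function name is hypothetical):

```python
import random


def prune_top_k(ranked_candidates, k, rng=random):
    """Keep the top-k candidates; for a non-integer k, one extra candidate is
    kept with probability equal to the decimal part of k (e.g., k = 2.5 keeps
    a third candidate half of the time)."""
    base = int(k)
    extra = 1 if rng.random() < (k - base) else 0
    return ranked_candidates[:base + extra]
```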

Simulation
To evaluate the ranking methods, we conduct annotation simulations on the events in the ECB+ and GVC development and test sets. These simulations follow the same annotation methodology of retrieving and ranking candidate events for each target, but use the ground-truth clustering in place of the annotator's decisions. By running simulations with different ranking methods and analyzing their performance, we effectively isolate and assess each approach.

Evaluation Methodology
We evaluate the performance of model-in-the-loop annotation with each ranking method through simulation on two aspects: (1) how well it finds the coreferent links, and (2) how much effort it would take to annotate the links using the ranking method.

Recall-Annotation Effort Tradeoff
Recall: The recall metric evaluates the percentage of ECR links that are correctly identified by the suggestion model. It is calculated as the fraction of target mentions for which the true coreferent candidate appears among the suggested candidates. Recall errors are introduced when the coreferent candidate is erroneously removed by the top-k pruning.
Comparisons: A unit of effort represents the comparison between a candidate and a target mention that an annotator would have to make during annotation. We count the sampled candidates for each target and stop counting when the coreferent candidate is found. For example, the number of comparisons for the target m_1 in Figure 1 is 2 (m_2 and m_4). We count this number for each target event and report the sum as Comparisons.
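Putting the pieces together, the simulation and the two measures can be sketched as follows, reusing the prune_top_k sketch above; the function names and the gold_cluster oracle (which stands in for the annotator) are illustrative assumptions:

```python
def simulate(targets, retrieve_candidates, rank, gold_cluster, k):
    """Simulated annotation pass; returns (recall, comparisons).

    retrieve_candidates(m): candidate mentions from the annotated event store
    rank(m, cands):         candidates sorted by pairwise coreference score
    gold_cluster(m):        ground-truth cluster id (simulates the annotator)
    """
    found, total_links, comparisons = 0, 0, 0
    for target in targets:
        candidates = retrieve_candidates(target)
        suggested = prune_top_k(rank(target, candidates), k)
        if not any(gold_cluster(c) == gold_cluster(target) for c in candidates):
            comparisons += len(suggested)  # genuinely new cluster: all suggestions rejected
            continue
        total_links += 1
        for i, c in enumerate(suggested, start=1):
            if gold_cluster(c) == gold_cluster(target):
                found += 1        # true coreferent candidate was suggested
                comparisons += i  # stop counting once it is found
                break
        else:
            comparisons += len(suggested)  # recall error: pruned out by top-k
    recall = found / total_links if total_links else 1.0
    return recall, comparisons
```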

Analysis and Discussion
We present an analysis of the various ranking methods employed in our study, highlighting the performance and viability of each approach. We apply the ranking methods to the test sets of ECB+ and GVC, estimate the Recall and Comparisons measures for different k values, and collate them into the plots shown in Figure 2.

Performance Comparison: The performance improvement of CDLM over BERT, and of BERT over Lemma, can be quantified by examining the plots for the ECB+ and GVC datasets. For example, when targeting 95% recall on the ECB+ corpus, CDLM provides an almost two-fold improvement over BERT, reducing the number of comparisons to nearly half of the latter. Both CDLM and BERT outperform Lemma by a significant margin, and all three are drastically better than the Random baseline (see Fig. 2). Interestingly, for GVC, the performance gap between CDLM and BERT is quite small, with both needing at least three-fourths as many comparisons as Lemma and, crucially, outperforming the Random baseline. CDLM's inconsistent performance on GVC suggests that a corpus-fine-tuned model like CDLM is more effective when applied to a dataset similar to the one it was trained on.

Efficiency and Generalizability of BERT: BERT offers a compelling advantage in terms of efficiency, as it can be run in low-compute settings. Moreover, BERT exhibits greater out-of-the-box generalizability when comparing its performance across the ECB+ and GVC datasets. This makes it an attractive option for the ECR annotation task, especially when compute resources are limited or when working with diverse corpora.

Conclusion
We introduced a model-in-the-loop method for annotating ECR links. We compared three ranking models through a novel evaluation methodology that answers key questions regarding the quality of the model in the annotation loop (namely, recall and effort). Overall, our analysis demonstrates the viability of the models, with CDLM exhibiting the best performance on the ECB+ dataset, followed by BERT and Lemma. The choice of ranking method depends on the specific use case, dataset, and resource constraints, but all three methods offer valuable solutions for different scenarios.

Limitations
It is important to note that the approaches presented in this paper have several constraints. First, the methods are restricted to English, as Lemma requires a lemmatizer, and BERT and CDLM rely on models trained exclusively on English corpora. Second, the CDLM model demands at least a single GPU, posing potential accessibility issues. Third, ECR annotation is susceptible to errors and severe disagreements among annotators, which could entail multiple iterations before achieving gold-standard quality. Lastly, the generated corpora may be biased toward the model used during the annotation process, particularly for smaller values of k.

Ethics Statement
We use publicly available datasets, meaning any bias or offensive content in those datasets risks being reflected in our results. By its nature, the Gun Violence Corpus contains violent content that may be troubling for some.

B.1 CDLM

CDLM uses a modified Longformer (Beltagy et al., 2020) as its underlying LM, which allows it to encode much longer documents at fine-tuning than are usually seen in coreference corpora like ECB+. As seen in Fig. 3, apart from the document-separator tokens like <doc-s> and <doc-s/> that help contextualize each document in a pair, it adds two special tokens (<m> and </m>) to the model vocabulary during pretraining to achieve a greater level of contextualization of a document pair while attending to the event triggers globally at fine-tuning. Apart from the event-trigger words, the fine-tuned CDLM model also applies the global attention mechanism to the [CLS] token, resulting in a more refined embedding for that document pair while maintaining linearity in the transformer's self-attention.

B.2 BERTScore
BERTScore is an easy-to-use, low-compute scoring metric that can be used to evaluate NLP tasks requiring semantic-similarity matching. This task-agnostic metric uses a base language model like BERT to generate token embeddings and leverages the entire sub-word-tokenized reference and candidate sentences (x and x̂ in Fig. 4) to calculate the pairwise cosine similarity between the sentence pair. It uses a greedy-matching subroutine to maximize the similarity scores while normalizing the generated scores based on the IDF (Inverse Document Frequency) of the sub-tokens, thereby resulting in more human-readable scores. The IDF weighting accounts for rare-word occurrences in sentence pairs, which are usually more indicative of how semantically similar such pairs are. In our experiments, we use the distilbert-base-uncased model to obtain the pairwise coreference scores, consistent with our goal of deploying an annotation workflow suitable for resource-constrained settings. Such lighter, 'distilled' encoders allow us to optimize resources at inference with minimal loss in performance.
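For reference, the IDF-weighted greedy matching described above corresponds to BERTScore's published definition (written here in the metric's own notation, not this paper's):

R_{BERT} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i)\, \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}, \qquad
P_{BERT} = \frac{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j)\, \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j)}, \qquad
F_{BERT} = 2\,\frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}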

C λ Hyper-parameter Tuning
We employ the evaluation methodology detailed in §4 to determine the optimal value of λ (the weight balancing trigger similarity and sentence similarity) for both the BERT and Lemma approaches. By conducting incremental annotation simulations on the development sets of ECB+ and GVC, we assess λ values ranging from 0 to 1. The recall-effort curve is plotted for each λ value, as shown in Figure 5, allowing us to identify the value that consistently achieves the highest recall with the fewest comparisons. Remarkably, the optimal value is found to be 0.7 for both methods and both datasets.

D Annotation Interface using Prodigy
Figure 6 illustrates the interface design of the annotation methodology in the popular model-in-the-loop annotation tool Prodigy (prodi.gy). We use this tool for the simplicity it offers in plugging in the various ranking methods described above. The recipe for plugging the methods into the tool, along with other experiment code, is available at github.com/ahmeshaf/model_in_coref.


Figure 2: Recall and Comparisons achieved upon varying k for each ranking method in the ECR annotation simulation. The three methods result in significantly fewer comparisons than the no-ranking Random baseline.

Figure 4: Illustration of the BERTScore computation between a reference sentence x and a candidate sentence x̂, where token embeddings are matched greedily by cosine similarity and weighted by IDF, e.g., R_BERT = ((0.713 × 1.27) + (0.515 × 7.94) + ...) / (1.27 + 7.94 + 1.82 + 7.90 + 8.88).

Figure 5: Trigger and sentence similarity weight (λ) hyper-parameter tuning on the development sets of ECB+ and GVC. We deduce that λ = 0.7 is optimal for both methods on both datasets.

Figure 6: The model-in-the-loop ECR annotation using the Prodigy annotation tool. The target event is on the left and the candidate cluster is on the right.