Verb Knowledge Injection for Multilingual Event Processing

Linguistic probing of pretrained Transformer-based language models (LMs) has revealed that they encode a range of syntactic and semantic properties of a language. However, they are still prone to falling back on superficial cues and simple heuristics when solving downstream tasks, rather than leveraging deeper linguistic information. In this paper, we target a specific facet of linguistic knowledge, the interplay between verb meaning and argument structure. We investigate whether injecting explicit information on verbs’ semantic-syntactic behaviour improves the performance of pretrained LMs in event extraction tasks, where accurate verb processing is paramount. Concretely, we impart the verb knowledge from curated lexical resources into dedicated adapter modules (verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate 1) zero-shot language transfer with multilingual Transformers and 2) transfer via (noisy automatic) translation of English verb-based lexical knowledge. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when relying on noisily translated lexical knowledge.


Introduction
Large Transformer-based encoders, pretrained with self-supervised language modeling (LM) objectives, form the backbone of state-of-the-art models for most NLP tasks (Devlin et al., 2019; Yang et al., 2019b). Recent probing studies have shown that they implicitly extract a non-negligible amount of linguistic knowledge from text corpora in an unsupervised fashion (Hewitt and Manning, 2019; Vulić et al., 2020; Rogers et al., 2020, inter alia).
In downstream tasks, however, they often rely on spurious correlations and superficial cues (Niven and Kao, 2019) rather than a deep understanding of language meaning (Bender and Koller, 2020), which is detrimental to both generalisation and interpretability (McCoy et al., 2019).
In this work, we focus on a specific facet of linguistic knowledge: reasoning about events.1 Identifying tokens in text that mention events and classifying the temporal and causal relations among them is crucial for understanding the structure of a story or dialogue (Carlson et al., 2002; Miltsakaki et al., 2004) and for grounding a text in real-world facts.
Verbs (with their arguments) are prominently used for expressing events (with their participants). Thus, fine-grained knowledge about verbs, e.g., the syntactic patterns in which they participate and the semantic frames they evoke, may help pretrained encoders achieve a deeper understanding of text and improve their performance in event-oriented downstream tasks. There already exist expert-curated computational resources that organise verbs into classes based on their syntactic-semantic properties (Jackendoff, 1992; Levin, 1993). In particular, here we consider English VerbNet and FrameNet as rich sources of verb knowledge.
Expanding a line of research on injecting external linguistic knowledge into pretrained LMs (Peters et al., 2019; Levine et al., 2020; Lauscher et al., 2020b), we integrate verb knowledge into the LMs for the first time. We devise a new method to distil verb knowledge into dedicated adapter modules (Pfeiffer et al., 2020b), which reduce the risk of (catastrophic) forgetting of the distributional knowledge acquired in pretraining and allow the injected verb knowledge to be combined with it in a seamless, modular fashion in downstream tasks.
We hypothesise that complementing pretrained LMs with verb knowledge should benefit model performance in downstream tasks that involve event extraction and processing. We first put this hypothesis to the test in English monolingual event identification and classification tasks from the TempEval (UzZaman et al., 2013) and ACE (Doddington et al., 2004) datasets. We report modest but consistent improvements in the former, and significant performance boosts in the latter, thus verifying that verb knowledge is indeed paramount for a deeper understanding of events and their structure.
Moreover, expert-curated resources are not available for most of the languages spoken worldwide. Therefore, we also investigate the effectiveness of transferring verb knowledge across languages; in particular, from English to Spanish, Arabic and Chinese. The results demonstrate the success of the transfer techniques, and also shed some light on an important linguistic question: to what extent can verb classes (and predicate-argument structures) be considered cross-lingually universal, rather than varying across languages (Hartmann et al., 2013)?
Overall, our main contributions consist in 1) mitigating the limitations of pretrained encoders regarding event understanding by supplying external verb knowledge; 2) proposing a new method to do so in a modular way through verb adapters; 3) exploring techniques to transfer verb knowledge to resource-poor languages. The performance gains across four diverse languages and several event processing tasks and datasets validate that complementing distributional knowledge with curated verb knowledge is both beneficial and cost-effective.
Verb Knowledge for Event Processing

Figure 1 illustrates our framework for injecting verb knowledge from VerbNet or FrameNet and leveraging it in downstream event processing tasks. First, we inject the external verb knowledge, formulated as so-called lexical constraints (Ponti et al., 2019) (in our case, verb pairs; see §2.1), into a (small) additional set of adapter parameters (§2.2) (Houlsby et al., 2019). Second (§2.3), we combine the language knowledge encoded in the original LM parameters and the verb knowledge from the verb adapters for event processing tasks. To this end, we either a) fine-tune both sets of parameters (1. pretrained LM; 2. verb adapters) or b) freeze both sets of parameters and insert an additional set of task-specific adapter parameters. In both cases, the task-specific training is informed both by the general language knowledge captured in the pretrained LM and by the specialised verb knowledge captured in the verb adapters.

Figure 1 (caption, excerpt): 2) Fine-tuning for an event processing task: a) full fine-tuning: the LM's original parameters and the verb adapters are both fine-tuned for the task; b) task adapter (TA) fine-tuning: an additional task adapter is mounted on top of the verb adapter and tuned for the task. For simplicity, we show only a single Transformer layer. Snowflakes denote frozen parameters in the respective training step.

Sources of Verb Lexical Knowledge
Given the inter-connectedness between verbs' meaning and syntactic behaviour (Levin, 1993; Kipper Schuler, 2005), we assume that refining latent representation spaces with verb knowledge would have a positive effect on event extraction tasks that strongly revolve around verbs. Lexical classes, defined in terms of verbs' shared semantic-syntactic properties, provide a mapping between the verbs' senses and the morpho-syntactic realisation of their arguments (Jackendoff, 1992; Levin, 1993). The potential of verb classifications lies in their predictive power: for any given verb, a set of rich semantic-syntactic properties can be inferred based on its class membership. In this work, we explicitly harness this rich linguistic knowledge to aid pretrained LMs in capturing regularities in the properties of verbs and their arguments. We select two major English lexical databases, VerbNet (Kipper Schuler, 2005) and FrameNet (Baker et al., 1998), as sources of verb knowledge at the semantic-syntactic interface, each representing a different lexical framework.
VerbNet (VN) (Kipper Schuler, 2005; Kipper et al., 2006), the largest available verb-focused lexicon, organises verbs into classes based on the overlap in their semantic properties and syntactic behaviour; it builds on the premise that a verb's predicate-argument structure informs its meaning (Levin, 1993). Each entry provides a set of thematic roles and selectional preferences for the verbs' arguments; it also lists the syntactic contexts characteristic of the class members. Its hierarchical classification starts from broader classes and spans several granularity levels, where each subclass further refines the semantic-syntactic properties inherited from its parent class.2 The VN class membership is English-specific, but the underlying verb class construction principles are thought to apply cross-lingually (Jackendoff, 1992; Levin, 1993); its translatability has been indicated in previous work (Majewska et al., 2018). The current English VN contains 329 main classes.

2 For example, within the top-level class 'free-80', which includes verbs like liberate, discharge, and exonerate that participate in a NP V NP PP.THEME frame (e.g., It freed him of guilt), there exists a subset of verbs participating in a syntactic frame NP V NP S_ING ('free-80-1'), within which there exists an even more constrained subset of verbs appearing with prepositional phrases headed specifically by the preposition from (e.g., The scientist purified the water from bacteria).
FrameNet (FN) (Baker et al., 1998) is more semantically oriented than VN. Grounded in the theory of frame semantics (Fillmore, 1976, 1977, 1982), it organises concepts according to the semantic frames they evoke, i.e., schematic representations of situations and events, each characterised by a set of typical roles assumed by its participants. The word senses associated with each frame (FN's lexical units) are similar in terms of their semantic content, as well as their typical argument structures. Currently, English FN covers 1,224 frames and its annotations illustrate the typical syntactic realisations of the frame elements. Frames themselves are, however, semantically defined: this means that they may be shared even across languages with different syntactic properties.3

3 For instance, descriptions of transactions will include the same frame elements Buyer, Seller, Goods, Money in most languages. Indeed, English FN has inspired similar projects in other languages, e.g., Spanish (Subirats and Sato, 2004), Japanese (Ohara, 2012), and Danish (Bick, 2011).

Training Verb Adapters
Training Task and Data Generation. In order to inject external verb knowledge into pretrained LMs, we devise an intermediary training task: we train a dedicated VN-/FN-knowledge adapter (hereafter VN-Adapter and FN-Adapter). We frame the task as binary word-pair classification: we predict whether two verbs belong to the same VN class or FN frame. We extract training instances from FN and VN independently. This allows for a separate analysis of the impact of verb knowledge from each resource.
We generate positive training instances by extracting all unique verb pairings from the set of members of each main VN class/FN frame (e.g., walk-march), resulting in 181,882 instances created from VN and 57,335 from FN. We then generate k = 3 negative examples per positive example by combining controlled and random sampling. In controlled sampling, we follow prior work on semantic specialisation (Wieting et al., 2015; Glavaš and Vulić, 2018b; Lauscher et al., 2020b). For each positive example p = (w_1, w_2) in the training batch B, we create two negatives p̂_1 = (ŵ_1, w_2) and p̂_2 = (w_1, ŵ_2); ŵ_1 is the verb from batch B other than w_1 that is closest to w_2 in terms of cosine similarity in an auxiliary static word embedding space X_aux ∈ R^d; conversely, ŵ_2 is the verb from B other than w_2 closest to w_1. We additionally create one negative instance p̂_3 = (ŵ_1, ŵ_2) by randomly sampling ŵ_1 and ŵ_2 from batch B, excluding w_1 and w_2. We ensure that the negatives are not present in the global set of all positive verb pairs. Similar to Lauscher et al. (2020b), we tokenise each (positive and negative) training instance into WordPiece tokens, prepended with the sequence start token [CLS] and with [SEP] tokens in between the verbs and at the end of the input sequence. We use the representation of the [CLS] token x_CLS ∈ R^h (with h as the hidden state size of the Transformer) from the last Transformer layer as the latent representation of the verb pair and feed it to a simple binary classifier:4 ŷ = softmax(x_CLS W_cl + b_cl), with W_cl ∈ R^{h×2} and b_cl ∈ R^2 as the classifier's trainable parameters. We train by minimising the standard cross-entropy loss (L_VERB in Figure 1).
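To make the data generation concrete, the following is a minimal Python sketch of positive pair extraction and controlled negative sampling, assuming a mapping from VN classes/FN frames to their member verbs (verb_classes) and an auxiliary static embedding lookup (aux_emb); all identifiers are illustrative, not taken from the original implementation.

import itertools
import random

import numpy as np

def positive_pairs(verb_classes):
    # verb_classes: dict mapping a VN class / FN frame id to its member verbs
    pairs = set()
    for members in verb_classes.values():
        for w1, w2 in itertools.combinations(sorted(set(members)), 2):
            pairs.add((w1, w2))
    return pairs

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def controlled_negatives(batch, all_positives, aux_emb):
    # For each positive (w1, w2) in the batch, create p̂1 = (ŵ1, w2) and p̂2 = (w1, ŵ2)
    # via in-batch nearest neighbours in the auxiliary embedding space, plus one random
    # in-batch pair p̂3, discarding any candidate that is a known positive pair.
    batch_verbs = {w for pair in batch for w in pair}
    negatives = []
    for w1, w2 in batch:
        w1_hat = max((w for w in batch_verbs if w != w1),
                     key=lambda w: cosine(aux_emb[w], aux_emb[w2]))
        w2_hat = max((w for w in batch_verbs if w != w2),
                     key=lambda w: cosine(aux_emb[w], aux_emb[w1]))
        others = list(batch_verbs - {w1, w2})
        rand_pair = tuple(random.sample(others, 2)) if len(others) >= 2 else None
        for neg in ((w1_hat, w2), (w1, w2_hat), rand_pair):
            if neg and neg not in all_positives and neg[::-1] not in all_positives:
                negatives.append(neg)
    return negatives

Each resulting positive or negative verb pair would then be tokenised into the [CLS] verb1 [SEP] verb2 [SEP] format described above before being fed to the classifier.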
Adapter Architecture. Instead of directly fine-tuning all parameters of the pretrained Transformer, we opt for storing verb knowledge in a separate set of adapter parameters, keeping it separate from the general language knowledge acquired in pretraining. This (1) allows downstream training to flexibly combine the two sources of knowledge, and (2) bypasses the issues of catastrophic forgetting and interference (Hashimoto et al., 2017; de Masson d'Autume et al., 2019).
We adopt the standard efficient adapter architecture of Pfeiffer et al. (2020a,c). In each Transformer layer l, we insert a single adapter (Adapter_l) after the feed-forward sub-layer. The adapter itself is a two-layer feed-forward neural network with a residual connection, consisting of a down-projection D ∈ R^{h×m}, a GeLU activation (Hendrycks and Gimpel, 2016), and an up-projection U ∈ R^{m×h}, where h is the hidden size of the Transformer model and m is the dimensionality of the adapter: Adapter_l(h_l, r_l) = U_l(GeLU(D_l(h_l))) + r_l, where r_l is the residual connection, i.e., the output of the Transformer's feed-forward layer, and h_l is the Transformer hidden state, i.e., the output of the subsequent layer normalisation.
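As an illustration, a minimal PyTorch sketch of such a bottleneck adapter is given below; the module and argument names are our own and this is not the exact Pfeiffer et al. (2020a,c) implementation.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-projection D ∈ R^{h×m}, GeLU, up-projection U ∈ R^{m×h},
    # plus a residual connection to the output of the Transformer's feed-forward sub-layer.
    def __init__(self, hidden_size: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_dim)
        self.up = nn.Linear(adapter_dim, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_state: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        # Adapter_l(h_l, r_l) = U_l(GeLU(D_l(h_l))) + r_l
        return self.up(self.act(self.down(hidden_state))) + residual

One such module would be inserted after the feed-forward sub-layer of each Transformer layer, with hidden_state set to the post-layer-norm output and residual to the feed-forward output, as described above.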

Downstream Fine-Tuning for Event Tasks
The next step is downstream fine-tuning for event processing tasks. We experiment with (1) token-level event trigger identification and classification and (2) span extraction for event triggers and arguments (a sequence labeling task); see §3. For the former, we mount a classification head (a simple single-layer feed-forward softmax regression classifier) on top of the Transformer augmented with VN-/FN-Adapters. For the latter, we follow the architecture from prior work (M'hamdi et al., 2019; Wang et al., 2019) and add a CRF layer (Lafferty et al., 2001) on top of the sequence of the Transformer's outputs (for subword tokens).
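For concreteness, a possible sketch of the token-level classification head is shown below: a single linear layer over the Transformer's per-token outputs. The span-extraction variant would additionally pass these per-token scores as emissions to a CRF layer (e.g., from the pytorch-crf package). Class and argument names here are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    # Single-layer feed-forward softmax classifier over per-token encoder outputs.
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: [batch, seq_len, hidden]; returns per-token log-probabilities
        return torch.log_softmax(self.classifier(token_states), dim=-1)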
For all tasks, we propose and evaluate two different fine-tuning regimes: (1) full fine-tuning, where we update both the original Transformer parameters and the VN-/FN-Adapters (see 2a in Figure 1); and (2) task-adapter (TA) fine-tuning, where we keep both the Transformer's original parameters and the VN-/FN-Adapters frozen, while stacking a new trainable task adapter on top of the VN-/FN-Adapter in each Transformer layer (see 2b in Figure 1).
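The two regimes can be summarised with the following sketch, assuming an encoder object whose layers expose the mounted verb adapter and a slot for a task adapter; the attribute names (encoder.layers, layer.task_adapter) are hypothetical.

def configure_full_finetuning(encoder):
    # (1) Full fine-tuning: the original Transformer parameters and the
    # VN-/FN-Adapters are all updated during task training.
    for param in encoder.parameters():
        param.requires_grad = True

def configure_task_adapter_finetuning(encoder, make_task_adapter):
    # (2) TA fine-tuning: freeze everything that is already trained, then stack a
    # fresh, trainable task adapter on top of the verb adapter in every layer.
    for param in encoder.parameters():
        param.requires_grad = False
    for layer in encoder.layers:
        layer.task_adapter = make_task_adapter()  # only these new parameters are trained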

Cross-Lingual Transfer
Creation of curated resources like VN or FN takes years of expert linguistic labour. Consequently, such resources do not exist for the vast majority of languages. Given the inherent cross-lingual nature of verb classes and semantic frames (see §2.1), we investigate the potential for verb knowledge transfer from English to target languages, without any manual target-language adjustments. Massively multilingual LMs, such as multilingual BERT (mBERT) (Devlin et al., 2019) or XLM-R (Conneau et al., 2020), have become the de facto standard mechanisms for zero-shot (ZS) cross-lingual transfer. In our first transfer approach, we fine-tune mBERT first on the English verb knowledge, then on the English task data, and then simply make task predictions for the target-language input. The second approach, dubbed VTRANS, is inspired by work on cross-lingual transfer of semantic specialisation for static word embeddings (Glavaš et al., 2019; Ponti et al., 2019; Wang et al., 2020b). In brief (with full details in Appendix C), starting from a set of positive pairs from English VN/FN, VTRANS involves three steps: (1) automatic translation of the verbs in each pair into the target language, (2) filtering of the noisy target-language pairs by means of a transferred relation prediction model trained on the English examples, and (3) training the verb adapters injected into the pretrained model, now with the translated and filtered target-language verb pairs. For the monolingual target-language FN-/VN-Adapter training, we follow the protocol used for English (see §2.2).
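Schematically, the VTRANS procedure can be summarised as follows; the three callables stand in for the translation, filtering, and adapter-training steps and are hypothetical placeholders rather than the actual implementation.

def vtrans(english_positive_pairs, translate, keep_pair, train_adapter):
    # Step 1: translate each English verb via its nearest neighbour in a shared
    # cross-lingual word embedding space.
    translated = [(translate(w1), translate(w2)) for w1, w2 in english_positive_pairs]
    # Step 2: filter the noisy target-language pairs with a relation prediction
    # model trained on the English examples.
    filtered = [pair for pair in translated if keep_pair(pair)]
    # Step 3: train the target-language verb adapter on the filtered pairs,
    # following the same protocol as for English (§2.2).
    return train_adapter(filtered)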

Experimental Setup
Event Processing Tasks and Data. In event processing tasks, systems are tasked with detecting that something happened, identifying what type of occurrence took place, as well as what entities were involved. Verbs typically act as the organisational core of each such event schema, carrying a lot of semantic and structural weight. Therefore, a model's grasp of verbs' properties should have a bearing on final task performance. Based on this assumption, we select event extraction and classification as suitable tasks to profile the methods from §2.
These tasks and the corresponding data are based on two prominent frameworks for annotating event expressions: TimeML (Pustejovsky et al., 2003, 2005) and the Automatic Content Extraction (ACE) framework (Doddington et al., 2004). First, we rely on the TimeML-annotated corpus from the TempEval tasks (Verhagen et al., 2010; UzZaman et al., 2013), which target the automatic identification of temporal expressions, temporal relations, and events. Second, we use the ACE dataset: it provides annotations for entities, the relations between them, and the events in which they participate in newswire text.5

Task 1: Trigger Identification and Classification (TempEval). We frame the first event processing task as a token-level classification problem: predicting whether a token triggers an event and assigning it to one of the following event types: OCCURRENCE (e.g., died, attacks), STATE (e.g., share, assigned), REPORTING (e.g., announced, said), I-ACTION (e.g., agreed, trying), I-STATE (e.g., understands, wants, consider), ASPECTUAL (e.g., ending, began), and PERCEPTION (e.g., watched, spotted).6 We use the TempEval-3 data for English and Spanish (UzZaman et al., 2013), and the TempEval-2 data for Chinese (Verhagen et al., 2010) (see Table 6 in the appendix for exact dataset sizes).

5 We provide more details about the frameworks and their corresponding annotation schemes in Appendix A.

6 E.g., in the sentence "The rules can also affect small businesses, which sometimes pay premiums tied to employees' health status and claims history.", affect and pay are event triggers of type STATE and OCCURRENCE, respectively.
Task 2: Trigger and Argument Identification and Classification (ACE). In this sequence labeling task, we detect and label event triggers and their arguments, with four individually scored subtasks: (i) trigger identification, where we identify the key word conveying the nature of the event; (ii) trigger classification, where we classify the trigger word into one of the predefined categories; (iii) argument identification, where we predict whether an entity mention is an argument of the event identified in (i); and (iv) argument classification, where the correct role needs to be assigned to the identified event arguments. We use the ACE data available for English, Chinese, and Arabic.7

Event extraction as specified in these two frameworks is a challenging, highly context-sensitive problem, where different words (most often verbs) may trigger the same type of event and, conversely, the same word (verb) can evoke different types of event schemata depending on the context. Adopting these tasks for evaluation thus tests whether leveraging fine-grained curated knowledge of verbs' semantic-syntactic behaviour can improve pretrained LMs' reasoning about event-triggering predicates and their arguments.

7 The ACE annotations distinguish 34 trigger types (e.g., Business:Merge-Org, Justice:Trial-Hearing, Conflict:Attack) and 35 argument roles. Following previous work (Hsi et al., 2016), we conflate eight time-related argument roles, e.g., 'Time-At-End', 'Time-Before', 'Time-At-Beginning', into a single 'Time' role in order to alleviate training data sparsity.
Model Configurations. For each task, we compare the performance of the underlying "vanilla" BERT-based model (see §2.3) against its variant with an added VN-Adapter or FN-Adapter8 (see §2.2) in two regimes: (a) full fine-tuning, and (b) task adapter (TA) fine-tuning (see Figure 1). To ensure that any performance gains are not merely due to the increased parameter capacity offered by the adapters, we also evaluate a variant in which we replace the verb adapter with a randomly initialised adapter of the same size (+Random). Additionally, we examine the impact of increasing the capacity of the trainable task adapter by replacing it with a 'Double Task Adapter' (2TA), i.e., a task adapter with double the number of trainable parameters compared to the base architecture from §2.2. Finally, we compare the VN-/FN-Adapter approach with a computationally more expensive alternative method of injecting external verb knowledge, sequential fine-tuning, where the full BERT is first fine-tuned on the FN/VN data (as in §2.2) and then on the task (see Appendix D for details).

Downstream Task Fine-Tuning. In downstream fine-tuning on TempEval, we train for 10 epochs in batches of size 32, with a learning rate of 1e-4 and a maximum input sequence length of T = 128 WordPiece tokens. For ACE, in light of greater data sparsity,9 we search for the optimal hyperparameters for each language and evaluation setup over the following grid: learning rate l ∈ {1e-5, 1e-6}, epochs n ∈ {3, 5, 10, 25, 50}, batch size b ∈ {8, 16} (maximum input sequence length T = 128).
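The ACE hyperparameter search can be pictured as a simple grid sweep; run_finetuning below is a hypothetical stand-in for one training-and-evaluation run and is not part of the original codebase.

import itertools

LEARNING_RATES = [1e-5, 1e-6]
EPOCHS = [3, 5, 10, 25, 50]
BATCH_SIZES = [8, 16]
MAX_SEQ_LEN = 128

def grid_search(run_finetuning):
    # Returns the best dev score and the hyperparameters that produced it.
    best_score, best_config = None, None
    for lr, n_epochs, batch_size in itertools.product(LEARNING_RATES, EPOCHS, BATCH_SIZES):
        score = run_finetuning(lr=lr, epochs=n_epochs,
                               batch_size=batch_size, max_seq_len=MAX_SEQ_LEN)
        if best_score is None or score > best_score:
            best_score = score
            best_config = dict(lr=lr, epochs=n_epochs, batch_size=batch_size)
    return best_score, best_config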
Transfer Experiments in zero-shot (ZS) setups are based on mBERT, to which we add the VN- or FN-Adapter trained on the English VN/FN data. We train the model on the English training data available for each task and evaluate it on the target-language test set. For the VTRANS approach (§2.4), we use language-specific BERT models available for our target languages and leverage target-language adapters trained on translated and automatically refined verb pairs. The model, with or without the target-language VN-/FN-Adapter, is trained and evaluated on the training and test data available in that language. We carry out this procedure for three target languages (see Table 1). We use the same negative sampling parameter configuration that proved strongest in our English experiments (k = 3 [ccr]).

Results and Discussion
English Event Processing. Table 2 shows the performance on English Task 1 (TempEval) and Task 2 (ACE). First, we note that the computationally more efficient setup with a dedicated task adapter (TA) yields higher absolute scores than full fine-tuning (FFT) on TempEval. When the underlying BERT is frozen along with the added FN-/VN-Adapter, the TA is forced to encode additional task-specific knowledge into its parameters, beyond what is provided in the verb adapter. This yields the two strongest results overall, both from the +FN/VN setups. On ACE, the primacy of TA-based training is overturned in favour of FFT. Encouragingly, the boosts provided by verb adapters are visible regardless of the chosen task fine-tuning regime. We notice consistent, statistically significant10 improvements in the +VN setup, although the performance of the TA-based setups clearly suffers in argument (ARG) tasks due to the decreased trainable parameter capacity. The lack of visible improvements from the Random Adapter supports the interpretation that the performance gains indeed stem from the added useful, 'non-random' signal in the verb adapters. In addition, we verify how our principal setup with added adapter modules compares to an alternative established approach, sequential fine-tuning (+FN/VN_seq). On TempEval, we note that fine-tuning all model parameters on VN/FN data allows the model to retrieve more additional verb knowledge beneficial for task performance than adding smaller pretrained adapters on top of the underlying model. However, the +FN/VN_seq scores are still inferior to the results achieved in the TA-based +FN/VN setup. On ACE, the +FN/VN_seq results in trigger tasks are weaker than those achieved through the addition of self-contained knowledge adapters; however, they offer additional boosts in argument tasks.
Multilingual Event Processing. Table 3 compares the performance of zero-shot (ZS) transfer and monolingual target-language training (via VTRANS) on TempEval in Spanish and Chinese. For both languages, the addition of the FN-Adapter in the TA-based setup boosts ZS transfer. The benefits extend to the FFT setup in Chinese, achieving the top score overall.
In the monolingual evaluation, we observe consistent gains from the transferred knowledge added via VTRANS in Spanish. In Chinese, performance boosts come from the transferred VN-style class membership information (+VN). This suggests that even noisily translated verb pairs carry enough useful signal through to the target language. To tease apart the contributions of the language-specific encoders and of the transferred verb knowledge, we carry out an additional monolingual evaluation in which we substitute the target-language BERT with mBERT, trained on the (noisy) target-language verb signal (ES-MBERT/ZH-MBERT). Although the mBERT scores are lower than those of the monolingual BERTs in absolute terms, the transferred verb knowledge helps reduce the gap between the models, with gains achieved over the baselines in Spanish.11 In ACE, the top scores are achieved in the monolingual FFT setting; as with English, keeping the full capacity of BERT parameters unfrozen noticeably helps performance.12 In Arabic, FN knowledge provides performance boosts across all four tasks and with both the zero-shot (ZS) and the monolingual (VTRANS) transfer approach, whereas the addition of the VN-Adapter boosts scores in ARG tasks. The usefulness of FN knowledge extends to zero-shot transfer in Chinese, and both adapters benefit the ARG tasks in the monolingual (VTRANS) transfer setup. Notably, in zero-shot transfer, we observe that the highest scores are achieved with task adapter (TA) fine-tuning, where the inclusion of the verb adapters offers additional gains. Overall, however, the argument tasks elude the restricted capacity of the TA-based setup, with very low scores.

Table 4: Results on Arabic and Chinese ACE test sets for full fine-tuning (FFT) and the task adapter (TA) setup, for zero-shot (ZS) transfer with mBERT and the VTRANS transfer approach with language-specific BERT (AR-BERT / ZH-BERT) and FN/VN adapters trained on noisily translated verb pairs (§2.4). F1 scores averaged over 5 runs.
Additionally, in Appendix E we show the results with sequential fine-tuning. Similarly to our EN results (Table 2), we observe advantages of using the full capacity of BERT parameters to encode verb knowledge in most setups in TempEval, while the comparison to the adapter-based approach is less clear-cut on ACE. In sum, sequential fine-tuning is a strong verb knowledge injection variant; however, it is computationally more expensive and less portable. The modular and efficient adapter-based approach therefore presents an attractive alternative, while offering competitive task performance. Crucially, the strong results from the sequential setup further corroborate our core finding that external lexical verb information is indeed beneficial for event processing tasks across the board.

Zero-shot Transfer vs Monolingual Training.
The results reveal a considerable gap between the performance of ZS transfer and monolingual fine-tuning. The event extraction tasks pose a significant challenge to zero-shot transfer via mBERT; however, mBERT exhibits much more robust performance in the monolingual setup, with available target-language training data for the event tasks. In the latter, mBERT trails the language-specific BERTs by less than 5 points (Table 3). This is encouraging, given that monolingual pretrained LMs currently exist only for a small set of high-resource languages. For all other languages, should there be language-specific event task data, one can leverage mBERT. Moreover, mBERT's performance is further improved by the inclusion of transferred verb knowledge via VTRANS: in Spanish, whose typological closeness to English renders direct transfer of semantic-syntactic information viable, the addition of VTRANS-based verb adapters yields significant gains in both the FFT and the TA setup.13 These results confirm the effectiveness of lexical knowledge transfer suggested previously in work on semantic specialisation of static word vectors (Ponti et al., 2019; Wang et al., 2020b).

13 We noted analogous positive effects on the performance of the more powerful XLM-R Large model (Appendix E).
Double Task Adapter. Promisingly, we see in Table 5 that the relative performance gains from FN/VN adapters are preserved regardless of the added trainable task adapter capacity. As expected, the increased task adapter size helps argument tasks in ACE, where verb adapters produce additional gains. Overall, this suggests that verb adapters indeed encode additional, non-redundant information beyond what is offered by the pretrained model alone, and boost the dedicated task adapter.
Cleanliness of Verb Knowledge. Despite the promising results of the VTRANS approach, there are still fundamental limitations: (1) noisy translation based on cross-lingual semantic similarity may already break the VerbNet class membership alignment; and (2) verb classes are language-specific and thus cannot be directly ported to another language without adjustments.14 The fine-grained class divisions and exact class membership in VN may be too English-specific to allow direct automatic translation. In contrast, the semantically driven FrameNet lends itself better to cross-lingual transfer: we report higher average gains in cross-lingual setups with the FN-Adapter.

14 This is in contrast to the proven cross-lingual portability of synonymy and antonymy relations shown in previous work on semantic specialisation transfer (Ponti et al., 2019), which relies on semantics alone.
To quickly verify whether the noisy direct transfer curbs the usefulness of the injected knowledge, we evaluate the injection of clean verb knowledge from a small lexical resource available in Spanish: we train an ES FN-Adapter on top of ES-BERT on

Related Work
Event Extraction. The cost and complexity of event annotation requires robust transfer solutions capable of making fine-grained predictions in the face of data scarcity. Traditional event extraction methods relied on hand-crafted, language-specific features such as POS tags and entity knowledge (Ahn, 2006; Gupta and Ji, 2009; Llorens et al., 2010; Hong et al., 2011; Li et al., 2013; Glavaš and Šnajder, 2015), which limited their generalisation ability and effectively prevented language transfer. More recent approaches commonly resorted to word embedding input and neural text encoders such as recurrent nets (Nguyen et al., 2016; Duan et al., 2017; Sha et al., 2018) and convolutional nets (Chen et al., 2015; Nguyen and Grishman, 2015), as well as graph neural networks (Nguyen and Grishman, 2018; Yan et al., 2019) and adversarial networks (Hong et al., 2018; Zhang et al., 2019). The most recent empirical advancements in event trigger and argument extraction stem from fine-tuning LM-pretrained Transformer networks (Yang et al., 2019a; Wang et al., 2019; M'hamdi et al., 2019; Wadden et al., 2019; Liu et al., 2020).
Limited training data nonetheless remains an obstacle, especially when facing previously unseen event types. Such data scarcity issues have been addressed through data augmentation, i.e., automatic data annotation (Chen et al., 2017; Zheng, 2018; Araki and Mitamura, 2018) and bootstrapping for training data generation (Ferguson et al., 2018; Wang et al., 2019). The recent release of the large English event detection dataset MAVEN (Wang et al., 2020c), with annotations of event triggers only, partially remedies the English data scarcity. MAVEN also demonstrates that even state-of-the-art Transformer models fail to yield satisfying event detection performance in the general domain. The fact that datasets of similar size cannot realistically be expected for other event extraction tasks, and especially for other languages, only emphasises the need for external event-related knowledge and transfer learning approaches such as the ones introduced in this work.
Joint specialisation models (Yu and Dredze, 2014; Lauscher et al., 2020b; Levine et al., 2020, inter alia) train the representation space from scratch on a large corpus, but augment the self-supervised training objective with an additional objective based on external lexical constraints. Lauscher et al. (2020b) add to the Masked LM (MLM) and next sentence prediction (NSP) pretraining objectives of BERT (Devlin et al., 2019) an objective that predicts pairs of (near-)synonyms, aiming to improve word-level semantic similarity in BERT's representation space. In a similar vein, Levine et al. (2020) add an objective that predicts WordNet supersenses. While joint specialisation models allow the external knowledge to shape the representation space from the very beginning of the distributional training, this also means that any change in the lexical constraints implies a new, computationally expensive pretraining from scratch.
Retrofitting and post-specialisation methods (Faruqui et al., 2015; Ponti et al., 2018; Glavaš and Vulić, 2019; Lauscher et al., 2020a; Wang et al., 2020a), in contrast, start from a pretrained representation space (a word embedding space or a pretrained encoder) and fine-tune it using external lexico-semantic knowledge. Wang et al. (2020a) fine-tune the pretrained RoBERTa model with lexical constraints obtained automatically via dependency parsing, whereas Lauscher et al. (2020a) use lexical constraints derived from ConceptNet to inject knowledge into BERT: both adopt adapter-based fine-tuning, storing the external knowledge in a separate set of parameters. Our work adopts a similar adapter-based specialisation approach, but focuses on event-oriented downstream tasks and on knowledge from VerbNet and FrameNet.

Conclusion
We investigated the potential of leveraging knowledge about the semantic-syntactic behaviour of verbs to improve the capacity of large pretrained models to reason about events in diverse languages. We proposed an auxiliary pretraining task to inject VerbNet- and FrameNet-based lexical verb knowledge into dedicated verb adapter modules. We demonstrated that state-of-the-art pretrained models still benefit from the gold-standard linguistic knowledge stored in lexical resources, even those with limited coverage. Crucially, we showed that the benefits of this knowledge from resource-rich languages can be extended to other, resource-leaner languages through translation-based transfer of verb class/frame membership information.

A Frameworks for Annotating Event Expressions
Two prominent frameworks for annotating event expressions are TimeML (Pustejovsky et al., 2003, 2005) and the Automatic Content Extraction (ACE) (Doddington et al., 2004). TimeML was developed as a rich markup language for annotating event and temporal expressions, addressing the problems of identifying event predicates and anchoring them in time, determining their relative ordering and temporal persistence (i.e., how long the consequences of an event last), as well as tackling contextually underspecified temporal expressions (e.g., last month, two days ago). Currently available English corpora annotated according to the TimeML scheme include the TimeBank corpus (Pustejovsky et al., 2003), a human-annotated collection of 183 newswire texts (including 7,935 annotated EVENTS, comprising both punctual occurrences and states that extend over time), and the AQUAINT corpus, with 80 newswire documents grouped by the stories they cover, which allows tracing the progress of events through time (Derczynski, 2017). Both corpora, supplemented with a large, automatically TimeML-annotated training corpus, are used in the TempEval-3 task (Verhagen and Pustejovsky, 2008; UzZaman et al., 2013), which targets the automatic identification of temporal expressions, events, and temporal relations.
The ACE dataset provides annotations for entities, the relations between them, and the events in which they participate in newspaper and newswire text. For each event, it identifies its lexical instantiation, i.e., the trigger, and its participants, i.e., the arguments, and the roles they play in the event. For example, an event of type "Conflict:Attack" ("It could swell to as much as $500 billion if we go to war in Iraq."), triggered by the noun "war", involves two arguments, the "Attacker" ("we") and the "Place" ("Iraq"), each of which is annotated with an entity label ("GPE:Nation").

We experimented with n ∈ {10, 15, 20, 30} training epochs, as well as with an early stopping approach using the validation loss on a small held-out validation set as the stopping criterion, with a patience argument p ∈ {2, 5}; we found the adapters trained for the full 30 epochs to perform most consistently across tasks.
The size of the training batch varies based on the number k of negative examples generated from the starting batch B of positive pairs: e.g., by generating k = 3 negative examples for each of 8 positive examples in the starting batch, we end up with a training batch of total size 8 + 3·8 = 32. We experimented with starting batches of size B ∈ {8, 16} and found the configuration k = 3, B = 16 to yield the strongest results (reported in this paper).

C VTRANS: Technical Details
First, we automatically translate the verbs by retrieving their nearest neighbours in the target language from a shared cross-lingual embedding space, aligned using the Relaxed Cross-domain Similarity Local Scaling (RCSLS) model of Joulin et al. (2018). Such a translation procedure is liable to error due to an imperfect cross-lingual embedding space, as well as to polysemy and out-of-context word translation. We mitigate these issues in the second step, where we purify the set of noisily translated target-language verb pairs by means of a neural lexico-semantic relation prediction model, the Specialization Tensor Model (STM) (Glavaš and Vulić, 2018a), here adjusted for binary classification. We train the STM for the same task as the verb adapters during verb knowledge injection (§2.2): to distinguish (positive) verb pairs from the same English VN class/FN frame from those from different VN classes/FN frames. In training, the input to the STM consists of static word embeddings of English verbs taken from a shared cross-lingual word embedding space. We then make predictions in the target language by feeding vectors of target-language verbs (from noisily translated verb pairs), taken from the same cross-lingual word embedding space, as input to the STM. We provide more details on STM training in what follows.
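A minimal sketch of the nearest-neighbour translation step is given below, assuming a dictionary of source-language vectors and a row-normalised matrix of target-language vectors from the shared RCSLS-aligned space; the function and variable names are illustrative assumptions.

import numpy as np

def translate_verb(verb, src_vectors, tgt_matrix, tgt_vocab):
    # src_vectors: dict mapping a source-language verb to its vector;
    # tgt_matrix: [V, d] row-normalised target-language embedding matrix;
    # tgt_vocab: list of V target-language words aligned with tgt_matrix rows.
    v = src_vectors[verb]
    v = v / (np.linalg.norm(v) + 1e-9)
    sims = tgt_matrix @ v  # cosine similarities to all target-language words
    return tgt_vocab[int(np.argmax(sims))]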
STM Training Details. We train the STM using the sets of English positive examples from each lexical resource (Table 1). Negative examples are generated using controlled sampling (see §2.2) with a k = 2 [cc] configuration, ensuring that the generated negatives do not constitute positive constraints in the global set. We use pretrained 300-dimensional static distributional word vectors computed on Wikipedia data with the FASTTEXT model (Bojanowski et al., 2017), cross-lingually aligned using the RCSLS model of Joulin et al. (2018).

D Sequential Fine-tuning Details
In the sequential fine-tuning setup, we first train the full cased variant of the BERT-based model on the VN/FN data. We generate negative examples using the strongest-performing configuration of sampling parameters: k = 3 [ccr]. We train the model for 4 epochs using the Adam algorithm (Kingma and Ba, 2015), a learning rate of 2e-5 with 1000 warmup steps, and a batch size of 64. Next, we fine-tune the VN/FN-pretrained model on the two downstream tasks. For Task 1, we train for 10 epochs in batches of 32, with a learning rate of 1e-4 and a maximum input sequence length of T = 128. For Task 2, we find the optimal hyperparameter configuration for each language-setup combination over the grid: learning rate l ∈ {1e-5, 1e-6}, epochs n ∈ {3, 5, 10, 25, 50}, batch size b ∈ {8, 16}, with a maximum input sequence length of T = 128.

Table 9 presents the results of the monolingual evaluation substituting the monolingual target-language BERT with the massively multilingual encoder, with or without the FN/VN adapters trained on the (noisy) target-language verb signal (AR-MBERT/ZH-MBERT).

Table 7: Results on Arabic and Chinese ACE test sets for the sequential fine-tuning setup for zero-shot (ZS) transfer with mBERT and the VTRANS transfer approach with language-specific BERT (AR-BERT / ZH-BERT) or mBERT, on noisily translated FN/VN data (§2.4). F1 scores averaged over 5 runs; significant improvements (paired t-test; p < 0.05) over both baselines marked in bold.