Crosslingual Transfer Learning for Relation and Event Extraction via Word Category and Class Alignments

Previous work on crosslingual Relation and Event Extraction (REE) suffers from the monolingual bias issue due to the training of models on only the source language data. An approach to overcome this issue is to use unlabeled data in the target language to aid the alignment of crosslingual representations, i.e., via fooling a language discriminator. However, as this approach does not condition on class information, a target language example of a class could be incorrectly aligned to a source language example of a different class. To address this issue, we propose a novel crosslingual alignment method that leverages class information of REE tasks for representation learning. In particular, we propose to learn two versions of representation vectors for each class in an REE task based on either source or target language examples. Representation vectors for corresponding classes will then be aligned to achieve class-aware alignment for crosslingual representations. In addition, we propose to further align representation vectors for language-universal word categories (i.e., parts of speech and dependency relations). As such, a novel filtering mechanism is presented to facilitate the learning of word category representations from contextualized representations on input texts based on adversarial learning. We conduct extensive crosslingual experiments with English, Chinese, and Arabic over REE tasks. The results demonstrate the benefits of the proposed method that significantly advances the state-of-the-art performance in these settings.


Introduction
Relation and Event Extraction (REE) are important tasks of Information Extraction (IE), whose goal is to extract structured information from unstructured text (Walker et al., 2006). Due to their complexity, annotations for REE tasks are costly and only available in a few languages. Thus, there has been growing interest in crosslingual learning for REE, in which a model is trained on one language (the source language) and applied to another language (the target language) where annotations are not available. Recent approaches for crosslingual REE have mainly employed multilingual word embeddings, e.g., MUSE (Joulin et al., 2018; Ni and Florian, 2019; Subburathinam et al., 2019), or multilingual pre-trained language models, e.g., multilingual BERT (Devlin et al., 2019; M'hamdi et al., 2019; Ahmad et al., 2021; Nguyen and Nguyen, 2021), to learn crosslingual representation vectors for REE.
However, previous work on crosslingual REE suffers from the monolingual bias issue due to the monolingual training of models on only the source language data, leading to suboptimal crosslingual performance. A solution to this issue is language adversarial training (Chen et al., 2019; Huang et al., 2019; Keung et al., 2019; Lange et al., 2020; He et al., 2020), where unlabeled data in the target language is used to aid the alignment of crosslingual representations via fooling a language discriminator. The underlying principle of this approach is to encourage the closeness of representation vectors for sentences in the source and target languages (i.e., aligning representation vectors). However, a critical drawback of language adversarial training is its failure to condition on the classes/types of examples in the alignment process. As such, a target language example of a class could be incorrectly aligned to a source language example of a different class in REE, causing confusion and hindering the performance of the models. The middle sub-figure in Figure 2 demonstrates the class misalignment of representation vectors in crosslingual REE.
To this end, we propose a crosslingual alignment method that explicitly conditions on the class information of REE tasks to enhance representation alignment and learning. Our main intuition is that the semantics of the classes in REE tasks (e.g., the event type Attack in event extraction) are generally invariant across languages and can thus be leveraged as anchors to bridge representation vectors for examples in different languages. As such, we can obtain two semantic representation vectors for each class in an REE task based on the representation vectors of examples in either the source or the target language. Afterward, the representation vectors of the same class can be regulated to match each other, serving as a mechanism for class-aware crosslingual alignment of representation vectors for source and target examples. To implement this idea, we use multilingual BERT (mBERT) to obtain same-space representations for examples in both the source and target languages to facilitate the alignment process. The source-language representation vector for a class is then computed from the representation vectors of the source-language examples that belong to the corresponding class. For the target language, as class information is not provided, we compute the target-language representation vector for a class by aggregating the representation vectors of unlabeled examples, weighted by an estimate of the probability that each example exhibits the class.
In addition to class semantics, we propose to further exploit universal parts of speech and dependency relations in parsing trees (i.e., word categories) to improve the crosslingual alignment of representation vectors for REE. As such universal word categories have been consistently annotated for more than 100 languages (Zeman et al., 2020) and can be generated with high accuracy via existing toolkits, e.g., the transformer-based toolkit Trankit for multilingual NLP (Straka, 2018; Qi et al., 2020; Nguyen et al., 2021b), we expect this information to provide helpful anchor knowledge for crosslingual representation learning. Thus, similar to the class-aware alignment, we propose to align representation vectors of the same universal word categories, computed from contextualized representations of examples in the source and target languages, to further improve the language-independence of representation vectors for REE.
A potential issue with computing word category representations from contextualized representations of examples is that the word category representations may preserve context word information, which can introduce noise and hinder the representation alignment. To address this issue, we propose an adversarial training model that explicitly filters context information from word category representations. This is achieved by using a Gradient Reversal Layer (Ganin and Lempitsky, 2015) to prevent the word category representations from being able to recover the context words in the original examples. We expect this filtering mechanism to improve the word category pureness of the representations, thus providing appropriate inputs for the alignment process and improving representation learning.
We conduct extensive experiments with different crosslingual settings on English, Chinese, and Arabic for three REE tasks, i.e., Relation Extraction, Event Detection, and Event Argument Extraction. The results demonstrate the benefits of the proposed method that significantly advances the state-of-the-art performance in these settings.

Problem Statement
We study crosslingual transfer learning for three REE tasks as defined in the ACE 2005 dataset (Walker et al., 2006), i.e., Relation Extraction (RE), Event Detection (ED), and Event Argument Extraction (EAE). Given two entity mentions in an input sentence, the goal of RE is to determine the semantic relationship between the mentions according to predefined relation types/classes (e.g., Employment). For ED, the purpose is to identify event triggers, which can be verbs or nominalizations consisting of one or multiple words, that express occurrences of events of predefined types (e.g., Attack). Finally, given an event trigger and an entity mention, EAE aims to predict the role (e.g., Victim) that the entity mention plays in the corresponding event. Note that we have a special type None to indicate a non-relation, non-trigger, or non-argument for RE, ED, and EAE respectively.
For further discussion, let $D_{src} = \{(x_{src}, y_{src})\}$ (with $|D_{src}| = N_{src}$) be the labeled training set in the source language. For ED, $x_{src}$ is an input sentence and $y_{src}$ is the gold tag sequence (in the BIO scheme) for the words in $x_{src}$. For RE and EAE, $x_{src}$ involves an input sentence along with the indexes of the given trigger word and entity mentions, while $y_{src}$ represents the gold relation type or argument role for the input. We also assume access to an unlabeled dataset $D_{tgt} = \{x_{tgt}\}$ (with $|D_{tgt}| = N_{tgt}$) in the target language, where $x_{tgt}$ consists of the same type of information as $x_{src}$ for the corresponding task.
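To make the problem setup concrete, the following sketch shows how labeled source-language examples and unlabeled target-language examples for the three tasks might be represented; all field names and sample sentences are illustrative assumptions, not actual ACE 2005 records.

```python
# Hypothetical data records illustrating the problem setup; field names and
# sentences are illustrative assumptions, not from the ACE 2005 dataset.

# ED: a labeled source-language example is a sentence with a BIO tag sequence.
ed_example_src = {
    "words": ["A", "bomb", "exploded", "in", "Baghdad"],
    "tags":  ["O", "O", "B-Attack", "O", "O"],  # y_src: BIO tags over event types
}

# RE: a sentence plus the spans of the two given entity mentions and a relation.
re_example_src = {
    "words": ["John", "works", "for", "IBM"],
    "head_span": (0, 0),    # indices of the first entity mention
    "tail_span": (3, 3),    # indices of the second entity mention
    "label": "Employment",  # y_src: relation type (or None)
}

# EAE: a sentence plus the trigger index, an entity span, and an argument role.
eae_example_src = {
    "words": ["Rebels", "attacked", "the", "village"],
    "trigger_index": 1,     # index of the given event trigger
    "entity_span": (3, 3),  # span of the candidate argument
    "label": "Place",       # y_src: argument role (or None)
}

# Target-language examples carry the same inputs but no labels (D_tgt is unlabeled).
ed_example_tgt = {"words": ["演习", "在", "基地", "举行"]}
```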

Baseline Methods
To prepare for our crosslingual representation alignment techniques for REE, we first describe the baseline models explored in this work.

Using Source Language Data Only
In this section, we present two baselines that train models based only on labeled data in the source language. These baselines are the current state-of-the-art (SOTA) models for crosslingual transfer learning for ED, RE, and EAE on the ACE 2005 dataset (Walker et al., 2006).
BERTCRF (M'hamdi et al., 2019): This is the current SOTA model for crosslingual ED. Given an input sentence $w = [w_1, w_2, \ldots, w_n]$ with $n$ words (in $x_{src}$), the model first sends $w$ to the mBERT encoder to obtain a sequence of contextualized representations $Z = [z_1, z_2, \ldots, z_n]$, where $z_k$ is the representation for each $w_k \in w$, computed as the average of its word-piece representations returned by the last layer of mBERT. The ED task is then cast as sequence labeling over the words in $w$, where each word is assigned a BIO tag to capture the boundaries and event types of the event triggers in $w$. In particular, the final representation vector for trigger prediction $r^{ED}_{src,k}$ is directly formed from the word representation $z_k$ (i.e., $r^{ED}_{src,k} = z_k$). Afterward, this prediction representation is fed into a feed-forward network $FFN^{ED}$ to obtain a score vector that exhibits the likelihoods for $w_k$ to receive the possible BIO tags for the predefined event types: $s^{ED}_{src,k} = FFN^{ED}(r^{ED}_{src,k})$ for all $1 \le k \le n$. Next, the score vectors are sent to a Conditional Random Field (CRF) layer to learn the inter-dependencies between the tags and obtain the conditional probability for possible tag sequences $P^{ED}(\cdot|x_{src})$. The negative log-likelihood of the gold tag sequence $y_{src}$ is then used to train the model: $\mathcal{L}^{ED} = -\log P^{ED}(y_{src}|x_{src})$. Finally, Viterbi decoding is employed to perform prediction at inference time.

GATE (Ahmad et al., 2021): This is the current SOTA model for crosslingual RE and EAE on the ACE 2005 dataset. Given an input sentence $w$ in $x_{src}$, this model uses the same mBERT encoding step as BERTCRF to obtain the contextualized representation $z_k$ for each $w_k \in w$. Afterward, an overall word representation vector $v_k$ for $w_k$ is formed by the concatenation $v_k = [z_k; z^{pos}_k; z^{dep}_k]$, where $z^{pos}_k$ and $z^{dep}_k$ are the embeddings of the universal part of speech and the dependency relation for $w_k$. Here, the dependency relation for a word is obtained by retrieving the relation between the word and its governor in the dependency tree. For RE, given two entity mentions, the sequence of vectors $V = [v_1, v_2, \ldots, v_n]$ is passed to a Transformer layer (Vaswani et al., 2017) along with a syntax-based attention mask to compute a final representation vector $r^{RE}_{src}$ for relation prediction over the input $x_{src}$. Afterward, a score vector for the possible relations is computed via a feed-forward network $FFN^{RE}$: $s^{RE}_{src} = FFN^{RE}(r^{RE}_{src})$. The score vector $s^{RE}_{src}$ is then sent to a softmax layer to obtain a distribution over the possible relation types for $x_{src}$: $P^{RE}(\cdot|x_{src})$. Finally, to train the model, we minimize the standard negative log-likelihood of the gold label $y_{src}$: $\mathcal{L}^{RE} = -\log P^{RE}(y_{src}|x_{src})$. For EAE, given an event trigger and an entity mention, we follow the same steps as for RE to compute the representation vector for role prediction $r^{EAE}_{src}$, the score vector $s^{EAE}_{src}$, and the negative log-likelihood $\mathcal{L}^{EAE}$ for optimization.
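As a rough illustration of the shared encoding step, the following PyTorch sketch computes the word representations $z_k$ (word-piece averaging over the last mBERT layer) and the tag scores $s^{ED}_{src,k}$ of BERTCRF. The CRF layer and GATE's syntax-masked Transformer are omitted, and the network sizes are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# A sketch of the mBERT encoding step shared by BERTCRF and GATE; the CRF layer
# and GATE's syntax-masked Transformer are omitted, and sizes are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def encode_words(words):
    """Return one vector z_k per word: the average of its word-piece vectors
    from the last mBERT layer."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mbert(**enc).last_hidden_state.squeeze(0)  # (num_pieces, 768)
    word_ids = enc.word_ids()  # maps each word piece to its word index (or None)
    z = []
    for k in range(len(words)):
        piece_idx = [i for i, w in enumerate(word_ids) if w == k]
        z.append(hidden[piece_idx].mean(dim=0))
    return torch.stack(z)  # (n, 768)

# BERTCRF head: a feed-forward network scoring BIO tags per word; a CRF layer
# would then model tag inter-dependencies, with Viterbi decoding at inference.
num_bio_tags = 67  # assumption: 2 * 33 ACE event types + 1 for the O tag
ffn_ed = torch.nn.Sequential(
    torch.nn.Linear(768, 50), torch.nn.ReLU(), torch.nn.Linear(50, num_bio_tags)
)

z = encode_words(["A", "bomb", "exploded", "in", "Baghdad"])
scores = ffn_ed(z)  # s_k: (n, num_bio_tags), fed to the CRF layer in BERTCRF
```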
Finally, for convenience, let $r^{ED}_{tgt,k}$, $r^{RE}_{tgt}$, and $r^{EAE}_{tgt}$ be the final representation vectors for $x_{tgt}$ in the unlabeled data of the target language. We also use $s^{ED}_{tgt,k}$, $s^{RE}_{tgt}$, and $s^{EAE}_{tgt}$ for the likelihood score vectors for examples in the target language. These vectors are computed in the same way as their source language counterparts in this section.

Using Unlabeled Target Language Data
To avoid the monolingual bias of the crosslingual methods for REE in Section 3.1, our work exploits unlabeled data in the target language to improve the crosslingual representations for REE. This section presents typical approaches for leveraging unlabeled target language data in crosslingual transfer learning for NLP, offering additional baselines for our proposed model.

Language Adversarial Training (LADV): To leverage unlabeled data in the target language, this method introduces a language discriminator that receives representation vectors for input sentences and predicts the language identity (i.e., source or target) of the sentences (Chen et al., 2019; Huang et al., 2019; Keung et al., 2019; Cao et al., 2020). Given an REE task $t \in \{ED, RE, EAE\}$, the method jointly trains a model for $t$ (i.e., one of those described in Section 3.1) and the language discriminator so that the induced representation vectors for $t$ contain the information necessary for predictions in $t$ while being language-agnostic, to better transfer knowledge across languages.
To implement this method, we first obtain a representation vector for each input sentence in the source and target language data by feeding it into mBERT to obtain the word representation vectors $[z_1, z_2, \ldots, z_n]$ as in BERTCRF. Following Keung et al. (2019), the average of these word vectors is used as the representation for the sentence in this baseline. For convenience, let $a_{src}$ and $a_{tgt}$ be the sentence representation vectors for the input sentences in $x_{src}$ and $x_{tgt}$ respectively. Also, let $f^t_{lng}$ be the language discriminator for task $t$ (implemented by a feed-forward network with a sigmoid activation at the end). In the next step, the representation vector $a_*$ ($* \in \{src, tgt\}$) for each sentence is sent to $f^t_{lng}$ to obtain a probability $p_* = f^t_{lng}(a_*)$, indicating the likelihood that the input sentence belongs to the source language. Treating source and target language sentences as positive and negative examples respectively, the loss for the discriminator is computed via the negative log-likelihood: $\mathcal{L}_{disc} = -\log p_{src} - \log(1 - p_{tgt})$. The overall joint loss to train the model for $t$ with LADV is thus $\mathcal{L} = \mathcal{L}^t + \mathcal{L}_{disc}$. Note that as LADV aims to prevent the language discriminator from recognizing the language identity from the sentence representation vectors, we insert a Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015) between $a_*$ and $f^t_{lng}$ to reverse the gradients during the backward pass from $\mathcal{L}_{disc}$. Overall, fooling the language discriminator in LADV with the GRL eliminates language-specific features to improve generalization across languages for $t$.

mBERT Finetuning (FMBERT): Recently, it has been shown that fine-tuning multilingual pre-trained language models on unlabeled data of the target language can improve crosslingual performance for NLP tasks (Pfeiffer et al., 2020). Motivated by such prior work, this baseline exploits the unlabeled data in the target language for crosslingual representation learning by fine-tuning mBERT on this data with masked language modeling (MLM) (Devlin et al., 2019). Afterward, the fine-tuned mBERT model is utilized in the encoders of the baseline models for the REE tasks in Section 3.1.
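Since gradient reversal is used both in LADV and in the context filtering mechanism introduced later, a minimal PyTorch sketch of the GRL and the LADV discriminator loss may be helpful; the discriminator architecture and sizes are assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient Reversal Layer (Ganin and Lempitsky, 2015): identity in the
    forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grl(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Language discriminator f_lng: a feed-forward net with a sigmoid output,
# predicting whether a sentence representation comes from the source language.
# The hidden size (50) is an assumption; 768 matches mBERT.
f_lng = torch.nn.Sequential(
    torch.nn.Linear(768, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 1), torch.nn.Sigmoid(),
)

def ladv_loss(a_src, a_tgt):
    """Discriminator loss L_disc; the GRL makes the encoder *maximize* it,
    removing language-specific features from the sentence representations."""
    p_src = f_lng(grl(a_src))  # source sentences are positive examples
    p_tgt = f_lng(grl(a_tgt))  # target sentences are negative examples
    return -(torch.log(p_src + 1e-8).mean() + torch.log(1 - p_tgt + 1e-8).mean())
```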

Class-based Alignment

To achieve class-aware crosslingual alignment for REE, we propose to learn two versions of semantic representation vectors for each class in an REE task. One version is based on representations of examples in the source language while the other employs representations from target language examples. The two representation versions will then be matched to achieve crosslingual representation alignment for REE.

As such, let $l$ be a class in an REE task $t$ (e.g., $l$ is a BIO tag for event types in ED). We compute the source-language representation $c^t_{src,l}$ for $l$ via the average of the representation vectors of the examples with label $l$ in $D_{src}$. In particular, for $t = RE$ or $EAE$, we have:
$$c^t_{src,l} = \frac{1}{N^l_{src}} \sum_{(x_{src}, y_{src}) \in D_{src}} \mathbb{1}[y_{src} = l] \, r^t_{src}$$
Similarly, for $t = ED$:
$$c^{ED}_{src,l} = \frac{1}{N^l_{src}} \sum_{(x_{src}, y_{src}) \in D_{src}} \sum_{k=1}^{n} \mathbb{1}[y_{src,k} = l] \, r^{ED}_{src,k}$$
Here, $\mathbb{1}$ is the indicator function, $y_{src,k}$ is the gold tag of the $k$-th word, and $N^l_{src}$ is the number of examples (for RE and EAE) or words (for ED) in $D_{src}$ that are annotated with label $l$.
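A possible implementation of the source-side computation, assuming the example (or, for ED, word) representations and gold labels have been collected into tensors:

```python
import torch

def source_class_reps(reps, labels, num_classes):
    """c_src[l]: mean of the representation vectors of source examples with
    label l. For RE/EAE, `reps` holds one vector per example; for ED, one
    vector per word, with `labels` holding the BIO tag index per word. A
    sketch; per-epoch accumulation is an implementation choice."""
    dim = reps.size(1)
    c_src = torch.zeros(num_classes, dim)
    for l in range(num_classes):
        mask = labels == l                     # indicator 1[y = l]
        if mask.any():
            c_src[l] = reps[mask].mean(dim=0)  # divides by N^l_src
    return c_src
```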
In the target language, as the gold labels $y_{tgt}$ for the examples $x_{tgt}$ are not provided, we propose to obtain a target-language representation $c^t_{tgt,l}$ by aggregating the representation vectors of all examples $x_{tgt} \in D_{tgt}$. Probability estimates for the examples or words to belong to class $l$ are used as the weights for the aggregation. In particular, we obtain the probability estimates by sending the score vectors $s^{ED}_{tgt,k}$, $s^{RE}_{tgt}$, and $s^{EAE}_{tgt}$ to a softmax layer: $\hat{y}^{ED}_{tgt,k} = \mathrm{softmax}(s^{ED}_{tgt,k})$ and $\hat{y}^t_{tgt} = \mathrm{softmax}(s^t_{tgt})$ (for $t = RE$ or $EAE$). As such, we obtain the target-language representation for $l$ via the weighted sum of $r^t_{tgt}$ (for RE and EAE):
$$c^t_{tgt,l} = \sum_{x_{tgt} \in D_{tgt}} \hat{y}^t_{tgt,l} \, r^t_{tgt}$$
Similarly, for ED:
$$c^{ED}_{tgt,l} = \sum_{x_{tgt} \in D_{tgt}} \sum_{k=1}^{n} \hat{y}^{ED}_{tgt,k,l} \, r^{ED}_{tgt,k}$$
where $\hat{y}^t_{tgt,l}$ and $\hat{y}^{ED}_{tgt,k,l}$ represent the likelihood scores for class $l$ in the vectors $\hat{y}^t_{tgt}$ and $\hat{y}^{ED}_{tgt,k}$ respectively. The alignment for the representations of class $l$ is then achieved by minimizing the negative cosine similarity between the source- and target-language vectors (i.e., for task $t$):
$$\mathcal{L}^t_{cls} = -\sum_{l} \cos(c^t_{src,l}, c^t_{tgt,l})$$

Adaptive Coefficient: In our implementation, we compute the source-language representations $c^t_{src,l}$ after each training epoch, while the target-language representations $c^t_{tgt,l}$ are obtained in each training minibatch. The current parameters of the models are utilized to perform these calculations. As such, the quality of the class representation vectors might vary along the training process: later epochs tend to correspond to better model parameters and thus more reliable class representations. To this end, we apply an adaptive coefficient $\lambda_{cls}$ to the class alignment loss $\mathcal{L}^t_{cls}$ so that its impact gradually increases during training: $\lambda_{cls} = \frac{2}{1+\exp(-e/E)} - 1$, where $E$ and $e$ are the total and current numbers of training epochs, respectively. Note that $\lambda_{cls}$ is small in the early training stages and gradually increases over the course of training.
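The target-side aggregation and the alignment loss can be sketched as follows, shown for the RE/EAE case where there is one representation per example (for ED, the same computation runs over word-level vectors and scores); summing the loss over classes is an assumption of the sketch.

```python
import math
import torch
import torch.nn.functional as F

def target_class_reps(reps_tgt, scores_tgt):
    """c_tgt[l]: weighted sum of target example representations, with the
    softmax probability of class l as the weight for each example.
    reps_tgt: (m, dim); scores_tgt: (m, num_classes) raw score vectors."""
    probs = F.softmax(scores_tgt, dim=-1)  # y_hat: (m, num_classes)
    return probs.t() @ reps_tgt            # (num_classes, dim)

def class_alignment_loss(c_src, c_tgt):
    """Negative cosine similarity between corresponding class vectors.
    Cosine similarity is scale-invariant, so the weighted sums need not be
    normalized."""
    return -F.cosine_similarity(c_src, c_tgt, dim=-1).sum()

def adaptive_coefficient(e, E):
    """lambda_cls = 2 / (1 + exp(-e/E)) - 1: close to 0 in early epochs and
    gradually increasing as training progresses."""
    return 2.0 / (1.0 + math.exp(-e / E)) - 1.0
```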

Word Category-based Alignment
We further exploit universal parts of speech (UPOS) and dependency relations as language-agnostic knowledge to align crosslingual representations for REE. To achieve a fair comparison with prior work (Subburathinam et al., 2019; Ahmad et al., 2021), we employ the UDPipe toolkit (Straka and Straková, 2017) to obtain parts of speech and dependency relations for the sentences. Due to their similarity, we will only describe the UPOS-based alignment process; the dependency-based alignment is done in the same way.
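We use UDPipe for comparability with prior work; purely as an illustration, the same universal annotations can be obtained with Stanza (Qi et al., 2020), one of the toolkits cited above:

```python
import stanza

# Illustration only: the paper uses UDPipe, but Stanza exposes the same
# universal annotations. Download the English models once with
# stanza.download("en") before running.
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("Rebels attacked the village.")
for sent in doc.sentences:
    for word in sent.words:
        # word.upos: universal POS tag; word.deprel: the dependency relation
        # between the word and its governor in the parse tree.
        print(word.text, word.upos, word.deprel)
```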
We utilize an embedding table $U$ (initialized randomly) to capture representation vectors for the possible UPOS, serving as anchor knowledge across languages. Next, to facilitate the UPOS-based representation alignment, we compute additional representation vectors for UPOS based on the representation vectors of examples in both the source and target languages. In particular, for each word $w_k$ in an input sentence $w$ (from $x_{src}$ or $x_{tgt}$), we send its contextualized representation $z_k$ from mBERT into a feed-forward network $FFN^{UPOS}$ to produce a representation vector $q_k$ for the UPOS $w^{pos}_k$ of $w_k \in w$: $q_k = FFN^{UPOS}(z_k)$. Afterward, to leverage the language-universal nature of $U$, we propose to match $q_k$ to the embedding vector of $w^{pos}_k$ in $U$ for words in both the source and target language data. In other words, induced representation vectors in the source and target languages are both matched to the anchor knowledge in $U$, providing a mechanism to align source and target representations.
To match $q_k$ and $U$, we seek to maximize the similarity between $q_k$ and the embedding of $w^{pos}_k$ in $U$ while simultaneously minimizing the similarities between $q_k$ and the embeddings of the other UPOS. To implement this idea, we minimize the following function:
$$\mathcal{L}^{align}_{pos} = -\sum_{w \in D} \sum_{k=1}^{n} \log \frac{\exp(q_k \cdot U[w^{pos}_k])}{\sum_{u \in O} \exp(q_k \cdot U[u])}$$
where $D = D_{src} \cup D_{tgt}$, $O$ is the set of possible UPOS, and $U[u]$ is the embedding of $u$ in $U$.

Context Information Filtering: Note that $\mathcal{L}^{align}_{pos}$ is also the negative log-likelihood of a feed-forward classifier that uses $U$ as the weight matrix and $q_k$ as the input vector to predict the UPOS $w^{pos}_k$ for $w_k$. As such, minimizing $\mathcal{L}^{align}_{pos}$ also serves to retain relevant information for UPOS prediction in the representation vector $q_k$. However, due to the direct computation of $q_k$ from the contextualized representation $z_k$, it is possible that $q_k$ still preserves context information from the input sentence $w$. This might introduce noise into $q_k$ as, ideally, we expect $q_k$ to focus only on information about the UPOS. To improve the quality of $q_k$ for representation alignment, we thus propose to explicitly filter context information from the vectors $q_k$. Our main idea is to ensure that $q_k$ cannot be used to recover the context words in $w$. To achieve this goal, we first obtain an aggregated vector over the UPOS representation vectors in the input sentence $w$: $q = \frac{1}{n}\sum_{k=1}^{n} q_k$. The resulting vector is then fed into a Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015), followed by a word classifier (i.e., a feed-forward network $FFN^{ctx}$ with a softmax layer at the end) to compute a probability distribution over the words in the vocabulary: $\hat{y}_{ctx} = \mathrm{softmax}(FFN^{ctx}(GRL(q)))$. Finally, to filter the context information from $q_k$, we minimize the negative log-likelihood of the context words $w_k$ in the input sentence $w$:
$$\mathcal{L}^{ctx}_{pos} = -\sum_{k=1}^{n} \log \hat{y}_{ctx}[w_k]$$
where $\hat{y}_{ctx}[w_k]$ is the probability for the word $w_k$ in the distribution $\hat{y}_{ctx}$. Note that while minimizing a negative log-likelihood generally encourages input representations to reveal information about the prediction outputs (i.e., the context words in our case), the introduction of the GRL in $\mathcal{L}^{ctx}_{pos}$ reverses this process to discourage context information in $q$, thus purifying $q_k$ to focus on UPOS knowledge and facilitating the representation alignment.
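A sketch of the two UPOS losses in PyTorch, reusing the gradient reversal idea from the LADV sketch above; all sizes (including the UPOS inventory and vocabulary sizes) are assumptions.

```python
import torch
import torch.nn.functional as F

class _GradReverse(torch.autograd.Function):
    # Same Gradient Reversal Layer as in the LADV sketch above.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

NUM_UPOS, DIM, HID, VOCAB = 17, 768, 50, 119547  # sizes are assumptions

U = torch.nn.Parameter(torch.randn(NUM_UPOS, HID))  # UPOS embedding table U
ffn_upos = torch.nn.Sequential(torch.nn.Linear(DIM, HID), torch.nn.ReLU(),
                               torch.nn.Linear(HID, HID))
ffn_ctx = torch.nn.Linear(HID, VOCAB)               # context-word classifier

def upos_losses(z, upos_ids, word_ids):
    """z: (n, DIM) contextualized word vectors from mBERT; upos_ids: (n,)
    indices of the UPOS tags w_pos_k; word_ids: (n,) vocabulary ids of the
    words w_k in the sentence."""
    q = ffn_upos(z)                                 # q_k = FFN_UPOS(z_k)

    # L_align: negative log-likelihood of the correct UPOS under a softmax
    # over dot products between q_k and the rows of U.
    l_align = F.cross_entropy(q @ U.t(), upos_ids)

    # L_ctx: the sentence-level average of q_k, passed through the GRL, should
    # fail to recover the context words -- the reversed gradients discourage
    # word-identity information in q_k.
    q_bar = _GradReverse.apply(q.mean(dim=0))       # (HID,)
    log_probs = F.log_softmax(ffn_ctx(q_bar), dim=-1)
    l_ctx = -log_probs[word_ids].sum()
    return l_align, l_ctx
```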
For universal dependency relations, we follow the same procedure as for $\mathcal{L}^{align}_{pos}$ and $\mathcal{L}^{ctx}_{pos}$ to obtain the losses $\mathcal{L}^{align}_{dep}$ and $\mathcal{L}^{ctx}_{dep}$ for minimization. For convenience, let $\mathcal{L}_{pos} = \mathcal{L}^{align}_{pos} + \mathcal{L}^{ctx}_{pos}$ and $\mathcal{L}_{dep} = \mathcal{L}^{align}_{dep} + \mathcal{L}^{ctx}_{dep}$. In summary, the overall loss function to train our models for a task $t \in \{ED, RE, EAE\}$ with both class and word category alignment is:
$$\mathcal{L}_{main} = \mathcal{L}^t + \lambda_{cls}\mathcal{L}^t_{cls} + \lambda_{pos}\mathcal{L}_{pos} + \lambda_{dep}\mathcal{L}_{dep}$$
where $\lambda_{cls}$ is the adaptive coefficient, and $\lambda_{pos}$ and $\lambda_{dep}$ are trade-off parameters.

Experiments

Our experiments are conducted on the ACE 2005 dataset (Walker et al., 2006). For each language (i.e., English, Chinese, and Arabic) and task (i.e., ED, RE, and EAE), the data split provides training, development, and test data. In our crosslingual transfer learning experiments, the models are trained on the training data of one language (the source) and evaluated on the test data of another language (the target). The unlabeled data for the target language is obtained by removing the labels from its training data. We use the same hyper-parameters for BERTCRF and GATE as provided by previous work (M'hamdi et al., 2019; Ahmad et al., 2021). The hyper-parameters specific to our model are tuned on the development data. In particular, we use two layers for the feed-forward networks with 50 hidden units per layer, 50 dimensions for the UPOS and dependency embeddings, and 0.1 for the parameters $\lambda_{pos}$ and $\lambda_{dep}$. For the FMBERT baseline, we utilize the Hugging Face library to fine-tune mBERT on the unlabeled target data with MLM for 100,000 steps (i.e., with a batch size of 64 and a learning rate of 5e-5).
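For the FMBERT baseline, MLM fine-tuning with the Hugging Face library might look roughly as follows; the file path and preprocessing details are illustrative assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

# Sketch of the FMBERT baseline: continue MLM training of mBERT on unlabeled
# target-language sentences; "target_unlabeled.txt" is an illustrative path.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

data = load_dataset("text", data_files={"train": "target_unlabeled.txt"})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fmbert", max_steps=100_000,
                           per_device_train_batch_size=64, learning_rate=5e-5),
    train_dataset=data["train"],
    # Randomly masks tokens (15% by default) for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the fine-tuned encoder then replaces mBERT in Section 3.1
```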
Performance Comparison: We compare the proposed crosslingual method for REE against two groups of baselines. The first group involves models that only use source language data for training, i.e., BERTCRF and GATE; these are the current SOTA methods for crosslingual ED, RE, and EAE. The second group additionally employs unlabeled data in the target language to support crosslingual representation learning in REE, i.e., LADV and FMBERT. Our proposed method, called CCCAR for class- and word category-based crosslingual alignment of representations, also leverages unlabeled data in the target language. Note that LADV, FMBERT, and CCCAR are applied on top of a source-only method (i.e., BERTCRF or GATE) to form a complete model. Tables 2 and 3 show the test performance of the models for the three REE tasks in six crosslingual settings (i.e., with different pairs of source and target languages). It is clear from the tables that the proposed method CCCAR consistently outperforms the other methods in all crosslingual settings for the three REE tasks. In particular, for EAE, CCCAR substantially improves over the baseline model GATE (i.e., the current SOTA) by 1.9% on average, while the improvements for LADV and FMBERT are only 0.45% and 0.38%. The same trend holds for RE and ED, where CCCAR on average improves over the baselines by 1.97% and 7.7% respectively. These results clearly demonstrate the effectiveness of the proposed method, highlighting the benefits of class- and word category-based alignment for crosslingual REE.

Ablation Study: This section presents an ablation study to understand the contribution of each component of the proposed crosslingual alignment method CCCAR. In particular, we examine the performance of the following ablated models: (i) -Class Align.: this model excludes the class-based alignment component (i.e., the loss $\mathcal{L}^t_{cls}$) from CCCAR; (ii) -Adaptive Coeff.: instead of using the adaptive coefficient $\lambda_{cls}$ for the class-based alignment loss $\mathcal{L}^t_{cls}$, this model uses a fixed value (i.e., 0.2 as tuned on the development data) for $\lambda_{cls}$; (iii) -UPOS Align.: this model eliminates the UPOS-based alignment component (i.e., the losses $\mathcal{L}^{align}_{pos}$ and $\mathcal{L}^{ctx}_{pos}$) from CCCAR; (iv) -Dep Align.: the alignment component based on dependency relations (i.e., the losses $\mathcal{L}^{align}_{dep}$ and $\mathcal{L}^{ctx}_{dep}$) is not utilized in this model; (v) -Word Cat Align.: this model removes both UPOS-based and dependency-based alignment from CCCAR (i.e., excluding $\mathcal{L}_{pos}$ and $\mathcal{L}_{dep}$); and (vi) -Context Filtering: the word context filtering for the representation vectors of UPOS and dependency relations (with the GRL) is not employed in this model (i.e., eliminating the losses $\mathcal{L}^{ctx}_{pos}$ and $\mathcal{L}^{ctx}_{dep}$). Table 4 presents the test performance of the models in the English-to-Chinese and English-to-Arabic settings for the three REE tasks. As can be seen, removing any component of the proposed model hurts the performance significantly across the different settings and tasks, clearly illustrating the benefits of the designed components for CCCAR. The performance drops most when the class-based alignment is excluded, further demonstrating the importance of class-aware alignment for crosslingual REE.
Source-language Data Usage: The previous experiments show that using unlabeled data in the target language to align representation vectors in CCCAR can improve the performance of the source-only baselines for REE. In this section, we seek to understand how much labeled data in the source language can be saved when unlabeled data in the target language is employed with CCCAR for an REE task. In particular, we are interested in the portion of source language data that, once combined with unlabeled target language data via CCCAR, produces performance similar to that of the source-only baseline trained on the full source language data. To this end, we show the learning curves of the source-only and CCCAR-augmented models for the REE tasks as the size of the source-language training data varies. Figure 3 shows the curves for the English-to-Chinese setting. As can be seen, the proposed CCCAR method with unlabeled target data only needs approximately 60% of the source-language training data for RE and EAE to achieve performance comparable to the source-only baselines trained on the full source language data. This portion is less than 80% for ED. These results thus suggest an additional benefit of CCCAR: it can significantly reduce the necessary data annotation in the source language by exploiting unlabeled target language data in crosslingual learning for REE.
Alignment Effect of the Proposed Method: As discussed earlier, a major issue for LADV is that it might align representations of examples with different classes in the crosslingual setting. CCCAR addresses this issue as it explicitly relies on class information for representation alignment. To demonstrate these arguments, Figure 2 uses t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton, 2008) to visualize the example representations induced by GATE, the LADV baseline GATE+LADV, and the proposed GATE+CCCAR. This visualization is done over 4,000 randomly selected examples for the 5 most frequent classes in EAE.
Here, examples are sampled from the training data of both the source and target languages in the English-to-Chinese setting. As can be seen, in the source-only model GATE, representations for examples from the source language are quite separate from those in the target language. The representation alignment in GATE+LADV addresses this issue by pushing representations from both languages closer. However, representations for examples of different classes are unexpectedly aligned in GATE+LADV, causing suboptimal representations for crosslingual settings. Finally, due to its explicit conditioning on class information for alignment, GATE+CCCAR matches representations across both languages while avoiding cross-class alignment, thereby improving crosslingual performance for REE.
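As an illustration of how such a visualization can be produced, assuming the representations, class labels, and language identities of the sampled examples have been saved beforehand (file names are hypothetical):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sketch of the t-SNE analysis: `reps` is a (4000, dim) array of example
# representations from a trained model for the 5 most frequent EAE classes,
# `labels` holds the class ids, and `langs` is 0/1 for source/target; all are
# assumed to be collected beforehand from the English and Chinese training data.
reps = np.load("eae_reps.npy")      # hypothetical file names
labels = np.load("eae_labels.npy")
langs = np.load("eae_langs.npy")

points = TSNE(n_components=2, random_state=0).fit_transform(reps)
for lang, marker in [(0, "o"), (1, "^")]:  # circle = source, triangle = target
    mask = langs == lang
    plt.scatter(points[mask, 0], points[mask, 1],
                c=labels[mask], cmap="tab10", marker=marker, s=8)
plt.savefig("tsne_eae.png")
```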

Related Work

A fundamental limitation of existing crosslingual models for REE is the monolingual bias due to their sole reliance on source language data for training. In other NLP tasks, LADV has been explored to address this issue by leveraging unlabeled data in the target language to perform crosslingual representation alignment (Chen et al., 2019; Huang et al., 2019; Lange et al., 2020; Cao et al., 2020; He et al., 2020). Unfortunately, LADV suffers from the cross-class alignment issue, making it less optimal for crosslingual REE. Finally, we note that language-universal representation learning is related to domain adaptation research, where models seek to learn domain-invariant representations (Ganin and Lempitsky, 2015; Fu et al., 2017; Adel et al., 2017; Xie et al., 2018; Cicek and Soatto, 2019; Tang et al., 2020; Ngo et al., 2021).

Conclusions
We present a novel method for crosslingual transfer learning for REE that leverages unlabeled data in the target language to support language-universal representation learning. Our method aligns class representation vectors computed from source and target language examples and further aligns representations for universal word categories (i.e., parts of speech and dependency relations), using an adversarial filtering mechanism to remove context information from the word category representations. Extensive experiments on English, Chinese, and Arabic demonstrate that the proposed method significantly advances the state-of-the-art performance for crosslingual RE, ED, and EAE.