Using Deep Mixture-of-Experts to Detect Word Meaning Shift for TempoWiC

This paper describes the DMA submission to the TempoWiC task, which achieves a macro-F1 score of 77.05% and ranks first in the task. We first explore the impact of different pre-trained language models. Then we adopt data cleaning, data augmentation, and adversarial training strategies to enhance the model's generalization and robustness. For further improvement, we integrate POS information and word semantic representation using a Mixture-of-Experts (MoE) approach. The experimental results show that MoE can overcome the feature overuse issue and combine the context, POS, and word semantic features well. Additionally, we use a model ensemble method for the final prediction, which has been proven effective by many research works.


Introduction
Lexical Semantic Change (LSC) detection has drawn increasing attention in the past years (Liu et al., 2021; Laicher et al., 2021). Existing research (Liu et al., 2021) has shown that contextual word embeddings, such as those produced by BERT (Devlin et al., 2018), have great advantages over non-contextual embeddings for inferring semantic shift when data is limited. Meanwhile, many datasets have been released to accelerate research in this direction. Pilehvar and Camacho-Collados (2018) proposed the Word-in-Context (WiC) dataset as a benchmark for the generic evaluation of context-sensitive representations. Raganato et al. (2020) extended WiC to the multilingual XL-WiC dataset. In contrast to these, TempoWiC (Loureiro et al., 2022b) is crucially designed around time-sensitive meaning shift and instances of word usage tied to Twitter trending topics. Our main work is to build a system that can detect semantic changes of target words in tweet pairs across different time periods for TempoWiC.
The task is framed as a binary classification problem that asks whether two instances of a target word have the same meaning, and pre-trained language models are adopted to produce contextual embeddings.

Task Description
TempoWiC (Loureiro et al., 2022b) is a new benchmark aimed especially at detecting meaning shift in social media. Given a pair of sentences and a target word, the task is framed as a simple binary classification problem: deciding whether the meaning of the target word in the first context is the same as in the second.
The TempoWiC dataset consists of 3,297 annotated instances, divided into train/dev/test sets of 1,428/396/1,473 instances, respectively. The target words involved in this task do not overlap between sets. For each sample, tweet pairs containing the target word were collected from the Twitter API at different time periods; the prior date is exactly one year before the peak date to avoid seasonal confounding factors. The label True indicates that the word has the same meaning in the two tweets, while the label False indicates that the meaning differs.

Pre-trained Language Models
Recently, pre-trained language models (LMs) have achieved remarkable success on natural language processing tasks, becoming one of the most effective tools for engineers and scholars. Transformer-based pre-trained language models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2020), and DeBERTaV3 (He et al., 2021) are designed to pre-train deep representations from unlabeled text, which can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
By training language models on Twitter corpora from different time periods, Loureiro et al. (2022a) showed that language undergoes semantic transformations over time, demonstrating that training a language model on outdated corpora leads to a decline in performance.

Mixture-of-Experts
MoE (Arnaud et al., 2019) is an approach for conditionally computing a representation. Given several expert inputs, the output of MoE is a weighted combination of the experts. Recently, MoE has achieved significant improvements on several natural language processing tasks, such as named entity recognition (Meng et al., 2021), recommendation (Zhu et al., 2020), and machine translation (Shazeer et al., 2017).

System Overview
In this section, we first present the framework details of the models adopted in our work. Then we introduce several strategies for improving the models' robustness. Finally, we describe the design of our model ensemble method.

Models
Our model framework can be divided into three layers: encoding, matching, and prediction. The encoding layer performs sequence modeling to capture contextual semantic representations. The matching layer focuses on finding the interrelations and differences between the target words in the two tweets. The prediction layer is implemented as a classifier that decides whether the meaning of the target word is the same or not.

A. Base Model

Figure 1 shows the details of our base model. The two tweets are concatenated and fed into a pre-trained LM, yielding the contextual embeddings (e.g., E_1, E_2)¹ corresponding to the target word in each tweet of the pair. E_1 and E_2 are then processed by the matching layer to find the difference between the two tweets.

¹We experimented with different target word representations: the first token in the word span, the mean value of all tokens in the span, and the concatenation of the first and last tokens in the span. We found that the concatenation of the first and last tokens performs best. Please refer to Appendix A for more details.
The procedure can be summarized as follows:

E_match = [E_1, E_2, E_1 * E_2, E_1 - E_2]
y_o = MLP([E_CLS, E_match])
L = CrossEntropy(y_o, y_true)

where E_CLS is the embedding of the first token, E_match is the output of the matching layer, MLP is a multi-layer perceptron, y_true is the gold label, and y_o is the output of the base model. E_1 * E_2 denotes the Hadamard product of the two vectors, and E_1 - E_2 denotes element-wise subtraction.
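As a minimal sketch of the matching and prediction layers (assuming standard PyTorch; the layer sizes and the exact composition of E_match are illustrative, following the description above):

```python
import torch
import torch.nn as nn

class MatchingClassifier(nn.Module):
    """Matching + prediction layers of the base model (sketch; sizes illustrative)."""

    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        # Input is [E_CLS; E1; E2; E1*E2; E1-E2], i.e. five dim-sized vectors.
        self.mlp = nn.Sequential(
            nn.Linear(dim * 5, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # binary decision: same meaning or not
        )

    def forward(self, e_cls, e1, e2):
        # Matching layer: Hadamard product and element-wise subtraction
        e_match = torch.cat([e1, e2, e1 * e2, e1 - e2], dim=-1)
        # Prediction layer: MLP over [E_CLS; E_match]
        return self.mlp(torch.cat([e_cls, e_match], dim=-1))
```

The output logits would be trained with a standard cross-entropy loss against the gold label.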
We extend the base model with two separate BiLSTMs to integrate the POS information and the word semantic representation. For a pair of tweets, we first extract the contextual embeddings of the target word from the pre-trained LM, and then use two separate BiLSTMs to obtain a POS encoding and a word semantic encoding. Finally, an MoE module merges these three encodings for the target word. The generated embeddings (e.g., E_1, E_2) for the target word are then processed by the matching layer and prediction layer as described above.
Here we denote the POS encodings for the target word in the pair of tweets as E_P1, E_P2, and the GloVe-initialized word semantic encodings as E_G1, E_G2, respectively. The details of the MoE module for this task are given in Figure 3; it consists of a gating network and three experts. We explore two gating designs:

• Separate Gating Network (S-Gate): The weight for each expert is calculated separately. We define a task-specific vector V_t; the weight for expert i is calculated as

w_i = σ(θ^T [V_t, E_i])

where θ are trainable parameters, [,] is concatenation, σ is the Sigmoid activation, and E_i is the encoding of the i-th expert.
• Joint Gating Network (J-Gate): The weights for all experts are calculated together. We define the weight vector for all experts as W, a three-dimensional vector calculated as

W = Softmax(θ [V_t, E_1, E_2, E_3])

where θ are trainable parameters.
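The two gating variants can be sketched as follows (assuming PyTorch; the parameterization of θ as a single linear layer, and the softmax normalization in J-Gate, are assumptions consistent with the formulas above, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class SeparateGate(nn.Module):
    """S-Gate sketch: one sigmoid weight per expert, computed independently."""

    def __init__(self, dim):
        super().__init__()
        self.v_t = nn.Parameter(torch.randn(dim))  # task-specific vector V_t
        self.theta = nn.Linear(2 * dim, 1)         # scores the concatenation [V_t; E_i]

    def forward(self, experts):  # experts: list of (batch, dim) tensors
        vt = self.v_t.expand(experts[0].size(0), -1)
        weights = [torch.sigmoid(self.theta(torch.cat([vt, e], dim=-1)))
                   for e in experts]
        return sum(w * e for w, e in zip(weights, experts))

class JointGate(nn.Module):
    """J-Gate sketch: one weight vector over all experts, computed jointly."""

    def __init__(self, dim, n_experts=3):
        super().__init__()
        self.v_t = nn.Parameter(torch.randn(dim))
        self.theta = nn.Linear(dim * (n_experts + 1), n_experts)

    def forward(self, experts):
        vt = self.v_t.expand(experts[0].size(0), -1)
        w = torch.softmax(self.theta(torch.cat([vt] + experts, dim=-1)), dim=-1)
        return sum(w[:, i:i + 1] * e for i, e in enumerate(experts))
```

In both cases the module returns a weighted combination of the context, POS, and GloVe expert encodings for one target word.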

Data Cleaning and Augmentation
Given that the dataset is somewhat small and there are some flaws in the labeled data, we adopt simple cleaning and augmentation strategies. We remove HTML tags and emojis from tweets, and replace the symbol @username with a generic placeholder. Moreover, we directly remove samples with wrongly labeled target word positions. There are many different data augmentation strategies, such as token shuffling, cutoff, and back-translation; in this paper, we only introduce the WiC dataset (Pilehvar and Camacho-Collados, 2018) for data augmentation.
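The cleaning steps above can be sketched with simple regular expressions (the `@user` placeholder token and the specific emoji ranges are illustrative assumptions; the paper does not specify them):

```python
import re

def clean_tweet(text: str) -> str:
    """Minimal tweet-cleaning sketch: strip HTML tags and emojis, mask usernames."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"@\w+", "@user", text)  # replace @username with a placeholder
    # Drop characters in common emoji/symbol Unicode ranges (illustrative subset)
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```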

Adversarial Training
Adversarial attacks have been widely applied in both computer vision and natural language processing to improve model robustness. We implement this strategy with the Fast Gradient Method (FGM; Goodfellow et al., 2014), which directly uses the gradient to compute a perturbation and augments the input with it to maximize the adversarial loss. The training objective can be summarized as follows:

min_θ E_(x,y)~D [ max_{||Δx||≤ε} L(x + Δx, y; θ) ]

where x is the input, y is the gold label, D is the dataset, θ are the model parameters, L(x + Δx, y; θ) is the loss function, and Δx is the perturbation.
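A common way to realize FGM in PyTorch is to perturb the embedding weights along the normalized gradient, run a second forward/backward pass, then restore the weights. This sketch assumes the embedding parameters can be located by name (the `embeddings` name filter is an assumption about the model's parameter naming):

```python
import torch

class FGM:
    """Fast Gradient Method sketch: Δx = ε · g / ||g|| applied to embedding weights."""

    def __init__(self, model, eps=1.0, emb_name="embeddings"):
        self.model, self.eps, self.emb_name = model, eps, emb_name
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()   # save original weights
                norm = torch.norm(p.grad)
                if norm != 0:
                    p.data.add_(self.eps * p.grad / norm)  # add perturbation

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]           # undo the perturbation
        self.backup = {}
```

In a training loop, one would typically call `loss.backward()`, then `fgm.attack()`, compute and backpropagate the adversarial loss to accumulate gradients, call `fgm.restore()`, and finally `optimizer.step()`.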

Model Ensemble
For the final prediction, we implement a model ensemble method. In detail, we take one base model and the two MoE models described above, obtain their prediction scores, and average these output scores as the final result.
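The score-averaging step is straightforward; a minimal sketch (assuming each model outputs per-class scores of the same shape):

```python
import numpy as np

def ensemble_predict(model_scores):
    """Average per-class scores from several models, then take the argmax label.

    model_scores: list of (n_samples, n_classes) arrays, one per model.
    """
    avg = np.mean(np.stack(model_scores), axis=0)
    return avg.argmax(axis=-1)
```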

Experiments 4.1 Experimental Setup
Our implementation is based on the Transformers library by HuggingFace (Wolf et al., 2019) for the pre-trained models and corresponding tokenizers.
During training, the data is processed in batches of size 8, the maximum length of each sample is set to 256, and the learning rate is set to 1e-6 with a warmup ratio of 10%. By default, we set ε to 1.0 in FGM and use a two-layer MLP with a hidden size of 256. When MoE models are employed, the hidden size of the BiLSTM is set to 1024, and the pre-trained Twitter GloVe word vectors2 are used for word embedding initialization. Moreover, we use the nltk toolkit3 to extract POS tags, and the POS embeddings are randomly initialized. Our system jointly optimizes over different experts, but their model architectures differ; we adopt differential learning rates to tackle this problem. The learning rate for the transformer-based model is set to 1e-6, and the learning rate for the BiLSTMs is set to 1e-4.
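Differential learning rates map naturally onto per-parameter groups in PyTorch optimizers. A minimal sketch (the name-based split on `"lstm"` and the use of AdamW are illustrative assumptions):

```python
import torch

def build_optimizer(model):
    """Differential learning rates sketch: smaller LR for the pre-trained
    transformer encoder, larger LR for the BiLSTM experts."""
    lstm_params = [p for n, p in model.named_parameters() if "lstm" in n.lower()]
    base_params = [p for n, p in model.named_parameters() if "lstm" not in n.lower()]
    return torch.optim.AdamW([
        {"params": base_params, "lr": 1e-6},  # pre-trained LM
        {"params": lstm_params, "lr": 1e-4},  # BiLSTM experts
    ])
```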

Results and Analysis
In this section, we first present experimental results for the base model. Then we experiment with MoE models using the effective strategies validated on the base model. Finally, we report the results of the model ensemble.
We explore the impact of different pre-trained LMs adopted as the contextual encoder. The results in Table 1 show that DeBERTa-large performs well on this task, and TimeLMs (Loureiro et al., 2022a) perform better than generic RoBERTa since they are adapted to the Twitter domain. Moreover, TimeLMs-2020-09 achieves almost the best results among the TimeLMs, largely because the dev dataset is distributed over this time period. From the last two rows in Table 1, we find that data cleaning and augmentation increase the macro-F1 score by 2.83 percentage points, and FGM training increases it by a further 2.57 points. Additionally, the ablation results on the matching layer, presented in Appendix A, show that the first-token [CLS] embedding helps improve performance on this task, and that the subtraction and Hadamard product operations also help capture the difference between target words in the two tweets.
When we experiment with MoE models, data cleaning and augmentation and FGM training are adopted by default, and the pre-trained DeBERTa-large is used as the contextual encoder. Table 2 shows the performance of different MoE models. We find that integrating POS information and word semantic representation through an MoE architecture improves performance considerably. More specifically, the MoE models with S-Gate and J-Gate achieve macro-F1 scores of 79.25% and 79.19% respectively, both exceeding the base model by more than 2 percentage points. For further analysis, we conduct ablation studies: experimenting with POS information and GloVe separately, we find that using an MoE model to integrate POS information improves performance by about 1 point, while using an MoE model to combine word semantic representation increases the macro-F1 score by about 2 points. Table 3 gives the results of our model ensemble method. By averaging the prediction scores of one base model and the two MoE models (S-Gate + POS + GloVe, J-Gate + POS + GloVe), the macro-F1 score increases by more than 1 point on the dev dataset. Our model ensemble achieves a macro-F1 score of 77.05% on the test dataset, which attains first place in this task.

Conclusion
In this work, we provide an overview of our combined approach to detecting meaning shift in social media. We investigate the impact of adopting different pre-trained LMs, finding that DeBERTa performs best for this task. Experimental results show that strategies such as data augmentation and adversarial training can enhance the model's robustness. In particular, incorporating POS information and word-level semantic representation with MoE models can significantly improve performance. For future work, we will investigate how to incorporate different TimeLMs with MoE models for this task.

A Additional Experiments on the base model
In this part, we present several additional experimental results on the base model. We tried different target word representation methods for the contextual embedding. The results on the dev dataset are listed in Table 4.

Table 1 :
Results of base model on dev dataset

Table 2 :
Results of MoE-based models on dev dataset

Table 3 :
Ensemble results on both Dev and Test dataset

Table 4 :
Results of different target word representation methods

To make further analysis, we conducted ablation studies to investigate the contribution of different components of the matching layer. Results are shown in Table 5.

Table 5 :
Results of different components of matching layer