Using Two Losses and Two Datasets Simultaneously to Improve TempoWiC Accuracy

WSD (Word Sense Disambiguation) is the task of identifying which sense of a word is meant in a sentence or other segment of text. Researchers have worked on this task for years (e.g., Pustejovsky, 2002), but it remains challenging even for SOTA (state-of-the-art) LMs (language models). TempoWiC, a new dataset introduced by Loureiro et al. (2022b), focuses on the fact that word meanings change over time; their best baseline achieves 70.33% macro-F1. In this work, we train with two losses simultaneously and further improve generalization by adding a second, similar dataset. Our best configuration beats their best baseline by 4.23% macro-F1.


Introduction
Pilehvar and Camacho-Collados (2018) introduced the WiC dataset, which frames word sense disambiguation as binary classification over pairs of sentences that share an identical target word with potentially different meanings. Raganato et al. (2020) later introduced XL-WiC, which enriches WiC with more examples and additional languages. We use the English portion of XL-WiC as an auxiliary dataset to improve the generalization of our model. The baselines of Loureiro et al. (2022b) include RoBERTa (Liu et al., 2019) base and large, TimeLMs (Loureiro et al., 2022a) 2019-90M and 2021-124M, and BERTweet (Nguyen et al., 2020) base and large. They examine two ways of using these models: fine-tuning, and SP-WSD layer pooling weights as explained in Loureiro et al. (2022c). Their best result, 70.33% macro-F1, is obtained with TimeLMs-2019-90M and SP-WSD. We examine RoBERTa-base and TimeLMs-Jun2022-153M.
For classification, many previous works (e.g., Peters et al., 2019) follow the standard practice of concatenating both sentences with a [SEP] token and fine-tuning the [CLS] embedding. In this work, we use two losses simultaneously: a cross entropy loss on the output of the standard RoBERTa classification head, and a cosine embedding loss on the average of the target word's output embeddings.
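For illustration, the following is a minimal sketch of the standard sentence-pair encoding with a Hugging Face tokenizer; the sentences and model name are only examples, and RoBERTa uses <s> and </s> in place of [CLS] and [SEP].

```python
from transformers import AutoTokenizer

# Sketch of the standard sentence-pair encoding (the sentences are invented examples).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sent1 = "She deposited the check at the bank."
sent2 = "They had a picnic on the bank of the river."

# The tokenizer joins the two sentences with separator tokens; the classification
# head is then fine-tuned on the embedding of the first token.
encoding = tokenizer(sent1, sent2, truncation=True, return_tensors="pt")
print(tokenizer.decode(encoding["input_ids"][0]))
```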

Model
We use LMs as the base model and add a classification head as well as a cosine similarity + sigmoid path on top. The classification head consists of two FC (fully connected) layers with a dropout layer between them, like the standard RoBERTa classification head. Because RoBERTa uses a byte-level BPE (Byte-Pair Encoding) scheme, a single word may be split into several tokens and therefore have multiple output embeddings. For the second output path, we average the (possibly multiple) embeddings of the target word's tokens in the first sentence and in the second sentence, compare the two averages with cosine similarity, and apply a sigmoid activation to obtain a binary prediction. Our experiments show that the second output path is more accurate by a large margin. Figure 1 shows an overview of the described architecture.
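The sketch below shows one possible PyTorch implementation of the two output paths, assuming precomputed masks marking the target word's tokens in each sentence; the class, argument, and mask names are our own illustration, not the exact code we used.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TwoPathWiCModel(nn.Module):
    # Sketch only: class, argument, and mask names are illustrative.
    def __init__(self, lm_name="roberta-base", dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(lm_name)
        hidden = self.encoder.config.hidden_size
        # Path 1: two FC layers with dropout between them, as in the
        # standard RoBERTa classification head.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Dropout(dropout),
            nn.Linear(hidden, 2),
        )

    def forward(self, input_ids, attention_mask, target_mask_1, target_mask_2):
        # target_mask_1 / target_mask_2: float tensors of shape (batch, seq_len)
        # with 1.0 on the BPE tokens of the target word in each sentence.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        states = out.last_hidden_state  # (batch, seq_len, hidden)

        # Path 1: logits from the first (<s>) token.
        logits = self.classifier(states[:, 0])

        # Path 2: average the target word's token embeddings in each sentence,
        # compare them with cosine similarity, and squash to (0, 1).
        emb1 = (states * target_mask_1.unsqueeze(-1)).sum(1) / target_mask_1.sum(1, keepdim=True)
        emb2 = (states * target_mask_2.unsqueeze(-1)).sum(1) / target_mask_2.sum(1, keepdim=True)
        similarity = torch.sigmoid(nn.functional.cosine_similarity(emb1, emb2))

        return logits, similarity, emb1, emb2
```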

Loss Function
Our loss function is the sum of two terms: a cross entropy loss on the standard RoBERTa classification head, and a cosine embedding loss on the last-layer embeddings of the target word, which rewards similarity when the meanings are the same and dissimilarity when they differ. The second loss helps the model pull contextual embeddings of the target word closer together when the meaning is the same and push them apart when it is not.
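A minimal PyTorch sketch of this combined objective follows, assuming labels of 1 for same-meaning pairs and 0 otherwise; the equal weighting of the two terms is an assumption for illustration.

```python
import torch.nn as nn

# Sketch of the combined objective (label convention and equal weighting are assumptions).
ce_loss = nn.CrossEntropyLoss()
cos_loss = nn.CosineEmbeddingLoss()

def combined_loss(logits, emb1, emb2, labels):
    # labels: 1 if the target word has the same meaning in both sentences, else 0.
    loss_cls = ce_loss(logits, labels)
    # CosineEmbeddingLoss expects +1 for "similar" pairs and -1 for "dissimilar" ones:
    # it pulls same-meaning embeddings together and pushes different-meaning ones apart.
    cos_targets = labels.float() * 2 - 1
    loss_cos = cos_loss(emb1, emb2, cos_targets)
    return loss_cls + loss_cos
```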

Dataset
The main dataset is TempoWiC, but we also use the XL-WiC dataset to make our model more robust. Note that XL-WiC samples are not tweets, so they are out-of-domain, and adding them may degrade accuracy if the combined dataset is not representative of the target domain. We explore three settings: the main dataset alone, the main dataset plus a random subset of XL-WiC, and the main dataset plus the whole XL-WiC.
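The sketch below illustrates the three settings, assuming the examples are already preprocessed into a common format; the function name, argument names, and default subset size are hypothetical.

```python
import random

# Illustrative helper for the three data settings (names and default subset size are hypothetical).
def build_train_set(tempowic_train, xlwic_en, mode="subset", subset_size=5000, seed=42):
    if mode == "tempowic_only":
        return list(tempowic_train)
    if mode == "subset":
        # Add a random subset of the (out-of-domain) XL-WiC examples.
        rng = random.Random(seed)
        extra = rng.sample(xlwic_en, min(subset_size, len(xlwic_en)))
    else:  # mode == "all": add the whole English XL-WiC
        extra = list(xlwic_en)
    return list(tempowic_train) + extra
```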

Framework & Tools
We use PyTorch (Paszke et al., 2019) and the Hugging Face Transformers library (Wolf et al., 2020) to implement our models, and we report results on the CodaLab online platform.

Hyper-parameters
We use Ray Tune (Liaw et al., 2018) to tune our hyper-parameters, including the learning rate, number of training epochs, random seed, batch size, and weight decay. Increasing the weight decay helps us avoid over-fitting, which was the main problem with our initial model.
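An illustrative Ray Tune search space for these hyper-parameters is shown below; the ranges are examples, not the exact values we searched.

```python
from ray import tune

# Illustrative search space; the ranges are examples, not the exact values we searched.
search_space = {
    "learning_rate": tune.loguniform(1e-6, 1e-4),
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
    "seed": tune.choice([13, 42, 1234]),
    "per_device_train_batch_size": tune.choice([8, 16, 32]),
    # A larger weight decay was the main lever against over-fitting.
    "weight_decay": tune.uniform(0.0, 0.3),
}

# With the Hugging Face Trainer, this space can be passed to
# trainer.hyperparameter_search(hp_space=lambda _: search_space, backend="ray", n_trials=20).
```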

Experiments
We test multiple configurations: each of the two LMs (RoBERTa-base and TimeLMs-Jun2022-153M) combined with the three dataset settings described above (TempoWiC alone, TempoWiC plus a random subset of XL-WiC, and TempoWiC plus the whole XL-WiC).

Results
The biggest problem we faced was overfitting, which is expected since we fine-tune transformer-based LMs.

Conclusion
In this work, we beat the best baseline of Loureiro et al. (2022b) by a large margin. To do this, we train SOTA LMs with two losses simultaneously (a cross entropy loss on the standard classifier head and a cosine embedding loss on the average of the target word's output embeddings), and we use XL-WiC as an auxiliary dataset to generalize better. The best LM was TimeLMs-Jun2022-153M, a model pre-trained on 153M tweets.

Figure 1: An overview of the architecture. We use two losses simultaneously: a cross entropy loss on the standard classifier head (black path) and a cosine embedding loss on the average of the target word's output embeddings (red path).