DefSent: Sentence Embeddings using Definition Sentences

Sentence embedding methods using natural language inference (NLI) datasets have been successfully applied to various tasks. However, these methods are available only for limited languages because they rely heavily on large NLI datasets. In this paper, we propose DefSent, a sentence embedding method that uses definition sentences from a word dictionary. Since dictionaries are available for many languages, DefSent is more broadly applicable than methods using NLI datasets, without requiring additional datasets to be constructed. We demonstrate that DefSent performs comparably on unsupervised semantic textual similarity (STS) tasks and slightly better on SentEval tasks than the methods using large NLI datasets. Our code is publicly available at https://github.com/hpprc/defsent.

Introduction
Sentence embeddings represent sentences as dense vectors in a low-dimensional space. Recently, sentence embedding methods using natural language inference (NLI) datasets have been successfully applied to various tasks, including semantic textual similarity (STS) tasks. However, these methods are available only for limited languages because they rely heavily on large NLI datasets. In this paper, we propose DefSent, a sentence embedding method that uses definition sentences from a word dictionary. Since dictionaries are available for many languages, DefSent is more broadly applicable than the methods using NLI datasets, without requiring additional datasets to be constructed.
DefSent is similar to the model proposed by Hill et al. (2016) in that it generates sentence embeddings such that the embedding of a definition sentence is similar to that of the word it defines. However, while Hill et al. (2016)'s model is based on recurrent neural network language models, DefSent is based on pre-trained language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), with a fine-tuning mechanism, as is Sentence-BERT (Reimers and Gurevych, 2019). Sentence-BERT is one of the state-of-the-art sentence embedding models; it is based on pre-trained language models fine-tuned on NLI datasets. Overviews of Sentence-BERT and DefSent are depicted in Figure 1.

Figure 1: Sentence-BERT (left) and DefSent (right).

Sentence Embedding Methods
In this section, we introduce BERT, RoBERTa, and Sentence-BERT, followed by a description of DefSent, our proposed sentence embedding method.

BERT and RoBERTa
BERT is a pre-trained language model based on the Transformer architecture (Vaswani et al., 2017). Utilizing masked language modeling and next sentence prediction, BERT acquires linguistic knowledge and outputs contextualized word embeddings. In masked language modeling, a specific proportion of input tokens is replaced with a special token [MASK], and the model is trained to predict these masked tokens. Next sentence prediction is a task to predict whether two sentences connected by a sentence separator token [SEP] are consecutive sentences in the original text data. BERT uses the output embedding of the special token [CLS] at the beginning of the input for this prediction.
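
To make the masking step concrete, here is a toy pure-Python sketch of how input tokens can be selected and replaced with [MASK] (the function name and simplified scheme are ours; actual BERT also keeps some selected tokens unchanged or replaces them with random tokens, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a fraction of input tokens with [MASK]; during pre-training
    the model is trained to recover the original tokens at those positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the label the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets
```

The `targets` mapping plays the role of the labels in the masked-token prediction loss.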
RoBERTa has the same structure as BERT. It attempts to improve on BERT by removing next sentence prediction from the pre-training objectives and increasing the data size and batch size. While both Sentence-BERT and DefSent are applicable to BERT and RoBERTa, we use BERT in the explanations in this paper.

Sentence-BERT
Conneau et al. (2017) proposed InferSent, a sentence encoder based on a Siamese network structure. InferSent trains the sentence encoder such that similar sentences are distributed close to each other in the semantic space. Reimers and Gurevych (2019) proposed Sentence-BERT, which also uses a Siamese network to create BERT-based sentence embeddings. An overview of Sentence-BERT is depicted on the left side of Figure 1. Sentence-BERT first inputs the sentences to BERT and then constructs a sentence embedding from the output contextualized word embeddings by pooling. They utilize the following three pooling strategies.
CLS Using the [CLS] token embedding. When using RoBERTa, since the [CLS] token does not exist, the beginning-of-sentence token <s> is used instead.
Mean Using the mean of the contextualized embeddings of all words in a sentence.
Max Using the max-over-time of the contextualized embeddings of all words in a sentence.
Let u and v be the sentence embeddings obtained by pooling for each sentence of a pair. Sentence-BERT then composes the vector [u; v; |u − v|] and feeds it to a label prediction layer, whose number of output dimensions equals the number of classes. For fine-tuning, Reimers and Gurevych (2019) use the SNLI dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018), which together contain about one million sentence pairs.
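
The three pooling strategies and the [u; v; |u − v|] feature composition can be sketched in plain Python over toy token embeddings (function names are ours; a real implementation operates on framework tensors):

```python
def cls_pool(token_embs):
    # CLS: embedding of the first token ([CLS] / <s>)
    return token_embs[0]

def mean_pool(token_embs):
    # Mean: average each dimension over all tokens
    n = len(token_embs)
    return [sum(tok[d] for tok in token_embs) / n
            for d in range(len(token_embs[0]))]

def max_pool(token_embs):
    # Max: max-over-time of each dimension
    return [max(tok[d] for tok in token_embs)
            for d in range(len(token_embs[0]))]

def pair_features(u, v):
    # Sentence-BERT's classifier input: [u; v; |u - v|]
    return u + v + [abs(a - b) for a, b in zip(u, v)]
```

For a pair of pooled embeddings of dimension d, `pair_features` yields a 3d-dimensional input for the label prediction layer.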

DefSent
A definition sentence and the word it defines share the same meaning, so we focus on the relationship between them. To learn how to embed sentences in the semantic vector space, we train the sentence embedding model by predicting the word from its definitions. An overview of DefSent is depicted on the right side of Figure 1. We call the layer that predicts the original token from the [MASK] embeddings in masked language modeling during BERT pre-training the word prediction layer. We use w_k to denote the word corresponding to a given definition sentence X_k.
DefSent inputs the definition sentence X_k to BERT and derives the sentence embedding u by pooling the output embeddings. As in Sentence-BERT, three pooling strategies are used: CLS, Mean, and Max. The derived sentence embedding u is then fed to the word prediction layer to obtain the probability P(w_k | X_k). We use cross-entropy loss as the loss function and fine-tune BERT to maximize P(w_k | X_k).
In DefSent, the parameters of the word prediction layer are fixed. This setting allows us to fine-tune models without training an additional classifier, unlike InferSent and Sentence-BERT. Additionally, since our method uses a word prediction layer that has been pre-trained with masked language modeling, the sentence embedding u is expected to be similar to the contextualized word embedding of w_k when w_k appears with the same meaning as X_k.
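
The forward computation of P(w_k | X_k) from a pooled sentence embedding can be sketched in pure Python with toy dimensions (in the actual method, `pred_weights` and `pred_bias` are the frozen pre-trained word prediction layer, not illustrative values as here):

```python
import math

def word_probabilities(sent_emb, pred_weights, pred_bias):
    """Feed a pooled sentence embedding through a (frozen) word
    prediction layer and softmax over the vocabulary."""
    logits = [sum(w_d * e for w_d, e in zip(row, sent_emb)) + b
              for row, b in zip(pred_weights, pred_bias)]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy_loss(probs, target_index):
    # DefSent fine-tunes BERT to maximize P(w_k | X_k),
    # i.e. to minimize -log P of the defined word.
    return -math.log(probs[target_index])
```

Only the encoder parameters receive gradients from this loss; the prediction layer stays fixed.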

Word Prediction Experiment
To evaluate how well DefSent can predict words from sentence embeddings, we conducted an experiment to predict a word from its definition.

Dataset
DefSent requires pairs of a word and its definition sentence. We extracted these from the Oxford Dictionary dataset used by Ishiwatari et al. (2019). Each entry in the dataset consists of a word and one of its definition sentences; a word can have multiple definitions. We split this dataset into train, dev, and test sets in the ratio of 8:1:1 word by word, to evaluate how well the model can embed unseen definitions of unseen words. It is worth noting that since DefSent utilizes the pre-trained word prediction layers of BERT and RoBERTa, it cannot output probabilities for out-of-vocabulary (OOV) words, so losses for OOV words cannot be calculated in a straightforward way. In our experiments, we therefore use only the words contained in each model's vocabulary, together with their definitions. The statistics of the datasets are listed in Table 1.
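
A word-level 8:1:1 split, keeping all definitions of a word inside one split so that test words are truly unseen, can be sketched as follows (function name and split mechanics are ours, not the authors' released code):

```python
import random

def split_by_word(entries, seed=0):
    """Split (word, definition) pairs 8:1:1 at the word level, so that
    every definition of a word lands in the same split."""
    words = sorted({w for w, _ in entries})
    rng = random.Random(seed)
    rng.shuffle(words)
    n = len(words)
    train_w = set(words[: int(0.8 * n)])
    dev_w = set(words[int(0.8 * n): int(0.9 * n)])
    buckets = {"train": [], "dev": [], "test": []}
    for w, d in entries:
        key = "train" if w in train_w else "dev" if w in dev_w else "test"
        buckets[key].append((w, d))
    return buckets
```

Splitting by word rather than by entry is what makes the evaluation measure generalization to unseen words, not just unseen definitions.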

Settings
We used the following pre-trained models from Transformers (Wolf et al., 2020): BERT-base (bert-base-uncased), BERT-large (bert-large-uncased), RoBERTa-base (roberta-base), and RoBERTa-large (roberta-large). The batch size was 16, the number of fine-tuning epochs was 1, the optimizer was Adam (Kingma and Ba, 2015), and we used a linear learning rate warm-up over the first 10% of the training data. For each model and pooling strategy, the learning rate was chosen from 2^x × 10^-6 with x ∈ {0, 0.5, 1, ..., 7}, based on the highest mean reciprocal rank (MRR) on the dev set. We conducted experiments with ten different random seeds and used their mean as the evaluation score. Top-k accuracy (the percentage of correct answers within the first, third, and tenth positions) and MRR were calculated from the output word probabilities when a definition sentence was fed into the model. We also evaluated BERT-base without fine-tuning for comparison.

Table 2 shows the experimental results. Max was the best pooling strategy for BERT-base without fine-tuning, but its top-1 accuracy was extremely low at 0.0157, indicating that models without fine-tuning are not adequate for predicting words from definitions. DefSent performed better with larger models. For BERT, CLS was the best pooling strategy for both the base and large models. CLS was also the best pooling strategy for RoBERTa-base, but Mean was the best for RoBERTa-large.
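
The two metrics, MRR and top-k accuracy, can be computed from the vocabulary ranked by predicted probability for each definition; a minimal sketch (function and variable names are ours):

```python
def mrr_and_topk(ranked_predictions, gold_words, ks=(1, 3, 10)):
    """Compute Mean Reciprocal Rank and top-k accuracy.
    ranked_predictions: for each definition, the vocabulary sorted by
    descending predicted probability; gold_words: the defined words."""
    rr_sum = 0.0
    hits = {k: 0 for k in ks}
    for ranking, gold in zip(ranked_predictions, gold_words):
        rank = ranking.index(gold) + 1    # 1-based rank of the correct word
        rr_sum += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    n = len(gold_words)
    return rr_sum / n, {k: hits[k] / n for k in ks}
```

For example, if the correct word is ranked first for one definition and third for another, the MRR is (1 + 1/3) / 2 = 2/3.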

Extrinsic Evaluations
Next, to evaluate the general quality of the constructed sentence embeddings, we conducted evaluations on semantic textual similarity (STS) tasks and SentEval tasks (Conneau and Kiela, 2018). We report the fine-tuning time and computing infrastructure in Appendix A, and the learning rates, means, and standard deviations for the word prediction experiment in Appendix B. We also show the actual predicted words when definition sentences and other sentences are given as inputs in Appendices C and D, respectively.

Settings
We compared the performance of DefSent with several existing sentence embedding methods, including InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and Sentence-BERT (Reimers and Gurevych, 2019). For the pooling strategies, we used the strategy that achieved the highest MRR in the word prediction task for each pre-trained model. The scores of the existing methods were taken from Reimers and Gurevych (2019).

Semantic textual similarity tasks
We evaluated DefSent on unsupervised STS tasks.
In these tasks, we compute the semantic similarity of each given sentence pair and calculate Spearman's rank correlation ρ between the computed similarities and the gold similarity scores. In the unsupervised setting, none of the models are optimized on the STS datasets. Instead, the similarities of the given sentence embeddings are calculated using common similarity measures such as negative Manhattan distance, negative Euclidean distance, and cosine similarity. In this study, we used cosine similarity. We performed experiments on unsupervised STS tasks using the STS12-16 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014) datasets. These datasets contain sentence pairs with similarity scores, each a real number from 0 to 5 assigned by human evaluation. Experiments were conducted with ten different random seeds, and the mean was used as the evaluation score.

Table 3 shows the experimental results. Although the training data used by DefSent is only about 5% of the size of Sentence-BERT's, DefSent-BERT-base and DefSent-RoBERTa-base performed comparably to Sentence-BERT-base and Sentence-RoBERTa-base. In particular, the DefSent-RoBERTa models showed high performance on the STS Benchmark.
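
The evaluation pipeline, cosine similarity between embeddings followed by Spearman's ρ against gold scores, can be sketched in pure Python (the ρ formula below is the textbook rank-difference form and assumes no tied values; evaluation toolkits handle ties more carefully):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman_rho(xs, ys):
    """Spearman's rank correlation: rho = 1 - 6 * sum(d^2) / (n(n^2-1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Because Spearman's ρ depends only on ranks, any monotone rescaling of the similarity scores leaves the evaluation unchanged.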

SentEval
SentEval (Conneau and Kiela, 2018) is a popular toolkit for evaluating the quality of universal sentence embeddings that aggregates various tasks, including binary and multi-class classification, natural language inference, and sentence similarity. For the SentEval evaluations, we trained a logistic regression classifier using sentence embeddings as input features, to evaluate the extent to which each sentence embedding contains the information important for each task. We used the same tasks and settings as Reimers and Gurevych (2019) and performed 10-fold cross-validation. We conducted experiments with three different random seeds, and the mean was used as the evaluation score.

Table 4 shows the results. DefSent-RoBERTa-large achieved the best average score among all models, and increasing the model size consistently improved performance. DefSent-BERT-large, DefSent-RoBERTa-base, and DefSent-RoBERTa-large outperformed the Sentence-BERT-based methods. These results indicate that DefSent embeds useful information that can be applied to various tasks.
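
SentEval performs the cross-validation internally; purely as an illustration of the 10-fold protocol, the train/test index folds can be generated as follows (function name is ours):

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each example is held out exactly once while the classifier is
    trained on the remaining folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test
```

Reported task scores are the mean accuracy over the k held-out folds.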

Conclusion
In this paper, we proposed DefSent, a new sentence embedding method using a dictionary, and demonstrated its effectiveness through a series of experiments. Its performance was comparable to or even slightly better than existing methods using large NLI datasets. DefSent is based on dictionaries developed for many languages, so it does not require new language resources when applied to other languages. Since the model is trained with the same word prediction process as the masked language modeling, sentence embeddings derived by DefSent are expected to be similar to contextualized word embeddings of a word when it appears with the same meaning as the definition.
In future work, we will evaluate the performance of DefSent when it is applied to languages other than English and when it is applied to a broader range of downstream tasks, such as document classification tasks. We will also analyze the relationship between the sentence embeddings by DefSent and the contextualized word embeddings in the semantic vector space and investigate how model architecture and size influence the embeddings.

A Average Runtime and Computing Infrastructure
Fine-tuning DefSent-BERT-base and DefSent-RoBERTa-base took about 5 minutes on a single NVIDIA GeForce GTX 1080 Ti. Fine-tuning DefSent-BERT-large and DefSent-RoBERTa-large took about 15 minutes on a single Quadro GV100.

B Word Prediction Results
Table 5 shows the results of the word prediction experiment for each model and pooling strategy, together with the learning rate.

C Predicted Words for Definition Sentences
Table 6 shows the predicted words when the embeddings of definition sentences are input. We used BERT-large as the model and CLS as the pooling strategy for this experiment. For prediction, sentences were first input into the model to obtain sentence embeddings; the sentence embeddings were then input into the pre-trained word prediction layer to obtain word probabilities. We show the top five words with the highest probability.

D Predicted Words for Other Sentences
Table 7 shows the predicted words when the embeddings of sentences other than definition sentences are input. We used BERT-large as the model and CLS as the pooling strategy for this experiment. The evaluation procedure is the same as in Appendix C.

Table 9: The percentage of correct answers (%) for each task of SentEval. The scores are the mean and standard deviation of three evaluations with different random seeds.