Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words. We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV words with few additional parameters. Extensive evaluations demonstrate that our lightweight model achieves similar or even better performance than prior competitors, both on original datasets and on corrupted variants. Moreover, it can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness.


Introduction
Word embeddings represent words as vectors (Mikolov et al., 2013a,b; Pennington et al., 2014). They have been instrumental in neural network approaches that brought impressive performance gains to many natural language processing (NLP) tasks. These approaches use a fixed-size vocabulary. Thus they can deal only with words that have been seen during training. While this works well on many benchmark datasets, real-world corpora are typically much noisier and contain Out-of-Vocabulary (OOV) words, i.e., rare words, domain-specific words, slang words, and words with typos, which have not been seen during training. Model performance deteriorates sharply with unseen words, and minor character perturbations can flip the prediction of a model (Liang et al., 2018; Belinkov and Bisk, 2018; Sun et al., 2020; Jin et al., 2020). Simple experiments (Figure 1) show that the addition of typos to datasets degrades the performance of text classification models considerably. To alleviate this problem, pioneering work pre-trained word embeddings with morphological features (sub-word tokens) on large-scale datasets (Wieting et al., 2016; Bojanowski et al., 2017; Heinzerling and Strube, 2017; Zhang et al., 2019). One of the most prominent works in this direction is FastText (Bojanowski et al., 2017), which incorporates character n-grams into the skip-gram model. With FastText, vectors of unseen words can be imputed by summing up the n-gram vectors. However, these subword-level models come at great cost: they require pre-training from scratch and have a high memory footprint. Hence, several simpler approaches have been developed, e.g., MIMICK (Pinter et al., 2017), BoS (Zhao et al., 2018), and KVQ-FH (Sasaki et al., 2019). These use only the surface form of words to generate vectors for unseen words, by learning from pre-trained embeddings.
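FastText's imputation step can be sketched in a few lines. The following is a toy illustration, not FastText's actual implementation: the tiny `table` of n-gram vectors is a stand-in for the roughly 2M n-gram vectors a real FastText model learns during pre-training.

```python
# Toy sketch of FastText-style OOV imputation: an unseen word's vector is
# the sum of the vectors of its character n-grams (with boundary markers).

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams of a word, with '<'/'>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def impute_vector(word, ngram_vectors, dim=4):
    """Sum the vectors of the word's known n-grams (zero vector if none known)."""
    vec = [0.0] * dim
    for g in char_ngrams(word):
        if g in ngram_vectors:
            for j, x in enumerate(ngram_vectors[g]):
                vec[j] += x
    return vec

# A tiny hypothetical n-gram table (real FastText learns these vectors).
table = {"<mi": [1, 0, 0, 0], "mis": [0, 1, 0, 0], "ing": [0, 0, 1, 0]}
v = impute_vector("misspelling", table)   # sums the three known n-gram vectors
```

Even a word never seen in training gets a vector as long as some of its n-grams are known, which is exactly what makes the method robust to OOV words but expensive to store.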
Figure 2: Our lightweight OOV model, LOVE, learns the behavior of pre-trained embeddings (e.g., FastText and BERT), and is then able to impute vectors for unseen words. LOVE can enhance the robustness of existing word representations in a plug-and-play fashion.

Although MIMICK-like models can efficiently reduce the parameters of pre-trained representations and alleviate the OOV problem, two main challenges remain. First, the models remain bound in the trade-off between complexity and performance: the original MIMICK is lightweight but does not consistently produce high-quality word vectors. BoS and KVQ-FH obtain better word representations but need more parameters. Second, these models cannot be used with existing pre-trained language models such as BERT. It is these models, however, to which we owe so much progress in the domain (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2020). To date, these high-performing models are still fragile when dealing with rare words (Schick and Schütze, 2020), misspellings (Sun et al., 2020), and domain-specific words (El Boukkouri et al., 2020).
We address these two challenges head-on: we design a new contrastive learning framework, dubbed LOVE (Learning Out-of-Vocabulary Embeddings), to learn the behavior of pre-trained embeddings. Our model builds upon a memory-saving mixed input of characters and subwords instead of n-gram characters. It encodes this input by a lightweight Positional Attention Module. During training, LOVE uses novel types of data augmentation and hard negative generation. The model is then able to produce high-quality word representations that are robust to character perturbations, while consuming only a fraction of the cost of existing models. For instance, LOVE with 6.5M parameters can obtain similar representations as the original FastText model with more than 900M parameters. What is more, our model can be used in a plug-and-play fashion to robustify existing language models. We find that using LOVE to produce vectors for unseen words improves the performance of FastText and BERT by around 1.4-6.8 percentage points on noisy text, without hampering their original capabilities (as shown in Figure 2).
In the following, Section 2 discusses related work, Section 3 introduces preliminaries, Section 4 presents our approach, Section 5 shows our experiments, and Section 6 concludes. The appendix contains additional experiments and analyses. Our code and data are available at https://github.com/tigerchen52/LOVE

Related Work

Character-level Embeddings
To address OOV problems, some approaches inject character-level features into word embeddings during the pre-training (Wieting et al., 2016;Cao and Rei, 2016;Bojanowski et al., 2017;Heinzerling and Strube, 2017;Kim et al., 2018;Li et al., 2018;Üstün et al., 2018;Piktus et al., 2019;Zhu et al., 2019;Zhang et al., 2019;Hu et al., 2019). One drawback of these methods is that they need to pre-train on a large-scale corpus from scratch. Therefore, simpler models have been developed, which directly mimic the well-trained word embeddings to impute vectors for OOV words. Some of these methods use only the surface form of words to generate embeddings for unseen words (Pinter et al., 2017;Zhao et al., 2018;Sasaki et al., 2019;Fukuda et al., 2020;Jinman et al., 2020), while others use both surface and contextual information to create OOV vectors (Schick and Schütze, 2019a,b). In both cases, the models need an excessive number of parameters. FastText, e.g., uses 2 million n-gram characters to impute vectors for unseen words.

Pre-trained Language Models
Currently, the state-of-the-art word representations are pre-trained language models, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), which adopt subwords to avoid OOV problems. However, BERT is brittle when faced with rare words (Schick and Schütze, 2020) and misspellings (Sun et al., 2020). To make BERT more robust, CharacterBERT (El Boukkouri et al., 2020) and CharBERT (Ma et al., 2020) infuse character-level features into BERT and pre-train the variant from scratch. This method can significantly improve the performance and robustness of BERT, but requires pre-training an adapted transformer on a large amount of data. Another work on combating spelling mistakes recommends placing a word corrector before downstream models (Pruthi et al., 2019), which is effective and reusable. The main weakness of this method is that an error generated by the word corrector propagates to downstream tasks. For example, converting "aleph" to "alpha" may break the meaning of a mathematical statement. And indeed, using the word corrector consistently leads to a drop (0.5-2.0 percentage points) in BERT's performance on the SST dataset (Socher et al., 2013).

Contrastive Learning
The origin of contrastive learning can be traced back to the work by Becker and Hinton (1992) and Bromley et al. (1993). This method has achieved outstanding success in self-supervised representation learning for images (Oord et al., 2018; Hjelm et al., 2018; He et al., 2020; Chen et al., 2020; Grill et al., 2020). The contrastive learning framework learns representations from unlabeled data by pulling positive pairs together and pushing negative pairs apart. For training, the positive pairs are often obtained by taking two randomly augmented versions of the same sample and treating the other augmented examples within a mini-batch as negative examples (Chen et al., 2017, 2020). The most widely used loss is the infoNCE loss (or contrastive loss) (Hjelm et al., 2018; Logeswaran and Lee, 2018; Chen et al., 2020; He et al., 2020). Although many approaches adopt contrastive learning to represent sentences (Giorgi et al., 2020; Wu et al., 2020; Gao et al., 2021), it has so far not been applied to word representations.

Mimick-like Model
Given pre-trained word embeddings, and given an OOV word, the core idea of MIMICK (Pinter et al., 2017) is to impute an embedding for the OOV word using the surface form of the word, so as to mimic the behavior of the known embeddings. BoS (Zhao et al., 2018), KVQ-FH (Sasaki et al., 2019), Robust Backed-off Estimation (Fukuda et al., 2020), and PBoS (Jinman et al., 2020) work similarly, and we refer to them as mimick-like models. Formally, we have a fixed-size vocabulary V, with an embedding matrix W ∈ R^{|V|×m}, in which each row is a word vector u_w ∈ R^m for the word w.
A mimick-like model aims to impute a vector v_w for an arbitrary word w, including w ∉ V. The training objective of mimick-like models is to minimize the expected distance between u_w and v_w pairs:

    min Σ_{w ∈ V} ψ(u_w, v_w)    (1)

Here, ψ(·) is a distance function, e.g., the Euclidean distance. The vector v_w is generated from the surface form of the word:

    v_w = φ(ζ(w))    (2)

Here, ζ(·) is a function that maps w to a list of subunits based on the surface form of the word (e.g., a character or subword sequence). After that, the sequence is fed into the function φ(·) to produce vectors, and the inside structure can be CNNs, RNNs, or a simple summation function. After training, the model can impute vectors for arbitrary words. Table 1 shows some features of three mimick-like models.
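The objective above can be sketched end-to-end. The script below is an illustrative toy mimick-like model, not the code of any cited system: ζ(·) maps a word to its character trigrams, φ(·) sums trigram vectors, and plain SGD minimizes the squared Euclidean distance to randomly generated stand-in "pre-trained" vectors.

```python
# Toy mimick-like model: learn phi so that phi(zeta(w)) ~ u_w (Eqs. 1-2).
import random

random.seed(0)
DIM = 8
# Stand-in "pre-trained" embeddings; a real setup would load FastText/BERT vectors.
pretrained = {w: [random.uniform(-1, 1) for _ in range(DIM)]
              for w in ["misspelling", "spelling", "dispelling"]}

def zeta(word):                      # surface form -> subunits (character trigrams)
    return [word[i:i + 3] for i in range(len(word) - 2)]

ngram_vecs = {}                      # the parameters of phi

def phi(units):                      # phi = sum of subunit vectors
    vec = [0.0] * DIM
    for u in units:
        for j, x in enumerate(ngram_vecs.setdefault(u, [0.0] * DIM)):
            vec[j] += x
    return vec

def train(epochs=300, lr=0.02):
    for _ in range(epochs):
        for w, u_w in pretrained.items():
            units = zeta(w)
            v_w = phi(units)
            grad = [2 * (v - u) for v, u in zip(v_w, u_w)]   # d/dv ||v - u||^2
            for unit in units:
                for j in range(DIM):
                    ngram_vecs[unit][j] -= lr * grad[j]

train()
# After training, phi(zeta(w)) is close to u_w for in-vocabulary words, and
# phi can impute a vector for any unseen word that shares subunits.
err = sum((a - b) ** 2 for a, b in zip(phi(zeta("misspelling")), pretrained["misspelling"]))
```

The same trained trigram table then yields a vector for, e.g., "mispelling", since it shares most trigrams with "misspelling".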

Contrastive Learning
Contrastive learning methods have achieved significant success for image representation (Oord et al., 2018; Chen et al., 2020). The core idea of these methods is to encourage learned representations for positive pairs to be close, while pushing representations from sampled negative pairs apart. The widely used contrastive loss (Hjelm et al., 2018; Logeswaran and Lee, 2018; Chen et al., 2020; He et al., 2020) is defined as:

    ℓ_i = −log [ exp(sim(u_i, u⁺)/τ) / ( exp(sim(u_i, u⁺)/τ) + Σ_{u⁻} exp(sim(u_i, u⁻)/τ) ) ]    (3)

Here, τ is a temperature parameter, sim(·) is a similarity function such as cosine similarity, and (u_i, u⁺), (u_i, u⁻) are positive pairs and negative pairs, respectively (assuming that all vectors are normalized). During training, positive pairs are usually obtained by augmentation of the same sample, and negative examples are the other samples in the mini-batch. This process learns representations that are invariant against noisy factors to some extent.
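For concreteness, the loss for a single anchor can be computed as follows; this is a plain-Python sketch with toy 2-dimensional vectors, not an optimized implementation.

```python
# Sketch of the infoNCE / contrastive loss (Eq. 3) for one anchor u_i:
# -log( exp(sim(u,u+)/tau) / (exp(sim(u,u+)/tau) + sum_j exp(sim(u,u-_j)/tau)) )
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, tau=0.07):
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

u  = [1.0, 0.0]
up = [0.9, 0.1]          # augmented view: high similarity to the anchor
un = [[0.0, 1.0]]        # negative: orthogonal to the anchor
loss_good = info_nce(u, up, un)        # small: positive is close, negative far
loss_bad  = info_nce(u, un[0], [up])   # large: "positive" is actually far
```

The low temperature (τ = 0.07, the value used later in the paper) sharpens the softmax, so even moderate similarity differences produce a strong training signal.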

Our Approach: LOVE
LOVE (Learning Out-of-Vocabulary Embeddings) draws on the principles of contrastive learning to maximize the similarity between target and generated vectors, and to push apart negative pairs. An overview of our framework is shown in Figure 3. It is inspired by work in visual representation learning (Chen et al., 2020), but differs in that one of the positive pairs is obtained from pre-trained embeddings instead of using two augmented versions. We adopt five novel types of word-level augmentations and a lightweight Positional Attention Module in this framework. Moreover, we find that adding hard negatives during training can effectively yield better representations. We removed the nonlinear projection head after the encoder layer, because its improvements are specific to the representation quality in the visual field. Furthermore, our approach is not an unsupervised contrastive learning framework, but a supervised learning approach.
Our framework takes a word from the original vocabulary and uses data augmentation to produce a corruption of it. For example, "misspelling" becomes "mispelling" after dropping one letter "s". Next, we obtain a target vector from the pre-trained embeddings for the original word, and we generate a vector for the corrupted word. These two vectors are a pair of positive samples, and we maximize the similarity between them while making the distance of negative pairs (other samples in the same mini-batch) as large as possible. As mentioned before, we use the contrastive loss as an objective function (Eq 3). There are five key ingredients in the framework that we will detail in the following (similar to the ones in Table 1): the Input Method, the Encoder, the Loss Function, our Data Augmentation, and the choice of Hard Negatives.

Input Method
Our goal is to use the surface form to impute vectors for words. The question is thus how to design the function ζ(·) mentioned in Section 3.1 to represent each input word. MIMICK (Pinter et al., 2017) straightforwardly uses the character sequence (see Table 1). This, however, loses the information of morphemes, i.e., sequences of characters that together contribute a meaning. Hence, FastText (Bojanowski et al., 2017) adopts character n-grams. Such n-grams, however, are highly redundant. For example, if we use substrings of length 3 to 5 to represent the word misspelling, we obtain a list with 24 n-gram characters, while only the substrings {mis, spell, ing} are the three crucial units to understand the word. Hence, like BERT, we use WordPiece (Wu et al., 2016) with a vocabulary size of around 30 000 to obtain meaningful subwords of the input word. For the word misspelling, this yields {miss, ##pel, ##ling}. However, if we just swap two letters (as by a typo), then the sequence becomes completely different: {mi, ##sp, ##sell, ##ing}. Therefore, we use both the character sequence and subwords (Figure A1).
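A minimal sketch of this input function: the greedy longest-match tokenizer below mimics WordPiece's segmentation rule, and the tiny `VOCAB` is an illustrative stand-in for the real ~30 000-entry vocabulary. It also reproduces the instability noted above: removing one letter changes the whole subword sequence.

```python
# Sketch of the mixed input zeta(w): character sequence + WordPiece subwords.

VOCAB = {"miss", "##pel", "##ling", "mi", "##s", "##p", "##e", "##l", "##i", "##n", "##g"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first segmentation, as in WordPiece."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):             # longest candidate first
            cand = word[i:j] if i == 0 else "##" + word[i:j]
            if cand in vocab:
                pieces.append(cand)
                i = j
                break
        else:
            return ["[UNK]"]                           # no piece matched
    return pieces

def mixed_input(word):
    """Character sequence + subword sequence: the model's actual input."""
    return list(word) + wordpiece(word)

print(wordpiece("misspelling"))   # ['miss', '##pel', '##ling']
print(wordpiece("mispelling"))    # ['mi', '##s', '##pel', '##ling']
```

Note how the one-letter corruption completely changes the subword segmentation, while the character sequence part of the mixed input degrades only by one symbol.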
We shrink our vocabulary by stemming all words and keeping only the base form of each word, and by removing words with numerals. This decreases the size of vocabulary from 30 000 to 21 257 without degrading performance too much (Section A.1).

Encoder
Let us now design the function φ(·) mentioned in Section 3.1. We are looking for a function that can encode both local features and global features. Local features are character n-grams, which provide robustness against minor variations such as character swaps or omissions. Global features combine local features regardless of their distance. For the word misspelling, a pattern of prefix and suffix mis+ing can be obtained by combining the local information at the beginning and the end of the word. Conventional CNNs, RNNs, and self-attention cannot extract such local and global information at the same time. Therefore, we design a new Positional Attention Module. Suppose we have an aforementioned mixed input sequence and a corresponding embedding matrix V ∈ R^{|V|×d}, where d is the dimension of vectors. Then the input can be represented by a list of vectors: X = {x_1, x_2, ..., x_n} ∈ R^{n×d}, where n is the length of the input. To extract local information, we first adopt positional attention to obtain n-gram features, and then feed them into a conventional self-attention layer to combine them in a global way. This can be written as:

    F = SA(PA(X))    (4)

Here, SA is a standard multi-head self-attention and PA is a positional attention:

    PA(X) = softmax(P Pᵀ / √d) (X W^V)    (5)

Here, P ∈ R^{n×d} are the position embeddings, and W^V ∈ R^{d×d_V} are the corresponding parameters. More details about the encoder are in Appendix C.4.
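A numpy sketch of this module under the equations above, with a single head, random untrained weights, and toy dimensions (the real module is a trained multi-head layer):

```python
# Sketch of the Positional Attention Module: PA(X) = softmax(P P^T / sqrt(d)) (X W_V),
# followed by a standard self-attention SA over the PA output (F = SA(PA(X))).
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sinusoidal_positions(n, d):
    """Fixed sinusoidal position embeddings P of shape (n, d)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def positional_attention(X, P, W_V):
    scores = softmax(P @ P.T / np.sqrt(P.shape[1]))   # depends only on positions
    return scores @ (X @ W_V)                          # mixes nearby subunits

def self_attention(H, W_Q, W_K, W_O):
    Q, K = H @ W_Q, H @ W_K
    return softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ (H @ W_O)

n, d = 6, 8                        # sequence length, embedding dimension (toy)
X = rng.normal(size=(n, d))        # embedded char+subword input
P = sinusoidal_positions(n, d)
W_V, W_Q, W_K, W_O = (rng.normal(size=(d, d)) for _ in range(4))
F = self_attention(positional_attention(X, P, W_V), W_Q, W_K, W_O)   # Eq. 4
```

Because the PA scores come from P alone, each output row is a position-dependent mixture of its neighbors (an n-gram-like feature); the subsequent SA step then combines these features regardless of distance.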

Loss Function
In this section, we focus on the loss function L(·). Mimick-like models often adopt the mean squared error (MSE), which tries to give words with the same surface forms similar embeddings. However, the MSE only pulls positive word pairs closer, and does not push negative word pairs apart. Therefore, we use the contrastive loss instead (Equation 3). Wang and Isola (2020) found that the contrastive loss optimizes two key properties: Alignment and Uniformity. The Alignment describes the expected distance (closeness) between positive pairs:

    ℓ_align = E_{(x, x⁺) ~ p_pos} [ ||f(x) − f(x⁺)||² ]    (6)

Here, p_pos is the distribution of positive pairs and f(·) is the encoder. The Uniformity measures whether the learned representations are uniformly distributed on the hypersphere:

    ℓ_uniform = log E_{x, y ~ p_data} [ e^{−t ||f(x) − f(y)||²} ]    (7)

Here, p_data is the data distribution and t > 0 is a parameter. The two properties are consistent with our expected word representations: positive word pairs should be kept close and negative word pairs should be far from each other, finally scattered over the hypersphere.
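The two properties can be computed directly on a set of vectors. Below is a small plain-Python sketch with toy 2-dimensional unit vectors and t = 2; it illustrates that spread-out representations score better (lower) uniformity than collapsed ones.

```python
# Sketch of alignment (mean squared distance over positive pairs) and
# uniformity (log mean Gaussian potential over all pairs), per Wang & Isola (2020).
import itertools, math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(positive_pairs):
    return sum(sq_dist(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(vectors, t=2.0):
    pairs = list(itertools.combinations(vectors, 2))
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))

# Unit vectors spread over the circle vs. vectors collapsed onto one point.
spread    = [[1, 0], [0, 1], [-1, 0], [0, -1]]
collapsed = [[1, 0], [1, 0], [1, 0], [1, 0]]
assert uniformity(spread) < uniformity(collapsed)   # lower (better) when spread
```

Good alignment keeps a word and its corrupted variant close; good uniformity prevents all words from collapsing onto a few points of the hypersphere.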

Data Augmentation and Hard Negatives
Our positive word pairs are generated by data augmentation, which increases the number of training samples from existing data. We use various strategies (Figure 4) to increase the diversity of our training samples: (1) Swap two adjacent characters, (2) Drop a character, (3) Insert a new character, (4) Replace a character according to keyboard distance, (5) Replace the original word by a synonymous word. The first four augmentations were originally designed to protect against adversarial attacks (Pruthi et al., 2019). We add the synonym replacement strategy to keep semantically similar words close in the embedding space, something that cannot be achieved by the surface form alone. Specifically, a set of synonyms is obtained by retrieving the nearest neighbors from pre-trained embeddings like FastText. Negative word pairs are usually chosen randomly from the mini-batch. However, we train our model to be specifically resilient to hard negatives (or difficult negatives), i.e., words with similar surface forms but different meanings (e.g., misspelling and dispelling). To this end, we add a certain number of hard negative samples (currently 3 of them) to the mini-batch, by selecting word pairs that are not synonyms and have a small edit distance.
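The five augmentations and the hard-negative selection can be sketched as follows. The keyboard neighborhood and the synonym table are tiny illustrative stand-ins; in the paper, synonyms come from nearest neighbors in pre-trained embedding space.

```python
# Sketch of the five word-level augmentations and hard-negative mining by edit distance.
import random

random.seed(1)
KEYBOARD = {"s": "awedxz", "e": "wsdr", "l": "kop"}    # partial QWERTY neighborhoods
SYNONYMS = {"misspelling": ["typo"], "happy": ["glad"]}

def swap(w):    i = random.randrange(len(w) - 1); return w[:i] + w[i+1] + w[i] + w[i+2:]
def drop(w):    i = random.randrange(len(w));     return w[:i] + w[i+1:]
def insert(w):  i = random.randrange(len(w) + 1); return w[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + w[i:]

def keyboard(w):
    idxs = [i for i, c in enumerate(w) if c in KEYBOARD]
    if not idxs:
        return w
    i = random.choice(idxs)
    return w[:i] + random.choice(KEYBOARD[w[i]]) + w[i+1:]

def synonym(w):
    return random.choice(SYNONYMS.get(w, [w]))

def edit_distance(a, b):
    """Standard Levenshtein distance, row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hard_negatives(word, vocab, k=3):
    """Words with a similar surface form but different meaning (non-synonyms)."""
    candidates = [w for w in vocab if w != word and w not in SYNONYMS.get(word, [])]
    return sorted(candidates, key=lambda w: edit_distance(word, w))[:k]

print(hard_negatives("misspelling", ["dispelling", "spelling", "cat", "typo"]))
# ['dispelling', 'spelling', 'cat']  -- 'typo' is excluded as a synonym
```

Note how "dispelling" ranks first: it is only two edits away from "misspelling" yet semantically unrelated, exactly the kind of pair the contrastive loss must learn to push apart.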

Mimicking Dynamical Embeddings
Pre-trained Language Models (e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)) dynamically generate word representations based on specific contexts, which cannot be mimicked directly. To this end, we have two options: We can either learn the behavior of the input embeddings in BERT before the multi-layer attentions or mimic the static distilled embeddings (Bommasani et al., 2020; Gupta and Jaggi, 2021).
We use BERT as an example to explain these two methods. Suppose we have a subword sequence after applying WordPiece to a sentence: W = {w_1, w_2, ..., w_n}. For the subword sequence W, BERT first represents it as a list of subword embeddings: E_in = {e^sub_1, e^sub_2, ..., e^sub_n}. We refer to this static representation as the input embedding of BERT, and we can use our model to mimic the behavior of this part. We call this method mimicking input embeddings. For ease of implementation, we learn only from the words that are not separated into pieces. After that step, BERT applies a multi-layer multi-head attention to the input embeddings E_in, which yields a contextual representation for each subword: E_out = {e^out_1, e^out_2, ..., e^out_n}. However, these contextual representations vary with the input sentence, and we cannot learn from them directly. Instead, we choose to mimic the distilled static embeddings of BERT, which are obtained by pooling (max or average) the contextual embeddings of a word across different sentences. We call this method mimicking distilled embeddings. The latter allows for better word representations, while the former does not require training on a large-scale corpus. Our empirical studies show that mimicking distilled embeddings performs only marginally better. Therefore, we decided to rather learn the input embeddings of BERT, which is simple yet effective.
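At its core, mimicking distilled embeddings reduces to pooling a word's contextual vectors into one static target. A minimal average-pooling sketch, where the made-up `occurrences` stand in for BERT's per-occurrence contextual outputs:

```python
# Sketch of distilling a static embedding: pool (here: average) the contextual
# vectors a language model produces for the same word across many sentences.

def distill(contextual_vectors):
    """Average-pool a word's contextual embeddings into one static vector."""
    dim = len(contextual_vectors[0])
    n = len(contextual_vectors)
    return [sum(v[j] for v in contextual_vectors) / n for j in range(dim)]

# Hypothetical 2-dim contextual vectors for "bank" from three different sentences.
occurrences = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
static_bank = distill(occurrences)   # approximately [0.4, 0.6]
```

The resulting static vector is a fixed mimicking target, at the cost of running the language model over a large corpus to collect the occurrences in the first place.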

Plug and Play
One of the key advantages of our model is that it can be used as a plug-in for other models. For models with static word embeddings like FastText, one can simply use our model to generate vectors for unseen words. For models with dynamic word embeddings like BERT, if a single word is tokenized into several parts, e.g., misspelling = {miss, ##pel, ##ling}, we regard it as an OOV word. Then, we replace the embeddings of the subwords by a single embedding produced by our model before the attention layers. Our final enhanced BERT model has 768 dimensions and 16M parameters. Note that the BERT-base model has ~110M parameters and its distilled one has ~550M parameters.
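The plug-and-play replacement rule can be sketched as follows; `tokenize_word`, `love_vector`, and `bert_embedding` are illustrative stubs, not real model calls.

```python
# Sketch of the plug-and-play Replacement strategy for BERT-style models:
# a word that WordPiece splits into several subwords is treated as OOV, and its
# subword embeddings are replaced by ONE LOVE-generated vector before attention.

def tokenize_word(word):
    # Stand-in for WordPiece; pretend any word over 6 chars splits into pieces.
    return [word] if len(word) <= 6 else [word[:4], "##" + word[4:]]

def love_vector(word):
    return f"LOVE({word})"          # stand-in for the imputed embedding

def bert_embedding(piece):
    return f"BERT({piece})"         # stand-in for BERT's input embedding lookup

def input_embeddings(sentence):
    embeddings = []
    for word in sentence.split():
        pieces = tokenize_word(word)
        if len(pieces) > 1:                       # OOV word: one LOVE vector
            embeddings.append(love_vector(word))
        else:                                     # in-vocabulary: keep BERT's
            embeddings.append(bert_embedding(pieces[0]))
    return embeddings

print(input_embeddings("the misspelling stands"))
# ['BERT(the)', 'LOVE(misspelling)', 'BERT(stands)']
```

Only the embedding lookup changes; the attention layers above it are untouched, which is why the plug-in does not hamper the model's original capabilities.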

Evaluation Datasets
There are two main methods to evaluate word representations: intrinsic and extrinsic. Intrinsic evaluations measure syntactic or semantic relationships between words directly, e.g., word similarity and word clustering. Extrinsic evaluations measure the performance of word embeddings as input features to a downstream task, e.g., named entity recognition (NER) and text classification. Several studies have shown that there is no consistent correlation between intrinsic and extrinsic evaluation results (Chiu et al., 2016; Faruqui et al., 2016; Wang et al., 2019). Hence, we evaluate our representations by both intrinsic and extrinsic metrics. Specifically, we use 8 intrinsic datasets (6 word similarity and 2 word cluster datasets) and 4 extrinsic datasets: two text classification datasets, SST2 (Socher et al., 2013) and MR (Pang and Lee, 2005), and two NER datasets, CoNLL-03 (Sang and De Meulder, 2003) and BC2GM (Smith et al., 2008). It is worth noting that the RareWord dataset contains many long-tail words and that BC2GM is a domain-specific NER dataset. All data augmentations and typo simulations are implemented with NLPAUG (https://github.com/makcedward/nlpaug). Appendix B provides more details on our datasets and experimental settings.

Intrinsic Tasks

Table 2 shows the experimental results on 8 intrinsic tasks. Compared to other mimick-like models, our model achieves the best average score across the 8 datasets while using the smallest number of parameters. Specifically, our model performs best on 5 word similarity tasks, and second-best on the word cluster tasks. Although there is a gap between our model and the original FastText, we find our performance acceptable, given that our model is 100× smaller.

Extrinsic Tasks

Table 3 shows the results on four downstream datasets and their corrupted versions. In this experiment, we introduce another non-trivial baseline: Edit Distance. For each corrupted word, we find the most similar word in a vocabulary using edit distance and then use the pre-trained vector of the retrieved word. Considering the time cost, we use only the first 20K words appearing in FastText (2M words) as reference vocabulary. The typo words are generated by simulating post-OCR errors.
On the original datasets, our model obtains the best results on 2 datasets and the second-best on the NER datasets, compared to other mimick-like models. On the corrupted datasets, the performance of the original FastText model decreases considerably; our model is consistently the second best, with scores very close to BoS. Compared to other mimick-like models, our model, with 6.5M parameters, achieves the best average score. Although Edit Distance can effectively restore the original meaning of a word, it is 400× more time-consuming than our model.

Robustness Evaluation
In this experiment, we evaluate the robustness of our model by gradually adding simulated post-OCR typos (Ma, 2019). Table 4 shows the performances on the SST2 and CoNLL-03 datasets. We observe that our model can improve the robustness of the original embeddings without degrading their performance. Moreover, we find that our model makes FastText more robust compared to other commonly used methods against unseen words: a generic UNK token or a character-level neural network representation. Figure 5 shows the robustness of the different methods as the proportion of typos increases.

Ablation Study
We now vary the components in our architecture (input method, encoder and loss function) to demonstrate the effectiveness of our architecture.
Input Method. To validate the effect of our Mixed Input strategy, we compare it with two other methods: using only the character sequence or only the subword sequence. Table 5 shows that the Mixed method achieves better representations, and any removal of char or subword information can decrease the performance.
Encoder. To encode the input sequence, we developed the Positional Attention Module (PAM), which first extracts n-gram-like local features and then uses self-attention to combine them without distance restrictions. Compared with conventional encoders (CNNs, RNNs, and plain self-attention), PAM yields better representations, and its parameter cost is acceptable in comparison. We visualize the attention weights of PAM in Appendix C.4, to show how the encoder extracts local and global morphological features of a word.
Loss Function. LOVE uses the contrastive loss, which increases alignment and uniformity. Wang and Isola (2020) prove that directly optimizing these two metrics leads to comparable or better performance than the original contrastive loss. Such a loss function can be written as:

    L = ℓ_align + λ ℓ_uniform    (8)

Here, λ is a hyperparameter that controls the impact of ℓ_uniform. We set this value to 1.0 because it achieves the best average score on RareWord and SST2. An alternative is to use the Mean Squared Error (MSE), as in mimick-like models. We observe a slight improvement by directly using alignment and uniformity. We also tried various temperatures τ for the contrastive loss; the results are shown in Table A3 in the appendix. In the end, a value of τ = 0.07 provides good performance.
Data Augmentation and Hard Negatives. In Table 5, we observe that the removal of our hard negatives decreases the performance, which demonstrates the importance of semantically different words with similar surface forms.
LOVE uses five types of word augmentation. We find that removing these augmentations does not deteriorate performance much on the word similarity task, while it causes a 0.4 point drop in the text classification task (the last row in Table 5), where data augmentations prove helpful in dealing with misspellings. We further analyze the performance of single and composite augmentations on RareWord and SST2 in the appendix (Figure A3 and Figure A4). We find that a combination of all five types yields the best results.

The Performance of Mimicking BERT
As described in Section 4.5, we can mimic the input or distilled embeddings of BERT. After learning from BERT, we use the vectors generated by LOVE to replace the embeddings of OOV subwords. Finally, these new representations are fed into the multi-layer attention. We call this method the Replacement strategy. To validate its effectiveness, we compare it with two other baselines: (1) Linear Combination (Fukuda et al., 2020). For each subword, the generated vector of the word containing the subword is added to the subword vector of BERT:

    e_new = e_sub + W ⊙ e_word    (9)

where e_sub ∈ R^d is a subword vector of BERT, e_word ∈ R^d is a generated vector of our model, W ∈ R^d are trainable parameters, and ⊙ denotes element-wise multiplication.
(2) Add. A generated word vector is directly added to the corresponding subword vector of BERT:

    e_new = e_sub + e_word    (10)

Table 6 shows the results of these strategies. First, all of them bring a certain degree of robustness to BERT without decreasing its original capability, which demonstrates the effectiveness of our framework. Second, the Replacement strategy consistently performs best. We conjecture that BERT cannot restore a reasonable meaning for rare and misspelled words that are tokenized into subwords, whereas our generated vectors lie near the original word in the embedding space. Third, we find that mimicking distilled embeddings performs best, while mimicking input embeddings comes close. Considering that the first method needs training on large-scale data, mimicking the input embeddings is our method of choice.
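The three strategies differ only in how e_sub and e_word are merged. A toy sketch with 3-dimensional vectors, where `W` plays the role of the trainable element-wise parameters mentioned above (element-wise multiplication is an assumption of this sketch):

```python
# Sketch of the three ways to combine a LOVE vector with a BERT subword vector:
# Add, Linear Combination (element-wise trainable gate W), and Replacement.

def add(e_sub, e_word):
    return [s + w for s, w in zip(e_sub, e_word)]

def linear_combination(e_sub, e_word, W):
    # e_new = e_sub + W * e_word, with W applied element-wise
    return [s + g * w for s, w, g in zip(e_sub, e_word, W)]

def replacement(e_sub, e_word):
    return list(e_word)             # discard the subword vector entirely

e_sub, e_word, W = [1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [0.0, 1.0, 2.0]
print(add(e_sub, e_word))                    # [1.5, 2.5, 3.5]
print(linear_combination(e_sub, e_word, W))  # [1.0, 2.5, 4.0]
print(replacement(e_sub, e_word))            # [0.5, 0.5, 0.5]
```

Replacement is the most aggressive option: it trusts the LOVE vector completely for OOV words, which matches the conjecture above that BERT's own subword pieces carry little usable meaning for such words.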

Conclusion
We have presented a lightweight contrastive-learning framework, LOVE, to learn word representations that are robust even in the face of out-of-vocabulary words. Through a series of empirical studies, we have shown that our model (with only 6.5M parameters) can achieve similar or even better word embeddings on both intrinsic and extrinsic evaluations compared to other mimick-like models. Moreover, our model can be added to models with static embeddings (such as FastText) or dynamical embeddings (such as BERT) in a plug-and-play fashion, and bring significant improvements there. For future work, we aim to extend our model to languages other than English.

Figure A1: An illustration of our Mixed input for the word misspell.

A Details of Our Approach
A.1 Shrinking Our Model

We consider the following four methods to reduce the total parameters of our model: (1) Matrix Decomposition. The original matrix can be decomposed into two smaller matrices V = U × M, with U ∈ R^{|V|×h}, M ∈ R^{h×m}, and h < m.
Here, we set m = 300 and h = 200 respectively.
(2) Top Subword. We use only the top-k frequent subwords, using the word frequencies from a corpus. We set the parameter k = 20000.
(3) Hashing. We use a hashing strategy to share memory for subwords aiming to reduce the parameters. We use a bucket size of 20000.
(4) Preprocessing. The original vocabulary contains plurals and conjugations; therefore, we stem all complete words and remove words with numerals, obtaining a new vocabulary of 21 257 words. Table A1 shows that the preprocessing method reduces parameters very effectively while obtaining very competitive performance.

Figure A3: Performances of different augmentations on RareWord, measured as Spearman's ρ. Diagonal entries correspond to individual augmentations and off-diagonal entries correspond to composite augmentations.

B.3 Intrinsic and Extrinsic Evaluations
We choose the setting discussed in Section 4 to train our model for 20 epochs, and evaluate each intrinsic task based on the vectors that the models produce. As for the extrinsic tasks, we feed word vectors into each neural network and fix them during training. We use CNNs for text classification (Zhang and Wallace, 2015) and BiLSTM+CRF for NER (Huang et al., 2015). We compare different embeddings on both intrinsic and extrinsic datasets by using generated vectors. For the word cluster tasks, the produced vectors are clustered by K-means and then measured by purity. The hyper-parameters of the extrinsic tasks are shown in Table A2. For each dataset, our model is trained with five learning rates {5e−3, 3e−3, 1e−3, 8e−4, 5e−4}. We select the best one on the development set to report its score on the test set.

Figure A4: Performances of different augmentations on SST2, measured as accuracy. Diagonal entries correspond to individual augmentations and off-diagonal entries correspond to composite augmentations.
To generate a corrupted dataset, we simulate post-OCR errors. We adopt the augmentation tool developed by Ma (2019) to corrupt 70% of the original words. To check the robustness of BERT, we directly fine-tune a BERT-base model using Huggingface (Wolf et al., 2020). During fine-tuning, the batch size is 16 and we train for 5 epochs. We select the best model among five learning rates {9e−5, 7e−5, 5e−5, 3e−5, 1e−5} on the development set and report the score of the model on the test set.

Intrinsic Datasets. The word similarity datasets include Sim (Agirre et al., 2009) and Simverb (Agirre et al., 2009). The task is scored by Spearman's ρ, which computes the correlation between the gold similarity and the similarity obtained from generated vectors. For the word cluster task, we use (1) AP (Almuhareb, 2006) and (2) BLESS (Baroni and Lenci, 2011). The generated word vectors are first clustered by K-means (MacQueen et al., 1967) and then scored by cluster purity.
Extrinsic Datasets. We use both sentence-level and token-level downstream datasets to evaluate the quality of word representations. For the sentence level, we use SST2 (Socher et al., 2013) and MR (Pang and Lee, 2005), and the metric is accuracy. For the token level, we use two NER datasets: general CoNLL-03 (Sang and De Meulder, 2003) and biomedical BC2GM (Smith et al., 2008). The metric is the entity-level F1 score. As before, we select the best model among five different learning rates {5e − 3, 3e − 3, 1e − 3, 8e − 4, 5e − 4} on the development set and then report the model score on the test set.
C Additional Analyses

C.1 Qualitative Analysis
To better understand the clusterings produced by LOVE, we chose 15 words from the AP dataset (Almuhareb, 2006), covering three topics (Chemical Substance, Illness, and Occupation). We added 3 corrupted words: oxgen, archiitect, and leukamia. Figure A2 shows how LOVE, BoS, and KVQ-FH cluster these words (using a PCA projection and K-means). All approaches space out the clusters to some degree. In particular, BoS and KVQ-FH have trouble separating professions and chemical substances. For the corrupted words, only LOVE is able to embed them close enough to their original form, so that they appear in the correct cluster.

C.2 Effect of Data Augmentation

Figure A4 shows the performance of the five augmentation strategies on the text classification task SST2. We observe that synonym replacement is the most effective method. The first four methods have a weaker effect, but keyboard replacement brings a certain degree of improvement. The results on RareWord are similar (Figure A3).

C.3 Effect of τ in Contrastive Loss
As discussed in Chen et al. (2020), a proper temperature can yield better representations in the visual domain, because τ is able to weigh the negatives by their relative hardness. As shown in Table A3, we try different values of the temperature and find that there is no consistent τ that makes the model work well on both intrinsic and extrinsic datasets. Hence, we choose the best performer on average, i.e., τ = 0.07.

C.4 Visualization of Encoder
As mentioned before, we combine two types of attention heads (self-attention and positional attention) to encode a subword sequence. Here, we visualize the attention weights on each side and show how they work. Figure A5 shows the position-dependent weights. We use sinusoidal functions to generate positional embeddings, and the weights are the dot product between these embeddings. We observe that the positional weights attend to the left and right subwords in addition to the subword itself, which yields trigram representations. Figure A6 shows the self-attention weights, which are computed from the trigram subwords of the positional attention. Hence, each subword in this figure is a trigram representation instead of a single subword representation. As we see, self-attention can capture global features regardless of distance. We take the first token [CLS] as an example: the self-attention assigns high weights to the token e and [SEP], which constructs a representation like [CLS]b + me[SUB] + ##me [SEP]. This segment tells us that this word starts with b and ends with me.