LexFit: Lexical Fine-Tuning of Pretrained Language Models

Transformer-based language models (LMs) pretrained on large text collections implicitly store a wealth of lexical semantic knowledge, but it is non-trivial to extract that knowledge effectively from their parameters. Inspired by prior work on semantic specialization of static word embedding (WE) models, we show that it is possible to expose and enrich lexical knowledge from the LMs, that is, to specialize them to serve as effective and universal “decontextualized” word encoders even when fed input words “in isolation” (i.e., without any context). Their transformation into such word encoders is achieved through a simple and efficient lexical fine-tuning procedure (termed LexFit) based on dual-encoder network structures. Further, we show that LexFit can yield effective word encoders even with limited lexical supervision and, via cross-lingual transfer, in different languages without any readily available external knowledge. Our evaluation over four established, structurally different lexical-level tasks in 8 languages indicates the superiority of LexFit-based WEs over standard static WEs (e.g., fastText) and WEs from vanilla LMs. Other extensive experiments and ablation studies further profile the LexFit framework, and indicate best practices and performance variations across LexFit variants, languages, and lexical tasks, also directly questioning the usefulness of traditional WE models in the era of large neural models.


Figure 1: Illustration of the full pipeline for obtaining decontextualized word representations, based on lexically fine-tuning pretrained LMs via dual-encoder networks (Step 1, §2.1), and then extracting the representations from their (fine-tuned) layers (Step 2, §2.2).
Both static and contextualized WEs ultimately learn solely from the distributional word co-occurrence signal. This source of signal is known to lead to distortions in the induced representations by conflating meaning based on topical relatedness rather than authentic semantic similarity (Hill et al., 2015; Schwartz et al., 2015). This also creates a ripple effect on downstream applications, where model performance may suffer (Faruqui, 2016; Lauscher et al., 2020). Our work takes inspiration from the methods, originally devised for static WEs, that correct these distortions and complement the distributional signal with structured information. In particular, the process known as semantic specialization (or retrofitting) injects information about lexical relations from databases like WordNet (Beckwith et al., 1991) or the Paraphrase Database (Ganitkevitch et al., 2013) into WEs, thus accentuating relationships of pure semantic similarity in the refined representations (Faruqui et al., 2015; Ponti et al., 2019, inter alia).
Our goal is to create representations that take advantage of both 1) the expressivity and lexical knowledge already stored in pretrained language models (LMs) and 2) the precision of lexical fine-tuning. To this effect, we develop LEXFIT, a versatile lexical fine-tuning framework, illustrated in Figure 1, drawing a parallel with universal sentence encoders like SentenceBERT (Reimers and Gurevych, 2019); the two approaches are connected as they are both trained via contrastive learning on dual-encoder architectures, but they provide representations for a different granularity of meaning. Our working hypothesis, extensively evaluated in this paper, is as follows: pretrained encoders store a wealth of lexical knowledge, but it is not straightforward to extract that knowledge. We can expose this knowledge by rewiring their parameters through lexical fine-tuning, and turn the LMs into universal (decontextualized) word encoders.
Compared to prior attempts at injecting lexical knowledge into large LMs (Lauscher et al., 2020), our LEXFIT method is innovative as it is deployed post-hoc on top of already pretrained LMs, rather than requiring joint multi-task training. Moreover, LEXFIT is: 1) more efficient, as it does not incur the overhead of masked language modeling pretraining; and 2) more versatile, as it can be ported to any model independently from its architecture or original training objective. Finally, our results demonstrate the usefulness of LEXFIT: we report large gains over WEs extracted from vanilla LMs and over traditional WE models across 8 languages and 4 lexical tasks, even with very limited and noisy external lexical knowledge, validating the rewiring hypothesis. The code is available at: https://github.com/cambridgeltl/lexfit.

From Language Models to (Decontextualized) Word Encoders
The motivation for this work largely stems from recent work on probing and analyzing pretrained language models for the various types of knowledge they might implicitly store (e.g., syntax, world knowledge) (Rogers et al., 2020). Here, we focus on their lexical semantic knowledge (Liu et al., 2021), with the aim of extracting high-quality static word embeddings from the parameters of the input LMs. In what follows, we describe lexical fine-tuning via dual-encoder networks (§2.1), followed by the WE extraction process from the fine-tuned layers of pretrained LMs (§2.2); see Figure 1.

LEXFIT: Methodology
Our hypothesis is that pretrained LMs can be turned into effective static decontextualized word encoders via additional inexpensive lexical fine-tuning (i.e., LEXFIT-ing) on lexical pairs from an external resource. In other words, they can be specialized to encode lexical knowledge useful for downstream tasks, e.g., lexical semantic similarity (Wieting et al., 2015; Ponti et al., 2018). Let $P = \{(w, v, r)_m\}_{m=1}^{M}$ refer to the set of $M$ external lexical constraints. Each item $p \in P$ comprises a pair of words $w$ and $v$, and denotes a semantic relation $r$ that holds between them (e.g., synonymy, antonymy). Further, let $P_r$ denote the subset of $P$ where a particular relation $r$ holds for each item, e.g., $P_{syn}$ is the set of synonymy pairs. Finally, for each positive tuple $(w, v, r)$, we construct $2k$ negative "no-relation" examples by randomly pairing $w$ with $k$ other words $w_{\neg,1}, \ldots, w_{\neg,k}$, and pairing $v$ with $k$ other words $v_{\neg,1}, \ldots, v_{\neg,k}$, ensuring that these negative pairs do not occur in $P$. We refer to the full set of negative pairs as $N_P$. Lexical fine-tuning then leverages $P$ and $N_P$: we propose to tune the underlying LMs (e.g., BERT, mBERT), using external lexical knowledge, via different loss functions, relying on dual-encoder networks with shared LM weights and mean pooling, as illustrated in Figure 1. We now briefly describe several loss functions, evaluated later in §4.
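To make the data construction concrete, below is a minimal Python sketch, our own illustration rather than the authors' released code, of building the negative set $N_P$ from a toy list of synonymy pairs; the names `synonym_pairs` and `build_negatives` are ours.

```python
import random

def build_negatives(positive_pairs, k=1, seed=42):
    """For each positive (w, v), sample k random partners for w and k for v,
    skipping self-pairings and any pair already attested in P."""
    rng = random.Random(seed)
    attested = set(positive_pairs) | {(b, a) for a, b in positive_pairs}
    vocab = sorted({word for pair in positive_pairs for word in pair})
    negatives = []
    for w, v in positive_pairs:
        for anchor in (w, v):
            for _ in range(k):
                # rejection-sample a "no-relation" partner for the anchor word
                cand = rng.choice(vocab)
                while cand == anchor or (anchor, cand) in attested:
                    cand = rng.choice(vocab)
                negatives.append((anchor, cand))
    return negatives  # |N_P| = 2k * |P_syn|

# Toy usage:
synonym_pairs = [("sofa", "couch"), ("quick", "fast"), ("car", "automobile")]
neg_pairs = build_negatives(synonym_pairs, k=1)
```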
Classification Loss. Similar to prior work on sentence-level text inputs (Reimers and Gurevych, 2019), for each input word pair $(w, v)$ we concatenate their $d$-dimensional encodings $\mathbf{w}$ and $\mathbf{v}$ (obtained after passing them through BERT and after pooling, see Figure 1) with their element-wise difference $|\mathbf{w} - \mathbf{v}|$. The objective is then:

$$\hat{y} = \mathrm{softmax}\big( (\mathbf{w} \oplus \mathbf{v} \oplus |\mathbf{w} - \mathbf{v}|)\, W \big), \qquad (1)$$

where $\oplus$ denotes concatenation, and $W \in \mathbb{R}^{3d \times c}$ is the trainable weight matrix of the softmax classifier, with $c$ the number of classification classes. We experiment with two variants of this objective, termed SOFTMAX henceforth: in the simpler binary variant ($c = 2$), the goal is to distinguish between positive synonymy pairs (the subset $P_{syn}$) and the corresponding set of $2k \times |P_{syn}|$ no-relation negative pairs. In the ternary variant ($c = 3$), the classifier must distinguish between synonyms ($P_{syn}$), antonyms ($P_{ant}$), and no-relation negatives. The classifiers are optimized via standard cross-entropy.
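A hedged PyTorch sketch of this objective follows; the tensors `w_enc` and `v_enc` stand for the pooled dual-encoder outputs and are assumed inputs, and the helper name `softmax_loss` is ours.

```python
import torch
import torch.nn as nn

d, c = 768, 3                                # BERT Base dim; ternary variant
W = nn.Linear(3 * d, c, bias=False)          # trainable matrix W in R^{3d x c}
loss_fn = nn.CrossEntropyLoss()              # standard cross-entropy

def softmax_loss(w_enc, v_enc, labels):
    """Eq. (1): classify the concatenation (w, v, |w - v|).
    labels: 0 = no relation, 1 = synonymy, 2 = antonymy (ternary variant)."""
    features = torch.cat([w_enc, v_enc, torch.abs(w_enc - v_enc)], dim=-1)
    logits = W(features)                     # softmax is applied inside the loss
    return loss_fn(logits, labels)
```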
Ranking Loss. The multiple negatives ranking loss (MNEG) is inspired by prior work on learning universal sentence encoders (Henderson et al., 2019, 2020); the aim of the loss, now adapted to word-level inputs, is to rank true synonymy pairs from $P_{syn}$ above randomly paired words. The similarity between any two words is quantified via a similarity function $S$ operating on their encodings. In this work we use the scaled cosine similarity following Henderson et al. (2019): $S(\mathbf{w}_i, \mathbf{v}_j) = C \cdot \cos(\mathbf{w}_i, \mathbf{v}_j)$, where $C$ is the scaling constant. Lexical fine-tuning with MNEG then proceeds in batches of $B$ pairs $(w_1, v_1), \ldots, (w_B, v_B)$ from $P_{syn}$, with the MNEG loss for a single batch computed as follows:

$$\mathcal{L}_{MNEG} = -\sum_{i=1}^{B} \Big( S(\mathbf{w}_i, \mathbf{v}_i) - \log \sum_{j=1}^{B} e^{S(\mathbf{w}_i, \mathbf{v}_j)} \Big). \qquad (2)$$

Effectively, for each batch Eq. (2) maximizes the similarity score of the positive pairs $(w_i, v_i)$, and minimizes the score of $B - 1$ random pairs. For simplicity, as negatives we use all pairings of $w_i$ with the $v_j$-s in the current batch for which $(w_i, v_j) \notin P_{syn}$ (Henderson et al., 2019).
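The following illustrative PyTorch snippet, with names of our own choosing, implements this in-batch formulation; applying cross-entropy to the scaled similarity matrix with diagonal targets is mathematically equivalent to Eq. (2).

```python
import torch
import torch.nn.functional as F

def mneg_loss(w_enc, v_enc, C=10.0):
    """MNEG loss of Eq. (2) for B positive pairs; C is the scaling constant."""
    w_n = F.normalize(w_enc, dim=-1)
    v_n = F.normalize(v_enc, dim=-1)
    sim = C * (w_n @ v_n.t())                # (B, B) scaled cosine similarities
    # Diagonal entries are the true pairs (w_i, v_i); all off-diagonal
    # pairings (w_i, v_j), i != j, act as in-batch random negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    # cross_entropy(sim, labels) = mean_i [ -S(w_i, v_i) + logsumexp_j S(w_i, v_j) ]
    return F.cross_entropy(sim, labels)
```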
Multi-Similarity Loss. We also experiment with the recently proposed state-of-the-art multi-similarity loss of Wang et al. (2019), labeled MSIM. The aim is again to rank positive examples from $P_{syn}$ above the corresponding $2k$ no-relation negatives from $N_P$. Again using the scaled cosine similarity scores, the adapted MSIM loss per batch of $B$ positive pairs $(w_i, v_i)$ from $P_{syn}$ is defined as follows:

$$\mathcal{L}_{MSIM} = \frac{1}{B} \sum_{i=1}^{B} \bigg[ \frac{1}{C} \log \Big( 1 + e^{-C (S(\mathbf{w}_i, \mathbf{v}_i) - \epsilon)} \Big) + \frac{1}{C} \log \Big( 1 + \sum_{k'=1}^{k} e^{C (S(\mathbf{w}_i, \mathbf{w}_{i,\neg,k'}) - \epsilon)} \Big) \bigg]. \qquad (3)$$

For brevity, in Eq. (3) we only show the formulation with the $k$ negatives associated with $w_i$; the complete loss function contains another term covering the $k$ negatives $v_{i,\neg,k}$ associated with each $v_i$. $C$ is again the scaling constant, and $\epsilon$ is the offset applied on the similarity matrix. MSIM can be seen as an extended variant of the MNEG ranking loss.
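For concreteness, here is a sketch of the $w_i$ half of Eq. (3) in PyTorch, under the assumption that the $k$ negatives for each $w_i$ are supplied as a `(B, k, d)` tensor; the constants `C` and `eps` mirror the scaling constant and offset above, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def msim_loss_half(w_enc, v_enc, w_neg_enc, C=10.0, eps=0.5):
    """One half of Eq. (3); the symmetric term for the v_i negatives
    is analogous. Shapes: w_enc, v_enc (B, d); w_neg_enc (B, k, d)."""
    cos = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    pos_sim = cos(w_enc, v_enc)                           # (B,)
    neg_sim = cos(w_enc.unsqueeze(1), w_neg_enc)          # (B, k)
    pos_term = torch.log1p(torch.exp(-C * (pos_sim - eps))) / C
    neg_term = torch.log1p(torch.exp(C * (neg_sim - eps)).sum(dim=1)) / C
    return (pos_term + neg_term).mean()
```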
Finally, for any input word w, we extract its word vector via the approach outlined in §2.2; exactly the same approach can be applied to the original LMs (e.g., BERT) or their lexically fine-tuned variants ("LEXFIT-ed" BERT), see Figure 1.

Extracting Static Word Representations
The extraction of static type-level vectors from any underlying Transformer-based LM, both before and after LEXFIT fine-tuning, is guided by best practices from recent comparative analyses and probing work (Bommasani et al., 2020). Starting from an underlying LM with $N$ Transformer layers $\{L_1$ (bottom layer)$, \ldots, L_N$ (top)$\}$ and referring to the embedding layer as $L_0$, we extract a decontextualized word vector for an input word $w$, fed into the LM "in isolation" without any surrounding context, following prior work: the word is segmented into its subwords and encoded by the LM, and the final representation is constructed as the average over the subword encodings, further averaged over $n \leq N$ layers (i.e., all layers up to and including layer $L_n$, denoted as AVG($\leq n$)). Prior work has further empirically verified that: (a) discarding the final encodings of [CLS] and [SEP] produces better type-level vectors, a heuristic we follow in this work; and (b) excluding higher layers from the average may also result in stronger vectors with improved performance in lexical tasks.
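A minimal sketch of this ISO extraction with the HuggingFace transformers library is given below; whether the embedding layer $L_0$ participates in AVG($\leq n$) is our assumption here, and `bert-base-uncased` is used purely as a stand-in for any of the monolingual BERTs.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def iso_vector(word, n=12):
    """Decontextualized vector for `word` fed in isolation: drop [CLS]/[SEP],
    average subword encodings, then average layers up to L_n (AVG(<= n))."""
    inputs = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states   # tuple of N+1 tensors (1, T, d)
    # hidden[0] is the embedding layer L0; counting it in the average
    # is an assumption of this sketch.
    layers = torch.stack(hidden[: n + 1])        # (n+1, 1, T, d)
    vec = layers.mean(dim=0)[0]                  # (T, d), averaged over layers
    return vec[1:-1].mean(dim=0)                 # drop [CLS]/[SEP], avg subwords

v = iso_vector("sleeping", n=12)
```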
This approach operates fully "in isolation" (ISO): we extract vectors of words without any surrounding context. The ISO approach is lightweight: 1) it does not require any external text corpora; 2) it encodes words efficiently due to the absence of context. Moreover, it allows us to directly study the richness of the lexical information stored in the LM's parameters, and to combine it with lexical knowledge from external resources (e.g., WordNet).

Experimental Setup
Languages and Language Models. Our language selection for evaluation is guided by the following (partially clashing) constraints: a) availability of comparable pretrained monolingual LMs; b) task and evaluation data availability; and c) ensuring some typological diversity of the selection. The final test languages are English (EN), German (DE), Spanish (ES), Finnish (FI), Italian (IT), Polish (PL), Russian (RU), and Turkish (TR). For comparability across languages, we use monolingual uncased BERT Base models for all languages (N = 12 Transformer layers, 12 attention heads, hidden layer dimensionality 768), available (see the appendix) via the HuggingFace repository (Wolf et al., 2020).
External Lexical Knowledge. We use the standard collection of EN lexical constraints from previous work on (static) word vector specialization (Zhang et al., 2014; Ono et al., 2015; Ponti et al., 2018, 2019). It covers the lexical relations from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009); it comprises 1,023,082 synonymy ($P_{syn}$) word pairs and 380,873 antonymy ($P_{ant}$) pairs. For all other languages, we rely on non-curated noisy lexical constraints, obtained via the automatic word translation method of Ponti et al. (2019); see the original work for the details of the translation procedure.
LEXFIT: Technical Details. The implementation is based on the SBERT framework (Reimers and Gurevych, 2019), using the suggested settings: AdamW (Loshchilov and Hutter, 2018), a learning rate of 2e-5, and a weight decay rate of 0.01; we run LEXFIT for 2 epochs. The batch size is 512 with MNEG, and 256 with SOFTMAX and MSIM, where one batch always balances between B positive examples and 2k · B negatives (see §2.1).
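For orientation, a sketch of this setup with the sentence-transformers package is shown below; it uses the library's built-in MultipleNegativesRankingLoss as a stand-in for the MNEG objective, the `synonym_pairs` list is an assumed input (see the earlier sketch), and the details may differ from the authors' exact configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Dual encoder with shared BERT weights and mean pooling (Figure 1, Step 1).
word_model = models.Transformer("bert-base-uncased", max_seq_length=16)
pooling = models.Pooling(word_model.get_word_embedding_dimension())  # mean pooling
model = SentenceTransformer(modules=[word_model, pooling])

# Positive synonymy pairs from P_syn; in-batch negatives are created by the loss.
train_examples = [InputExample(texts=[w, v]) for w, v in synonym_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=512)
loss = losses.MultipleNegativesRankingLoss(model)   # MNEG-style ranking loss

model.fit(train_objectives=[(loader, loss)], epochs=2,
          optimizer_params={"lr": 2e-5}, weight_decay=0.01)
```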
Word Vocabularies and Baselines. We extract decontextualized type-level WEs in each language both from the original BERTs (termed BERT-REG) and from the LEXFIT-ed BERT models, for exactly the same vocabulary. For the baseline BERT-REG WEs, we report two variants: (a) all performs layerwise averaging over all Transformer layers (i.e., AVG(≤ 12)); (b) best reports the peak score when potentially excluding the highest layers from the layer averaging (i.e., AVG(≤ n), n ≤ 12; see §2.2). Following prior work, the vocabularies cover the top 100K most frequent words represented in the respective fastText (FT) vectors, trained on lowercased monolingual Wikipedias by Bojanowski et al. (2017); note that the LEXFIT procedure itself does not depend on the chosen vocabulary, as it operates only on the lexical items found in the external constraints (i.e., the set P). The equivalent vocabulary coverage allows for a direct comparison of all WEs regardless of the induction/extraction method; this also includes the FT vectors, used as baseline "traditional" static WEs (termed FASTTEXT.WIKI) in all evaluation tasks.
Evaluation Tasks. We evaluate on the following standard and diverse lexical semantic tasks.

Task 1: Lexical Semantic Similarity (LSIM) is an established intrinsic task for evaluating static WEs (Hill et al., 2015). We use the recent comprehensive multilingual LSIM benchmark Multi-SimLex, which comprises 1,888 pairs in 13 languages, for our EN, ES, FI, PL, and RU LSIM evaluation. We also evaluate on a verb-focused EN LSIM benchmark, SimVerb-3500 (SV) (Gerz et al., 2016), covering 3,500 verb pairs, and on SimLex-999 (SL) for DE and IT (999 pairs) (Leviant and Reichart, 2015). The evaluation metric is Spearman's rank correlation between the average of the human LSIM scores for word pairs and the cosine similarity between their respective WEs; a minimal illustration of this metric is sketched after the task descriptions below.

Task 2: Bilingual Lexicon Induction (BLI), a standard task to assess the "semantic quality" of static cross-lingual word embeddings (CLWEs), enables investigations of the alignability of monolingual type-level WEs in different languages before and after the LEXFIT procedure. We learn CLWEs from the monolingual WEs obtained with all WE methods, using the established and supervision-lenient mapping-based approach (Mikolov et al., 2013a; Smith et al., 2017) with the VECMAP framework (Artetxe et al., 2018). We run the main BLI evaluations for 10 language pairs spanning EN, DE, RU, FI, and TR, adopting a standard BLI setup and data: 5K training word pairs are used to learn the mapping, and another 2K pairs serve as test data; the evaluation metric is the standard Mean Reciprocal Rank (MRR). For EN-ES, we run experiments on MUSE data (Conneau et al., 2018).

Task 3: Lexical Relation Prediction (RELP). We assess the usefulness of the lexical knowledge in WEs for learning relation classifiers for standard lexical relations (i.e., synonymy, antonymy, hypernymy, meronymy, plus no relation) via a state-of-the-art neural RELP model which learns solely from input type-level WEs. We use WordNet-based evaluation data for EN, DE, and ES, containing 10K annotated word pairs per language (8K for training, 2K for test), balanced by class across the splits; we extract equivalent evaluation data for two more languages: FI and IT. We report micro-averaged F1 scores, averaged across 5 runs for each input WE space; the default RELP model setting is used. In RELP and LSIM, we remove all training and test examples also present in the $P_{syn}$ and $P_{ant}$ sets to avoid any evaluation data leakage.

Task 4: Lexical Simplification (LexSIMP) aims to automatically replace complex words (i.e., specialized terms, less frequent words) with their simpler in-context synonyms, while retaining grammaticality and conveying the same meaning as the more complex input text (Paetzold and Specia, 2017). Discerning between semantic similarity (e.g., the synonymy injected via LEXFIT) and broader relatedness is therefore critical for LexSIMP. We adopt the standard LexSIMP evaluation protocol used in prior research on static WEs (Ponti et al., 2018, 2019): 1) we use Light-LS (Glavaš and Štajner, 2015), a language-agnostic LexSIMP tool that makes simplifications in an unsupervised way, based solely on word similarity in an input (static) WE space; 2) we rely on standard LexSIMP benchmarks, available for EN (Horn et al., 2014), IT (Tonelli et al., 2016), and ES (Saggion, 2017); and 3) we report the standard Accuracy scores (Horn et al., 2014).
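As referenced under Task 1 above, the following sketch illustrates the LSIM evaluation metric; `lsim_score` is our own helper name, `pairs` (word pairs with averaged human ratings) is an assumed input, and `encode` can be any extraction function such as the illustrative `iso_vector` from §2.2.

```python
from scipy.stats import spearmanr
import torch.nn.functional as F

def lsim_score(pairs, encode):
    """Spearman's rho between human ratings and cosine similarities of WEs."""
    gold, pred = [], []
    for w1, w2, human_score in pairs:
        gold.append(human_score)
        pred.append(F.cosine_similarity(encode(w1), encode(w2), dim=0).item())
    return spearmanr(gold, pred).correlation

# Toy usage, e.g. with the iso_vector sketch from Section 2.2:
# rho = lsim_score([("cat", "dog", 3.1), ("sofa", "couch", 9.4)], encode=iso_vector)
```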
Important Disclaimer. We note that the main purpose of the chosen evaluation tasks and experimental protocols is not necessarily to achieve state-of-the-art performance, but rather to probe the vectors in different lexical tasks requiring different types of lexical knowledge, and to offer fair and insightful comparisons between different LEXFIT variants, as well as against standard static WEs (fastText) and non-tuned BERT-based static WEs.

Results and Discussion
The main results for all four tasks are summarized in Tables 1-4, and further results and analyses are available in §4.1 (with additional results in the appendix). These results offer multiple axes of comparison, discussed in what follows.

Comparison to Other Static Word Embeddings.
The results over all 4 tasks indicate that static WEs from LEXFIT-ed monolingual BERTs 1) outperform traditional WE methods such as FT, and 2) also offer large gains over WEs originating from non-LEXFIT-ed BERTs. These results demonstrate that the inexpensive lexical fine-tuning procedure can indeed turn large pretrained LMs into effective decontextualized word encoders, and that this can be achieved for a reasonably wide spectrum of languages for which such pretrained LMs exist. What is more, LEXFIT for all non-EN languages has been run with noisy, automatically translated lexical constraints, which holds promise of even stronger static LEXFIT-based WEs with human-curated data in the future, e.g., extracted from multilingual WordNets (Bond and Foster, 2013), PanLex (Kamholz et al., 2014), or BabelNet (Ehrmann et al., 2014).
The results give rise to additional general implications. First, they suggest that pretrained LMs store even more lexical knowledge than previously thought (Ethayarajh, 2019; Bommasani et al., 2020); the role of LEXFIT fine-tuning is simply to 'rewire' and expose that knowledge from the LM through (limited) lexical-level supervision. To further investigate the 'rewiring' hypothesis, in §4.1 we also run LEXFIT with a drastically reduced amount of external knowledge.
BERT-REG vectors display large gains over FT vectors in tasks such as RELP and LexSIMP, again hinting that plenty of lexical knowledge is stored in the original parameters. However, they still lag behind FT vectors in some tasks (BLI for all language pairs; LSIM for ES, RU, PL). LEXFIT-ed BERT-based WEs, in contrast, offer large gains and outperform FT WEs across the board. Our results indicate that 'classic' WE models such as skip-gram (Mikolov et al., 2013b) and FT are now challenged even in their last stronghold: lexical tasks.
This comes as a natural finding, given that word2vec and FT can in fact be seen as reduced and training-efficient variants of full-fledged language models (Bengio et al., 2003). Modern LMs are pretrained on larger training data, with more parameters, and with more sophisticated Transformer-based neural architectures. However, it had not been verified before that effective static WEs can be distilled from such LMs. Efficiency differences aside, this begs the following discussion point for future work: given the existence of large pretrained LMs, and of effective methods to extract static WEs from them, as proposed in this work, how useful are traditional WE models still in NLP applications?
Lexical Fine-Tuning Objectives. The scores indicate that all LEXFIT variants are effective and can expose the lexical knowledge from the fine-tuned BERTs. However, there are differences in their task performance: the ranking-based MNEG and MSIM variants display stronger performance on similarity-based ranking lexical tasks such as LSIM and BLI. The classification-based SOFTMAX objective is, as expected, better aligned with the RELP task, and we note slight gains with its ternary variant, which leverages extra antonymy knowledge.
This is well aligned with recent findings demonstrating that task-specific pretraining results in stronger (sentence-level) task performance (Glass et al., 2020; Henderson et al., 2020; Lewis et al., 2020). In our case, we show that task-specific lexical fine-tuning can reshape the underlying LM's parameters not only to act as a universal word encoder, but also towards a particular lexical task.
The per-epoch time measurements in Table 1 validate the efficiency of LEXFIT as a post-training fine-tuning procedure. Previous approaches that attempted to inject lexical information (i.e., word senses and relations) into large LMs (Lauscher et al., 2020; Levine et al., 2020) relied on joint LM (re)training from scratch, which is effectively costlier than training the original BERT models.
Performance across Languages and Tasks. As expected, the scores in absolute terms are highest for EN: we attribute this to (a) larger LM pretraining data and (b) clean external lexical knowledge. However, we note encouragingly large gains in target languages even with noisy translated lexical constraints. LEXFIT variants show similar relative patterns across different languages and tasks. We note that, while BERT-REG vectors are unable to match FT performance in the BLI task, our LEXFIT methods (e.g., see the MNEG and MSIM BLI scores) outperform FT WEs in this task as well, offering improved alignability (Søgaard et al., 2018) between monolingual WEs. The large gains of BERT-REG over FT in RELP and LexSIMP across all evaluation languages already suggest that plenty of lexical knowledge is stored in the pretrained BERTs' parameters; however, LEXFIT-ing the models offers further gains in LexSIMP and RELP across the board, even with limited external supervision (see also Figure 2c).
The high scores for FI in LSIM and BLI are aligned with prior work (Virtanen et al., 2019; Rust et al., 2021) that showcased the strong monolingual performance of FI BERT in sentence-level tasks. Along this line, we note that the final quality of LEXFIT-based WEs in each language depends on several factors: 1) the pretraining data; 2) the underlying LM; 3) the quality and amount of external knowledge.

Further Discussion
The multi-component LEXFIT framework allows for a plethora of additional analyses, varying components such as the underlying LM and the properties of the LEXFIT variants (e.g., negative examples, fine-tuning duration, the amount of lexical constraints). We now analyze the impact of these components on the "lexical quality" of the LEXFIT-tuned static WEs. Unless noted otherwise, for computational feasibility and to avoid clutter, we focus 1) on a subset of target languages: EN, ES, FI, IT; 2) on the MSIM variant (k = 1), which showed robust performance in the main experiments; and 3) on LSIM, BLI, and RELP as the main tasks in these analyses, as they offer higher language coverage.

Varying the Amount of Lexical Constraints.
We probe what amount of lexical knowledge is required to turn BERTs into effective decontextualized word encoders by running tests with reduced lexical sets P sampled from the full set. The scores over different P sizes, averaged over 5 samples per size, are provided in Figure 2; we note that the trends extend to other evaluation languages and LEXFIT objectives. As expected, we observe performance drops with less external data. However, the decrease is modest even when relying on only 5k external constraints (e.g., see the BLI and RELP scores for all languages; the EN Multi-SimLex score is 69.4 with 50k constraints, 65.0 with 5k), or even non-existent (RELP in FI). Remarkably, LEXFIT performance with only 10k or 5k fine-tuning pairs remains substantially higher than with FT or BERT-REG WEs in all tasks. This empirically validates LEXFIT's sample efficiency and further corroborates our knowledge rewiring hypothesis: the original LMs already contain plenty of useful lexical knowledge implicitly, and even a small amount of external supervision can expose that knowledge.
Copying or Rewiring Knowledge? The large gains over BERT-REG even with a mere 5k pairs (LEXFIT-ing then takes only a few minutes), where a large portion of the 100K-word vocabulary is not covered in the external input, further reveal that LEXFIT does not merely copy the knowledge of seen words and relations into the LM: it leverages the (small) external set to generalize to uncovered words.
We confirm this hypothesis with another experiment, where our input LM has the same BERT Base architecture and the same subword vocabulary as English BERT, but its parameters are randomly initialized using the Xavier initialization (Glorot and Bengio, 2010). Running LEXFIT on this model for 10 epochs with the full set of lexical constraints (see §3) yields the following LSIM scores: 23.1 (Multi-SimLex) and 14.6 (SimVerb), and an English RELP accuracy of 61.8%. These scores are substantially higher than those of fully random static WEs (see the appendix), which indicates that the LEXFIT procedure does enable storing some lexical knowledge in the model parameters. However, at the same time, they are substantially lower than the scores achieved when starting from LM-pretrained models, even when LEXFIT is run with a mere 5k fine-tuning lexical pairs. This again strongly suggests that LEXFIT 'unlocks' already available lexical knowledge stored in the pretrained LM, yielding benefits beyond the knowledge available in the external data. Another line of recent work (Liu et al., 2021) further corroborates our findings.
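A sketch of how such a randomly initialized control model can be constructed with transformers is shown below; the original experiment's initialization details beyond "Xavier initialization" are not specified, so this is an approximation under our own assumptions.

```python
import torch.nn as nn
from transformers import AutoConfig, AutoModel

# Same BERT Base architecture and subword vocabulary, but no pretraining:
config = AutoConfig.from_pretrained("bert-base-uncased")
random_bert = AutoModel.from_config(config)   # randomly initialized weights

def xavier_init(module):
    """Overwrite the default initialization with Xavier (Glorot) init."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.xavier_uniform_(module.weight)

random_bert.apply(xavier_init)   # this model is then LEXFIT-ed as usual
```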
Multilingual LMs. Prior work has indicated that massively multilingual LMs such as multilingual BERT (mBERT) (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) cannot match the performance of their language-specific counterparts in both lexical and sentence-level tasks (Rust et al., 2021). We also analyze this conjecture by LEXFIT-ing mBERT instead of the monolingual BERTs in different languages. The results with MSIM (k = 1) are provided in Figure 4; we observe similar comparison trends with other languages and LEXFIT variants, not shown due to space constraints. While LEXFIT-ing mBERT offers huge gains over the original mBERT model, sometimes even larger in relative terms than with monolingual BERTs (e.g., LSIM scores for EN increase from 0.21 to 0.69, and from 0.24 to 0.60 for FI; BLI scores for EN-FI rise from 0.21 to 0.37), it cannot match the absolute performance peaks of LEXFIT-ed monolingual BERTs.
Storing the knowledge of 100+ languages in its limited parameter budget, mBERT still cannot capture monolingual knowledge as accurately as language-specific BERTs (Conneau et al., 2020). However, we believe that its performance with LEXFIT may be further improved by leveraging recently proposed multilingual LM adaptation strategies that mitigate a mismatch between shared multilingual and language-specific vocabularies (Artetxe et al., 2020;Chung et al., 2020;Pfeiffer et al., 2020); we leave this for future work.
Layerwise Averaging. A consensus in prior work (Tenney et al., 2019; Ethayarajh, 2019) points out that out-of-context lexical knowledge in pretrained LMs is typically stored in the bottom Transformer layers (see Table 5). However, Table 5 also reveals that this no longer holds after LEXFIT-ing: the tuned model requires knowledge from all layers to extract effective decontextualized WEs and reach peak task scores. Effectively, this means that, through lexical fine-tuning, the model "reformats" its entire parameter budget towards storing useful lexical knowledge, that is, it specializes as a (decontextualized) word encoder.
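As a usage example, the layer cutoff $n$ can be swept by reusing the illustrative `iso_vector` and `lsim_score` helpers from the earlier sketches; `multisimlex_pairs` is an assumed evaluation set of (word1, word2, rating) triples.

```python
# Compare AVG(<= n) variants, e.g. before vs. after LEXFIT-ing the LM:
for n in range(1, 13):
    rho = lsim_score(multisimlex_pairs,
                     encode=lambda w, n=n: iso_vector(w, n=n))
    print(f"AVG(<={n:2d}): Spearman = {rho:.3f}")
```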
Varying the Number of Negative Examples. The impact of the number of negatives on task performance is recapped in Figure 3b. Overall, increasing k does not benefit (and sometimes even hurts) performance; the exceptions are EN LSIM, and the RELP task with the SOFTMAX variant for some languages. We largely attribute this to the noise in the target-language lexical pairs: with larger k values, it becomes increasingly difficult for the model to discern between noisy positive examples and random negatives.
Longer Fine-Tuning. Instead of the standard setup with 2 epochs (see §3), we also run LEXFIT for 10 epochs. The per-epoch snapshots of scores are summarized in the appendix. The scores again validate that LEXFIT is sample-efficient: longer fine-tuning yields negligible to zero improvements in EN LSIM and RELP after the first few epochs, with very high scores achieved after epoch 1 already. It even yields small drops for other languages in LSIM and BLI: we again attribute this to slight overfitting to the noisy target-language lexical knowledge.

Conclusion and Future Work
We proposed LEXFIT, a lexical fine-tuning procedure which transforms pretrained LMs such as BERT into effective decontextualized word encoders through dual-encoder architectures. Our experiments demonstrated that the lexical knowledge already stored in pretrained LMs can be further exposed via additional inexpensive LEXFIT-ing with (even limited amounts of) external lexical knowledge. We successfully applied LEXFIT even to languages without any external human-curated lexical knowledge. Our LEXFIT word embeddings (WEs) outperform "traditional" static WEs (e.g., fastText) across a spectrum of lexical tasks and diverse languages in controlled evaluations, thus directly questioning the practical usefulness of the traditional WE models in modern NLP.
Besides inducing better static WEs for lexical tasks, following the line of lexical probing work (Ethayarajh, 2019), our goal in this work was to understand how (and how much) lexical semantic knowledge is coded in pretrained LMs, and how to 'unlock' that knowledge from the LMs. We hope that our work will be beneficial for all lexical tasks where static WEs from traditional WE models are still largely used (Schlechtweg et al., 2020; Kaiser et al., 2021).
Despite the extensive experiments, we have only scratched the surface, and can indicate a spectrum of future enhancements to the proof-of-concept LEXFIT framework beyond the scope of this work. We will test other dual-encoder loss functions, including finer-grained relation classification tasks (e.g., in the SOFTMAX variant), and hard (instead of random) negative examples (Wieting et al., 2015; Lauscher et al., 2020; Kalantidis et al., 2020). While in this work, for simplicity and efficiency, we focused on the fully decontextualized ISO setup (see §2.2), we will also probe alternative ways to extract static WEs from pretrained LMs, e.g., averages over contexts (Liu et al., 2019; Bommasani et al., 2020). We will also investigate other approaches to procuring more accurate external knowledge for LEXFIT in target languages, and extend the framework to more languages, lexical tasks, and specialized domains. Finally, we will focus on reducing the gap between pretrained monolingual and multilingual LMs.