Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling

In this paper, we propose an alternative to the classic masked language modeling (MLM) pre-training paradigm, where we modify the objective from the reconstruction of the exact identity of randomly selected masked subwords to the prediction of their latent semantic properties. We coin the proposed pre-training technique masked latent semantic modeling (MLSM for short). In order to make the contextualized determination of the latent semantic properties of the masked subwords possible, we rely on an unsupervised technique using sparse coding. Our experimental results reveal that the fine-tuned performance of the models that we pre-trained via MLSM is consistently and significantly better compared to the use of vanilla MLM pre-training and other strong baselines.


Introduction
Recent successes in natural language processing are predominantly fueled by the use of large pre-trained language models (PLMs) that are constructed in a self-supervised manner over massive amounts of raw text. Autoencoder-style language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; inter alia) are typically trained via masked language modeling (MLM).
PLMs pre-trained using MLM are capable of returning distributions over their vocabulary that peak at plausible substitutes of masked (sub)tokens given some sequence of input text. The individual updates performed during MLM pre-training, however, are not aligned with what we expect from the PLMs in the long run, i.e., that they output distributions of plausible substitutes for masked (sub)tokens.
As a motivating example, consider the sentence "Alice is eating a cake.", and suppose that we randomly select the token cake to be masked. For such a training example, we would obtain the masked input sentence "Alice is eating a [MASK].".
The smallest possible training loss for this particular example would be incurred if our model allocated all of its output probability mass to the word "cake" as the only possible replacement of the [MASK] token, while assigning precisely zero probability to alternatives that are otherwise perfectly viable from a human cognitive perspective, including words such as pear, croissant, and soup.
What eventually provides PLMs trained with the MLM objective with the ability to output token distributions that are plausible from a human perspective is that they are trained over massive amounts of diverse batches, so the different possible substitutes even out in expectation. The hypothesis that we investigate in this paper is that we can train PLMs more efficiently if, instead of relying on the exact identity of the masked tokens during pre-training, we require our model to output distributions that are not peaked at a single symbol (corresponding to the identity of the masked token).
A more natural and perhaps more sample efficient option to overcome the misalignment between the individual pre-training updates of PLMs and their long-term objective would be to compare the output distribution of the model to some desired distribution of substitutes which, instead of encoding the masked input token with a one-hot categorical distribution, assigns nonzero probabilities to multiple viable tokens. Note, however, that if we had access to such desired output token distributions for the masked tokens, then language modeling would already be solved and the task of training PLMs would become obsolete.
An alternative approach for pre-training that mitigates the exact reliance on the identity of the masked tokens could rely on semantic resources, such as WordNet (Fellbaum, 1998) and ConceptNet (Speer et al., 2017). In this case, one might require the language model to output semantic properties of the masked tokens; i.e., in the previous example sentence, instead of recovering the word cake for the [MASK] token, the goal of the PLM would be to output its semantic properties, such as that the masked word refers to a concept that is edible and is a kind of dessert.
One difficulty of such an approach is that obtaining semantic resources with sufficient coverage is notoriously arduous (Gale et al., 1992). In addition, such a pre-training paradigm would require the training corpus to be annotated with the possible properties of the words in their particular contexts. Only a few well-resourced languages have such sense-annotated corpora available, e.g., SEMCOR (Miller et al., 1994) for English, but their size is still several orders of magnitude smaller than the size of the corpora used for pre-training PLMs.
In this work, we propose masked latent semantic modeling (MLSM), an alternative to MLM that attenuates the direct reliance on the identity of the masked tokens during pre-training. The main idea behind MLSM is that we no longer require the PLM to recover the exact identity of the masked tokens, but rather to predict their (latent) semantic properties.
Our pre-training procedure is based on the observation that sparse representations obtained via sparse coding on the hidden representations of neural models align well with human-interpretable properties (Balogh et al., 2020; Berend, 2020; Yun et al., 2021). We incorporate this property into MLSM pre-training by deriving context-sensitive latent semantic information about the masked tokens, obtained by performing sparse coding on their hidden representations, and making the prediction of this information the pre-training task. Since we determine the sparse representations in an unsupervised way, our approach is not affected by the difficulties that would arise when relying on external semantic resources.
Our evaluation confirms our expectations, i.e., that PLMs pre-trained with MLSM can significantly outperform models that are trained via vanilla MLM and other strong baselines. We release our code for performing MLSM pre-training at https://github.com/szegedai/MLSM.

Related work
Integrating external semantic knowledge into PLMs has gained increasing research interest (Mihaylov and Frank, 2018; Bauer et al., 2018; Peters et al., 2019; Ye et al., 2019; Yang et al., 2019; Qiu et al., 2019; Liu et al., 2020; Wang et al., 2021; Lu et al., 2021). These efforts cover a wide methodological spectrum, depending on how the external knowledge is incorporated into the PLMs. The different approaches can be distinguished, for instance, by the location where the external knowledge gets injected, i.e., at the input, architectural, or output level.
Colon-Hernandez et al. (2021) provide a comprehensive overview of approaches aiming at the incorporation of external knowledge into PLMs. What all these approaches have in common is that they rely on some explicit knowledge representation, e.g., in the form of triplets of some knowledge graph. In contrast, our main research question is whether it is possible to increase the semantic awareness of PLMs without having access to explicitly stored knowledge during pre-training.
SenseBERT (Levine et al., 2020) is a modification of BERT that also aims at the integration of semantic knowledge; however, it differs from our approach in multiple respects. The most important difference is that MLSM does not require any external linguistic resources, whereas SenseBERT relies on WordNet, making it available only for languages where such a linguistic resource is accessible.
Our fully unsupervised approach for inducing implicit semantic information for words within their context builds on the observation that performing sparse coding over the contextual representations of PLMs results in sparse contextualized representations that align well with the semantic categories of the words. Berend (2020) showed that these sparse vectors can be used for improving word sense disambiguation (WSD), whereas Yun et al. (2021) used them for creating visualizations that help in understanding the inner workings of PLMs from a semantic perspective. Berend (2022) provided further evidence that sparse contextualized word representations can be successfully exploited in cross-lingual WSD as well.

Methodology
As opposed to vanilla MLM, which is agnostic to the semantics of the masked tokens, we propose MLSM, an alternative pre-training formulation that is enhanced with semantic information.

Determining semantic information
As mentioned earlier, we wish to incorporate context-sensitive semantic information of the masked tokens into the pre-training procedure.
That is, for some token $t_i$ within an input sequence of tokens $I = [t_1, t_2, \ldots, t_i, \ldots, t_{|I|}]$, we need to be able to determine its semantic profile $s(t_i)$.

Using external semantic resources
Semantic information about words can be obtained from ontologies or knowledge graphs such as WordNet (Fellbaum, 1998) and ConceptNet (Speer et al., 2017), which contain semantic information in the form of triplets. For instance, (cake, HasProperty, sweet) is one of the triplets included in ConceptNet, based on which $s(t_i)$ can be determined.
One difficulty arising when using knowledge resources for determining $s(t_i)$ originates from the potential ambiguity of words, as knowledge bases can provide multiple (often semantically conflicting) relations in which a word can partake. For instance, ConceptNet contains the triplet (book, AtLocation, university), as well as (book, SameAs, reserve), which refer to the noun and verb senses of the word book, respectively. When the knowledge base offers multiple semantic relations in which a word is involved, we choose one of them uniformly at random, which is akin to how Levine et al. (2020) handled ambiguity.
We consider a modified pre-training procedure in which we extend the output vocabulary of our model with special symbols corresponding to the relations pertaining to some knowledge base K, and the objective of pre-training is to output, for a randomly masked input token, a special symbol that is compatible with the knowledge in K. We refer to this approach as MESM (Masked Explicit Semantic Modeling). When referring to a concrete realization of MESM, we suffix it with the abbreviated name of the knowledge base we rely on, with WN and CN denoting WordNet and ConceptNet, respectively.
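To make the construction of MESM targets concrete, the following minimal sketch shows how a masked word could be mapped to one such special symbol; the toy KB dictionary and the helper name mesm_target are our own illustrative assumptions rather than part of the released implementation.

```python
import random

# Toy stand-in for a ConceptNet-style triplet store: word -> (relation, tail)
# pairs. The entries below are illustrative assumptions, not a real KB dump.
KB = {
    "book": [("AtLocation", "university"), ("SameAs", "reserve")],
    "cake": [("HasProperty", "sweet"), ("IsA", "food")],
}

def mesm_target(word: str) -> str:
    """Map a masked word to one special output symbol, e.g. 'IsA_food'.

    When the KB offers multiple relations for an ambiguous word, one of
    them is chosen uniformly at random, mirroring the handling of
    ambiguity described above.
    """
    relation, tail = random.choice(KB[word])
    return f"{relation}_{tail}"

print(mesm_target("book"))  # e.g. 'AtLocation_university'
```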

Using sparse coding
We propose an efficient unsupervised method for determining the latent semantic description of any token given its context. Our approach requires a teacher PLM, denoted by T. During MLSM pre-training of a student model S, we can use T for inferring latent semantic information that we require S to recover as its pre-training objective.
Our proposed way of determining $s(t_i)$ from T is based on the use of sparse coding (Mairal et al., 2009). We first perform a dictionary learning phase, during which we solve

$$\min_{D,\, \alpha_j \geq 0} \sum_j \frac{1}{2} \bigl\lVert h_j^{(l)} - D \alpha_j \bigr\rVert_2^2 + \lambda \lVert \alpha_j \rVert_1, \tag{1}$$

where $D \in \mathbb{R}^{d \times k}$ is a dictionary matrix, with the norm of its column vectors not exceeding 1, $\alpha_j \in \mathbb{R}^k$ is a sparse vector of coefficients indicating the extent to which the vectors from $D$ are used for the reconstruction of $h_j^{(l)} \in \mathbb{R}^d$, the hidden state of token $j$ determined by T in layer $l$, and $\lambda$ is the regularization coefficient that controls the sparsity of $\alpha_j$.
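As an illustration, dictionary learning as in (1) could be carried out along the following lines; we use scikit-learn's MiniBatchDictionaryLearning purely as an assumed stand-in solver (the paper does not prescribe a specific library), and the randomly generated hidden_states matrix is a placeholder for the layer-$l$ hidden states collected from T.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Placeholder for the (n_tokens, d) matrix of layer-l hidden states of the
# teacher T; in the paper these are collected from SEMCOR via bert-base-cased.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((10_000, 512)).astype(np.float32)

# Solve (1): learn k = 3000 atoms with sparsity weight lambda = 0.05 and
# non-negative codes; scikit-learn keeps the learned atoms norm-constrained.
dict_learner = MiniBatchDictionaryLearning(
    n_components=3000,   # k semantic atoms
    alpha=0.05,          # lambda in (1)
    positive_code=True,  # the non-negativity constraint on alpha_j
    random_state=0,
)
dict_learner.fit(hidden_states)
D = dict_learner.components_.T  # (d, k); columns are the semantic atoms
```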
Once the dictionary matrix $D$ is determined, we can obtain a sparse contextualized representation for any $h_i^{(l)}$, i.e., a hidden state from layer $l$ of T, as

$$\alpha_i = \arg\min_{\alpha \geq 0} \frac{1}{2} \bigl\lVert h_i^{(l)} - D \alpha \bigr\rVert_2^2 + \lambda \lVert \alpha \rVert_1. \tag{2}$$

An important difference between (1) and (2) is that in the latter case we do not optimize over $D$; hence, the determination of the sparse coefficients for a fixed dictionary matrix $D$ corresponds to solving a LASSO optimization problem on the hidden representations of the tokens for which we determine sparse contextualized representations.
As both (1) and (2) include a non-negativity constraint on $\alpha_i$, a natural approach to convert its sparsity structure into a latent semantic distribution is to $\ell_1$-normalize it. That way, we can treat the $\ell_1$-normalized coefficients of $\alpha_i$ as the probabilities with which the $k$ latent semantic properties (expressed by the column vectors of $D$) hold for token $t_i$ in its particular context.
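Continuing the sketch above, solving (2) with the dictionary held fixed and $\ell_1$-normalizing the resulting coefficients could look as follows; again, scikit-learn's SparseCoder is an assumed solver, not necessarily the authors' tooling.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

# Solve (2): LASSO with the dictionary fixed and non-negative coefficients.
# Note that scikit-learn stores atoms as rows, i.e. the transpose of D.
coder = SparseCoder(
    dictionary=dict_learner.components_,  # (k, d)
    transform_algorithm="lasso_lars",
    transform_alpha=0.05,
    positive_code=True,
)
alphas = coder.transform(hidden_states)  # (n_tokens, k), sparse, non-negative

# l1-normalize each row so the coefficients form a latent semantic distribution.
row_sums = alphas.sum(axis=1, keepdims=True)
latent_distributions = alphas / np.clip(row_sums, 1e-12, None)
```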
During MLSM pre-training, we consider these sparse normalized distributions obtained from (2) for each masked token as the latent semantic information describing them. Similar to the integration of explicit human-collected semantic information into pre-training described in Section 3.1.1, we introduce $k$ new special symbols into the output vocabulary of the model.
The $k$ new symbols correspond to the semantic atoms that comprise the columns of the dictionary matrix $D$ determined by (1), and we compute the loss of the student model S for the masked input tokens by comparing its output distribution over the $k$ special symbols with the latent semantic distribution that we obtain from the teacher model T as described above. Unless stated otherwise, the loss that we employ during MLSM for comparing the output distribution of S and the desired target distribution derived from T is the KL divergence.
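A minimal PyTorch sketch of this loss is shown below; the function name, tensor shapes, and the use of a softmax over the $k$ special symbols are our assumptions about one plausible realization.

```python
import torch
import torch.nn.functional as F

def mlsm_loss(student_logits: torch.Tensor, teacher_codes: torch.Tensor) -> torch.Tensor:
    """KL divergence between the teacher-derived latent semantic distribution
    and the student's output over the k special symbols.

    student_logits: (num_masked, k) raw scores from the student S
    teacher_codes:  (num_masked, k) non-negative sparse coefficients from T
    """
    target = teacher_codes / teacher_codes.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(student_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_probs, target, reduction="batchmean")
```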

Experiments
We pre-trained several BERT language models of medium size relying on the transformers library (Wolf et al., 2020). The medium models (Turc et al., 2019; Bhargava et al., 2021) comprise 8 transformer blocks and use hidden representations of 512 dimensions. We considered the official pre-trained bert-base-cased model as the teacher model T for MLSM, and we relied on its tokenizer for all the models that we pre-trained.
In order to assess the effects of the different pre-training variants, we evaluated the fine-tuning performance of the models over 10 diverse tasks. We repeated all fine-tuning experiments 10 times in order to account for the variability of the results that occurs due to the random initialization of the task-specific classification head (Dodge et al., 2020).
We used 3 NVIDIA A6000 GPUs for performing the experiments. Each pre-training and the repeated fine-tuning experiments that followed it for a single checkpoint took approximately 2 GPU weeks and 1 GPU day, respectively.

Details on pre-training
We used the preprocessing pipeline released as part of the WikiBERT models (Pyysalo et al., 2021) for obtaining a recent Wikipedia dump for pre-training. We also shuffled the preprocessed input sequences to ensure the diversity of the pre-training batches. Our corpus consisted of approximately 125 million sentences and 2.7 billion whitespace-delimited tokens.
As a large batch size has been demonstrated to be beneficial during pre-training (Liu et al., 2019), we set the batch size to 1024 (using gradient accumulation over 32 batches). To ensure model comparability, we fixed the input contents (including the positions of the tokens being masked) and the order of the batches across the different pre-trainings.
We performed 300,000 update steps, resulting in approximately 300 million processed input sequences. Similar to the original implementation of BERT (Devlin et al., 2019), in order to speed up pre-training, we used a maximal sequence length of 128 for 90% of the pre-training, then increased it to 512 for the remaining steps. We used the typical learning rate of 1e−4 with linear learning rate scheduling, a warm-up phase of 3,000 steps, and the AdamW optimizer (Loshchilov and Hutter, 2019).
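Under these stated hyperparameters, the optimizer setup might look roughly as follows (a sketch; the placeholder Linear module stands in for the actual student PLM).

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(512, 512)  # placeholder for the student PLM S
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=3_000,      # warm-up phase of 3,000 steps
    num_training_steps=300_000,  # total number of update steps
)
```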
The way MLSM assigns latent semantic distributions to tokens requires obtaining $D$. For doing so, we followed Berend (2020), i.e., we processed the SEMCOR corpus (Miller et al., 1994) and collected the hidden states from the last layer of bert-base-cased, while setting the number of semantic atoms and the regularization coefficient to $k = 3000$ and $\lambda = 0.05$, respectively.
Besides using MLSM, we constructed four different PLMs that served as our baselines. One of our baselines was pre-trained using the standard MLM objective, where the goal is to reconstruct the exact identity of the masked tokens from the output distribution of the model.
We pre-trained two further PLMs, for which we relied on explicit semantic information in the form of knowledge bases. For these models, which we refer to as MESM-WN and MESM-CN, we required the models to output semantic information about the masked words according to WordNet and ConceptNet, respectively. When relying on WordNet, we introduced 45 additional special symbols into the vocabulary of our model, each corresponding to one of the possible lexnames in WordNet, and our goal was to classify masked tokens into their correct supersense. When using ConceptNet, we first collected its 3000 most frequent relations (since we used that many semantic atoms for determining our latent semantic descriptions as well), and created a special symbol for each (e.g., IsA_food).
Since MLSM utilizes a pre-trained teacher model T, which allows us to determine the latent semantic distributions required by MLSM pre-training, a natural baseline to compare against is distillation from the same teacher model. We refer to this distilled pre-training as MLM-D. In this scenario, we first determine the token output distribution of T for the masked tokens, and calculate our pre-training loss as the KL divergence between that distribution and the one our model outputs.
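A sketch of the MLM-D objective, analogous to the MLSM loss above but computed over the full token vocabulary, could look as follows; the temperature argument is our own illustrative addition, with the default of 1 corresponding to plain distribution matching.

```python
import torch
import torch.nn.functional as F

def mlm_d_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's token
    distributions at the masked positions.

    Both tensors have shape (num_masked, vocab_size).
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```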
As many of the datasets are part of the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks, where the labels of the test sets are not available, we performed our evaluation on the development sets. The hyperparameters were not tuned in any way to perform well on these sets, i.e., we used the same hyperparameters for all the tasks, and our choice of the selected values was driven purely by adapting commonly used values from earlier work and common best practices.
We accessed the above benchmarks and performed the evaluation of the fine-tuned models via the datasets and evaluate libraries (Lhoest et al., 2021; von Werra et al., 2022). We used the same commonly used hyperparameters for fine-tuning on all the datasets. That is, we used a learning rate of 2e−5 with linear learning rate scheduling and a batch size of 32, performing 3 epochs. As the evaluation metric, we always report the fine-tuning performance after the last epoch.
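With the transformers Trainer API, these fine-tuning hyperparameters would translate to roughly the following configuration (a sketch; the output directory is a placeholder, and we do not claim the authors used the Trainer API itself).

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-out",       # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
```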

Results of the fine-tuning experiments
As mentioned before, evaluations were conducted 10 times for each task and each differently pre-trained PLM. The largest performance gap is between MLSM and the approaches that try to integrate external human knowledge bases into the pre-training (the MESM-* paradigms). On average, the distilled model (MLM-D) stayed the closest in performance to the PLM pre-trained using MLSM; however, its performance still lags considerably behind that of our proposed model.
We performed the 10 repeated experiments for each dataset in a way that ensures their comparability across differently pre-trained models, i.e., we made sure that whenever a pair of PLMs was evaluated with the same seed on the same task, their classification heads were initialized identically. Additionally, for each task, the batches included the same instances and were utilized in the same order during the fine-tuning experiments.
In Figure 1, we depict the pairs of final task performances across all the fine-tuning experiments between the comparable trials of the two best-performing models, i.e., the one pre-trained with MLSM and the one using distillation (MLM-D). We can see that the MLSM pre-trained model resulted in better fine-tuning performance compared to the distilled model in most of the cases. Indeed, the p-value of the Wilcoxon signed-rank test between their paired experimental results is p < 2e−12, which indicates that MLSM performs significantly better during fine-tuning compared to standard distillation. The p-value between the results of vanilla MLM and MLSM was even smaller, i.e., p < 3e−17.
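The reported significance test amounts to a paired Wilcoxon signed-rank test over the matched fine-tuning scores, e.g. via SciPy; the score arrays below are illustrative placeholders, not our actual results.

```python
from scipy.stats import wilcoxon

# mlsm_scores[i] and mlm_d_scores[i] come from the i-th paired trial
# (same task, same seed, identically initialized classification heads).
mlsm_scores = [0.79, 0.81, 0.78, 0.83]   # illustrative placeholder values
mlm_d_scores = [0.77, 0.80, 0.76, 0.82]  # illustrative placeholder values
stat, p_value = wilcoxon(mlsm_scores, mlm_d_scores)
```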

Investigating pre-training dynamics
Besides investigating the fine-tuning performance of the differently pre-trained models at the end of pre-training, we additionally evaluated them at different checkpoints, i.e., after processing 10% (30K), 25% (75K), 50% (150K), and 100% (300K) of all the update steps performed during pre-training.
Figure 2 illustrates that the fine-tuning performance of the PLM that was pre-trained using MLSM is substantially higher than that of any of the PLMs pre-trained in an alternative fashion. This observation holds not only at the end of pre-training, but also at the earlier checkpoints.
Furthermore, the performance of the PLM relying on MLSM already surpasses the end-of-training performance of all alternatively pre-trained PLMs at its 25% checkpoint (at 75K pre-training steps), supporting our earlier hypothesis that MLSM converges faster and provides a more sample-efficient form of pre-training.

The role of using semantic distributions
As mentioned in Section 3.1.2, the loss function employed by MLSM is the KL divergence between the latent semantic distributions determined for the masked tokens using the teacher model T and the output of the student model S. We experimented with a variant of MLSM that does not utilize the entire latent semantic distribution of the masked tokens, but only considers the index of the semantic category assigned the highest probability mass.
Under this variant of MLSM, the loss function we employed was no longer the KL divergence, but the cross entropy between the output of S and the most probable latent semantic category determined from T. We refer to this variant of masked latent semantic modeling as MLSM-CE (owing to the use of cross entropy as the loss function).
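A sketch of the MLSM-CE loss, mirroring the KL-based sketch given earlier, might look as follows; shapes and names are again our assumptions.

```python
import torch
import torch.nn.functional as F

def mlsm_ce_loss(student_logits: torch.Tensor, teacher_codes: torch.Tensor) -> torch.Tensor:
    """Cross entropy against the single most probable latent semantic
    category per masked token, instead of the full distribution."""
    hard_targets = teacher_codes.argmax(dim=-1)  # (num_masked,)
    return F.cross_entropy(student_logits, hard_targets)
```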
Figure 3 compares the evaluation scores of both MLSM and MLSM-CE, revealing that the use of the entire latent semantic distribution determined by our approach, together with the KL divergence loss, is more beneficial than the use of the cross entropy-based loss. The overall average performance scores of the MLSM and MLSM-CE pre-trained models are 0.781 and 0.762, respectively.

Comparing model sizes
As a final quantitative experiment, we pre-trained another BERT model with vanilla MLM training, but this time the model was of the base size.
Apart from its larger capacity, pre-training was conducted in a manner entirely identical to the previously discussed smaller PLMs. The base model differs from the medium models in that it consists of 12 transformer blocks and has a hidden dimension of size 768 (as opposed to the 8 blocks and 512-dimensional hidden states of the medium model).
These differences make the size of the base model more than 2.5 times that of the medium one (there are ≈ 110M and ≈ 42M parameters for the base and medium models, respectively).
In this comparison, we investigated whether the substantially larger capacity of the base model that we pre-trained using standard MLM would make it a clearly better choice for fine-tuning compared to the considerably smaller medium-sized model that we trained with MLSM. Fine-tuning experiments using the base model were also conducted 10 times for each task. The average performances of the two pre-trained models with different capacities and pre-training protocols are included in Table 2.
We can observe that even though the medium model has approximately only 40% of the parameters of the base model, when trained with the proposed MLSM approach, it is still capable of performing close to the more than 2.5×-sized base model. Indeed, the medium model pre-trained with MLSM preserves 99.96% of the average performance of the base model trained with vanilla MLM. In contrast, the medium model pre-trained with vanilla MLM had an average performance of 0.7675 (see Table 1), which corresponds to only roughly 98.17% of the 0.7818 average performance of the base model.

Our computational budget prevented us from using student models S larger than medium size; however, we managed to pre-train models with larger teacher models T (as for those, backpropagation did not have to be performed). The two larger T that we relied on were bert-large-cased and roberta-large. Following the observations in Berend (2020), we determined the contextualized latent semantic profiles based on the hidden representations from layer 21 of the large models.
The fine-tuning results of the fully trained PLMs, averaged over all evaluation scenarios, are reported in Table 3, illustrating that without increasing the capacity of S, we were not able to improve the fine-tuning scores of the PLMs. A likely cause for this is the capacity gap between S and T (nearly a factor of 10). It is worth mentioning that this phenomenon affected not only MLSM, but vanilla distillation as well. In fact, when using BERT large as T, the performances of MLM-D and MLSM are on par, whereas MLSM performs noticeably better than MLM-D when using RoBERTa large as T. That is, using the same T, MLSM always performed at least as well as vanilla distillation.

Illustrating latent semantic distributions
We next illustrate the latent semantic distributions qualitatively over a small example. Consider the following sentences, each containing a token of interest written in boldface: (i) Alice is eating a cake. (ii) Bob is cooking a soup. (iii) The cat sits on the mat. (iv) The fox is chasing a rabbit.
Figure 5 illustrates the pairwise cosine similarities between the token representations of the target words extracted from bert-base-cased and their sparse counterparts that we obtained by solving (2).
We can see in Figure 5a that the pairwise similarities are rather homogeneous for the dense representations, which can be explained by their anisotropy (Ethayarajh, 2019). Figure 5b, in contrast, reveals that the pairwise similarities of the inspected tokens behave more plausibly for the sparse representations.
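Such a comparison can be reproduced along the following lines; dense_vecs and sparse_codes are placeholders for the four target tokens' dense hidden states and their sparse codes from (2).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
dense_vecs = rng.standard_normal((4, 768))             # placeholder dense states
sparse_codes = np.abs(rng.standard_normal((4, 3000)))  # placeholder sparse codes

dense_sims = cosine_similarity(dense_vecs)     # 4x4 pairwise similarity matrix
sparse_sims = cosine_similarity(sparse_codes)  # 4x4 pairwise similarity matrix
```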
Figure 4 helps in better assessing the semantic distributions induced by our approach. Each subfigure considers one of the target words (disclosed in their captions) and contains its top-10 semantic categories (referenced by their indices), ordered by decreasing probability mass.
The subplots also include the probabilities that the other 3 non-target words received for the most dominant latent semantic categories of the target word. We can see in Figure 4a and Figure 4b that words referring to edible things (cake and soup) have an overlapping set of dominant semantic categories (with category 2782 being the most important for both of them), whereas their most prominent latent semantic categories overlap little (if at all) with those of the other two example words, which are related to animals. A similar tendency can be noticed between the animal-related words in Figure 4c and Figure 4d.

Conclusion
In this paper, we proposed masked latent semantic modeling (MLSM), a modification of the classical masked language modeling task in which the goal is changed from the prediction of the exact identity of the masked tokens to that of their latent semantic categories. We suggested a context-sensitive unsupervised approach for determining the latent semantic categories of tokens by performing sparse coding of their hidden representations from a pre-trained teacher model. The reliance on a teacher model makes our approach similar to model distillation in the sense that we can transfer the capabilities of a (larger) pre-trained model into a newly trained one. Comparison of the fine-tuning capabilities of the PLMs pre-trained via classical model distillation and MLSM revealed a clear benefit of the latter approach.
Our experiments also corroborate that MLSM pre-training is more sample efficient compared to other alternatives, as the fine-tuning performance of the pre-trained model at its 25% completeness level was already better than that of any of the alternatively trained models at the end of their pre-training (see Figure 2). More importantly, our experiments revealed that by relying on MLSM pre-training, it was possible to cram the fine-tuning capabilities of a PLM with 2.5× as many parameters into a smaller one (see Table 2). Finally, in order to foster reproducibility of our proposed approach, we share all our code and pre-trained models at https://github.com/szegedai/MLSM and https://huggingface.co/SzegedAI/bert-medium-mlsm, respectively.

Acknowledgments

We are grateful for the possibility to use ELKH Cloud (see Héder et al., 2022; https://science-cloud.hu/), which helped us in achieving the results published in this paper.

Risks and Limitations
Our pre-training relies on distributions of latent semantic properties of masked tokens that we determine in a purely unsupervised manner by performing sparse coding of the hidden states from an already pre-trained PLM. This property makes our approach in principle applicable to pre-training PLMs for any language with an already pre-trained PLM and a pre-training corpus of raw, unannotated text available. Despite this fact, we only considered English in our experiments, which carries the potential risk of reinforcing the community bias of focusing mainly on the English language.
The fact that our proposed approach requires an already pre-trained PLM can be deemed a limitation. Hence, for languages where only a pre-training corpus exists, but no PLM has been pre-trained with vanilla MLM, a classical PLM needs to be pre-trained first, which increases the cost of performing MLSM. We should add, however, that once MLSM pre-training ends, the resulting model has the same fine-tuning costs as a PLM pre-trained with traditional MLM.
It would also be interesting to see whether the proposed pre-training paradigm is applicable to autoregressive models. So far, we have tested our proposed approach only in autoencoder-style masked language models; extending our work to autoregressive models is something we regard as potential future work.
Finally, we considered the pre-training of medium-sized models, with the knowledge being distilled from larger models. This decision was driven by our computational budget, but it would definitely be instructive to see the effects of the same procedure employed for student models of larger capacity.

Figure 1: Scatterplot of the pairwise performances of the individual fine-tuning evaluations of the models pre-trained with MLSM and distillation (MLM-D). Markers above the diagonal indicate evaluations for which the MLSM pre-trained model scored better.

Figure 2: Average performance of the differently pre-trained models as a function of the number of pre-training update steps performed.

Figure 4: Illustration of the 10 most prominent latent semantic categories of the example words and the corresponding probabilities our approach assigned to them.

Figure 5: The pairwise cosine similarity between the words in the example sentences using the original dense contextualized representations and the sparse ones.

Table 1: Fine-tuning performances of the differently pre-trained PLMs, averaged over 10 independent initializations of their classification head.

Table 2: Breakdown of the average performance of the base-sized BERT pre-trained with vanilla MLM and the medium-sized BERT pre-trained with MLSM. The standard deviations of the scores are given in parentheses.

Table 3: The results of vanilla distillation and MLSM when using different teacher models.