Contextual Representation Learning beyond Masked Language Modeling

Currently, masked language modeling (e.g., BERT) is the prime choice for learning contextualized representations. Given its pervasiveness, a natural question arises: how do masked language models (MLMs) learn contextual representations? In this work, we analyze the learning dynamics of MLMs and find that they adopt sampled embeddings as anchors to estimate and inject contextual semantics into representations, which limits the efficiency and effectiveness of MLMs. To address these problems, we propose TACO, a simple yet effective representation learning approach that directly models global semantics. Specifically, TACO extracts and aligns contextual semantics hidden in contextualized representations to encourage models to attend to global semantics when generating contextualized representations. Experiments on the GLUE benchmark show that TACO achieves up to a 5x speedup and up to a 1.2-point average improvement over MLM.


Introduction
In the age of deep learning, the basis of representation learning is to learn distributional semantics. The target of distributional semantics can be summed up in the distributional hypothesis (Harris, 1954): linguistic items with similar distributions have similar meanings. To model similar meanings, traditional representation approaches (Mikolov et al., 2013; Pennington et al., 2014) (e.g., Word2Vec) model distributional semantics by defining tokens with context-independent (CI) dense vectors, i.e., word embeddings, and directly aligning the representations of tokens in the same context. Nowadays, pre-trained language models (PTMs) (Devlin et al., 2019; Radford et al., 2018; Qiu et al., 2020) expand static embeddings into contextualized representations, where each token has two kinds of representations: a context-independent embedding, and a context-dependent (CD) dense representation that stems from its embedding and contains context information. Although language modeling and representation learning have distinct targets, masked language modeling is still the prime choice to learn token representations with access to large-scale raw texts (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020).

(* Equal Contribution. † This work was done at ByteDance AI Lab.)
A natural question arises: how do masked language models learn contextual representations? Following the widely-accepted understanding (Wang and Isola, 2020), MLM optimizes two properties: the alignment of contextualized representations with the static embeddings of masked tokens, and the uniformity of static embeddings in the representation space. In the alignment property, the sampled embeddings of masked tokens serve as anchors to align contextualized representations. We find that although such local anchors are essential for modeling local dependencies, the lack of global anchors brings several limitations. First, experiments show that the learning of contextual representations is sensitive to embedding quality, which harms the efficiency of MLM at the early stage of training. Second, MLM typically masks multiple target words in a sentence, resulting in multiple embedding anchors in the same context. This pushes contextualized representations into different clusters and thus harms the modeling of global dependencies.
To address these challenges, we propose a novel Token-Alignment Contrastive Objective (TACO) to directly build global anchors. By combining local anchors and global anchors, TACO achieves better performance and faster convergence than MLM. Motivated by the widely-accepted belief that the contextualized representation of a token should be the mapping of its static embedding onto the contextual space given global information, we propose to directly align the global information hidden in contextualized representations at all positions of a natural sentence, encouraging models to attend to the same global semantics when generating contextualized representations. Concerning possible relationships between context-dependent and context-independent representations, we adopt the simplest probing method and, for simplicity, extract global information as the gap between the context-dependent and context-independent representations of a token, as shown in Figure 1. Specifically, we define tokens in the same context (text span) as positive pairs and tokens in different contexts as negative pairs, encouraging the global information among tokens within the same context to be more similar than that from different contexts.
We evaluate TACO on the GLUE benchmark. Experimental results show that TACO outperforms MLM with an average 1.2-point improvement and a 5x speedup (in terms of sample efficiency) on BERT-small, and with an average 0.9-point improvement and a 2x speedup on BERT-base.
The contributions of this paper are as follows.
• We analyze the limitations of MLM and propose a simple yet efficient method, TACO, to directly model global semantics.
• Experiments show that TACO outperforms MLM with up to a 1.2-point improvement and up to a 5x speedup on the GLUE benchmark.

Objective Analysis
The key idea of MLM is to randomly replace a few tokens in a sentence with the special token [MASK] and ask a neural network to recover the original tokens. Formally, we define a corrupted sentence as x_1, x_2, ..., x_L and feed it into a Transformer encoder (Vaswani et al., 2017); the hidden states from the final layer are denoted as h_1, h_2, ..., h_L. We denote the embeddings of the corresponding original tokens as e_1, e_2, ..., e_L. The MLM objective can be formulated as:

L_MLM = - \sum_{i \in M} \log \frac{\exp(m_i^\top e_i)}{\sum_{j=1}^{|V|} \exp(m_i^\top e_j)},   (1)

where M denotes the set of masked tokens and |V| is the size of the vocabulary V. m_i is the last-layer hidden state at the masked position and can be regarded as a fusion of the contextualized representations of the surrounding tokens. Following the widely-accepted understanding (Wang and Isola, 2020), Eq. 1 optimizes: (1) the alignment between the contextualized representations of surrounding tokens and the context-independent embedding of the target token, and (2) the uniformity of representations in the representation space.
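As a concrete illustration, the softmax over the vocabulary in Eq. 1 can be sketched in a few lines of NumPy. The shapes, names, and toy dimensions below are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mlm_loss(m, E, targets):
    """Sketch of the MLM objective: for each masked position, cross-entropy
    over a softmax scored by the dot product between the last-layer hidden
    state m_i and every output embedding e_j. Illustrative, not official."""
    logits = m @ E.T                                 # (|M|, |V|) scores m_i . e_j
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative log-likelihood of the original tokens at the masked positions
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
m = rng.normal(size=(3, 8))        # 3 masked positions, hidden size 8
E = rng.normal(size=(20, 8))       # toy vocabulary of 20 embeddings
loss = mlm_loss(m, E, np.array([4, 7, 11]))
```

Minimizing this loss pulls each m_i toward the embedding of its original token (alignment) while the normalizer pushes it away from all other embeddings (uniformity).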
In the alignment part, MLM relies on the sampled context-independent embeddings of masked tokens as anchors to align the contextualized representations in contexts, as shown in Figure 2. The local anchor is the key feature of MLM. Therefore, the learning of contextualized representations heavily relies on embedding quality. In addition, multiple local anchors in a sentence tend to push the contextualized representations of surrounding tokens into different clusters, encouraging models to attend to local dependencies while global semantics are neglected.

Empirical Analysis
To verify our understanding, we conduct comprehensive experiments to investigate: how does the embedding anchor affect the learning dynamics of MLM? We re-train a BERT-small (Devlin et al., 2019) model with only the MLM objective and analyze the changes in its semantic space during pre-training. The training details are described in Appendix A.
Contextualized representation evaluation. In general, if contextualized representations are well learned, contextualized representations in the same context will have higher similarity than those in different contexts. Naturally, we use the gap between intra-sentence similarity and inter-sentence similarity to evaluate the contextual information in contextualized representations, and call this gap the contextual score. The similarity can be evaluated via probing methods such as L2 distance, cosine similarity, etc. We observe similar findings across probing methods and report only cosine similarity here for simplicity. Figure 3(b) shows how the contextual score changes during training. Other statistical results are listed in Appendix A.
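The contextual score above can be sketched as follows; the function names and the toy vectors are illustrative, not code from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contextual_score(intra_pairs, inter_pairs):
    """Contextual score: mean cosine similarity of token-representation
    pairs from the same context minus that of pairs from different
    contexts. Inputs are lists of (vector, vector) pairs."""
    intra = np.mean([cosine(a, b) for a, b in intra_pairs])
    inter = np.mean([cosine(a, b) for a, b in inter_pairs])
    return float(intra - inter)

# Toy check: representations that cluster by context yield a positive score.
ctx_a = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]   # same context
ctx_b = [np.array([0.1, 1.0])]                          # a different context
score = contextual_score([(ctx_a[0], ctx_a[1])], [(ctx_a[0], ctx_b[0])])
```

A well-trained encoder should drive this score up; a score near zero means representations from different contexts are indistinguishable.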
Embedding similarity evaluation. To observe how sampled embeddings affect contextualized representation learning, we evaluate the embedding similarity between co-occurrent tokens. Motivated by the target that co-occurrent tokens should have similar representations, we use the cosine similarity between co-occurrent words labeled by humans (sampled from the WordSim353 dataset (Agirre et al., 2009)) as the evaluation metric. Figure 3(a) shows how the embedding similarity between co-occurrent tokens changes during training. The learning of contextualized representations heavily relies on embedding similarity. As we can see from Figure 3(a), the embedding similarity between co-occurrent tokens first decreases during the earliest stage of pre-training. This is because all embeddings are randomly initialized from the same distribution and the uniformity property of MLM pushes tokens away from each other, resulting in the decrease of embedding similarity. Meanwhile, the contextual score, i.e., the gap between intra-context similarity and inter-context similarity in Figure 3(b), does not increase at the earliest stage of training. This shows that random embeddings provide little help in learning contextual semantics. During 5K-10K iterations, only when embeddings become closer do contextualized representations in the same context begin to share similar features. At this stage, the randomly sampled embeddings from the same sentence, i.e., the same context, usually have similar representations, so MLM can push contextualized tokens closer to each other. We further verify the effect of embedding quality in Figure 4. To this end, we train two BERT models whose embedding matrices are frozen and initialized with the ones from different pre-training stages.
We can see that the model initialized with random embeddings fails to teach contextualized representations to attend to sentence meanings, and representations from different contexts have almost the same similarity. However, the variant with well-trained but frozen embeddings learns to distinguish different contexts early, at around 4k steps. These statistical observations verify that embedding anchors bring efficiency and effectiveness problems.
Surprisingly, embedding anchors reduce the global contextual information in contextualized representations at the later stage of training. Figure 3(a) shows that embedding similarity begins to drop after 8k steps, indicating that the model learns the specific meanings of co-occurrent tokens and begins to push them slightly apart. Since MLM adopts local anchors, these local embeddings push contextualized representations into different clusters, and the contextual score begins to decrease as well. This phenomenon demonstrates the embedding bias problem, where the learning of contextualized representations is driven by the selected embeddings while global contextual semantics are neglected.

Figure 4: The impact of embedding quality on the learning of contextualized representations. We train two BERT-small variants from scratch, whose embedding is either (a) randomly initialized and frozen or (b) copied from a normally pre-trained BERT at 250k steps and frozen.

Proposed Approach: TACO
To address the challenges of MLM, we propose a new method, TACO, which combines global anchors and local anchors. We first introduce TC, a token-alignment contrastive loss that explicitly models global semantics, in Section 3.1, and then combine TC with MLM to obtain the overall training objective of TACO in Section 3.2.

Token-alignment Contrastive Loss
To model global semantics, the objective should explicitly capture the information shared between the contextualized representations of tokens within the same context. A natural solution is therefore to maximize the mutual information of the contextual information hidden in contextualized representations from the same context. To extract shared contextual information, we first define a rule that generates the contextualized representation of a token by combining its embedding and the global information. Formally,

h_i = f(e_i, g),   (2)

where f is a probing algorithm, e_i is the embedding, and g is the global bias of a concrete context. In this paper, we adopt a straightforward probe to extract the global information hidden in a contextualized representation, namely the gap between the context-dependent and context-independent representations:

g = h - e.   (3)

Given the contextualized representations of a token x and its nearby tokens c in the same context, we use g_x and g_c to denote the global semantics hidden in these representations, and the quantity of interest is the mutual information I(g_x, g_c) between the two global biases. According to van den Oord et al. (2019), the InfoNCE loss serves as an estimator of the mutual information of x and c:

I(g_x, g_c) ≥ log K - L(g_x, g_c),

where L(g_x, g_c) is defined as:

L(g_x, g_c) = -E[ log ( f(g_x, g_c) / ( f(g_x, g_c) + \sum_{k=1}^{K} f(g_x, g_{c_k^-}) ) ) ],

where c_k^- is the k-th negative sample of x and K is the number of negative samples. Hence minimizing the objective L(g_x, g_c) is equivalent to maximizing a lower bound on the mutual information I(g_x, g_c). This objective contains two parts: positive pairs and negative pairs, which we specify below. A previous study (Chen et al., 2020) has shown that cosine similarity with temperature performs well as the score function f in the InfoNCE loss. Following it, we take

f(g_x, g_c) = exp( g_x^\top g_c / (τ ||g_x||_2 ||g_c||_2) ),

where τ is the temperature hyper-parameter and ||·||_2 is the ℓ2-norm.
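A minimal sketch of the temperature-scaled score function f and the InfoNCE objective L(g_x, g_c) described above; all names are ours and the snippet is illustrative rather than the paper's implementation.

```python
import numpy as np

def score_fn(a, b, tau=0.07):
    """Temperature-scaled cosine similarity score f."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.exp((a @ b) / tau)

def info_nce(g_x, g_c, negatives, tau=0.07):
    """InfoNCE loss for one anchor g_x with positive g_c and a list of
    negative global biases; minimizing it maximizes a lower bound on
    the mutual information I(g_x, g_c)."""
    pos = score_fn(g_x, g_c, tau)
    neg = sum(score_fn(g_x, g_k, tau) for g_k in negatives)
    return float(-np.log(pos / (pos + neg)))
```

For example, an anchor whose positive points in the same direction incurs a much smaller loss than one paired with an orthogonal positive, which is exactly the behavior the lower bound requires.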
Contextualized representation: To obtain the global biases g_x and g_c following Eq. 3, we adopt the widely-used Transformer (Vaswani et al., 2017) as the encoder and take the last hidden states as the contextualized representations h_x and h_c. Formally, given a batch of sequences {s_i}, i ∈ {1, ..., N}, we feed them into the Transformer encoder to obtain contextualized representations.

Positive pairs: Given each token x, we randomly sample a positive sample c from nearby tokens in the same context (sequence) within a window of size W centered at x.
Negative pairs: Given each token x, we randomly sample K tokens from the other sequences in the batch as negative samples c_k^-.

To sum up, the Token-alignment Contrastive (TC) loss is applied to every token in a batch as:

L_TC = - (1/N) \sum_{i=1}^{N} \sum_{j} log ( f(g_j^i, g_{j_c}^i) / ( f(g_j^i, g_{j_c}^i) + \sum_{k=1}^{K} f(g_j^i, g_{c_k^-}) ) ),

where N is the number of sequences in the batch; s_i is the i-th sequence; j and j_c index tokens in s_i with j_c ≠ j; and g_j^i is the global semantics hidden in the contextualized representation of the j-th token of s_i. g_j^i and g_{j_c}^i are generated via:

g_j^i = h_j^i - e_j^i,   g_{j_c}^i = h_{j_c}^i - e_{j_c}^i,

where h_j^i and e_j^i are the contextualized representation and static embedding of the anchor token, respectively, and h_{j_c}^i and e_{j_c}^i are those of the sampled positive token in the same context.
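Putting the pieces together, the TC loss over a toy batch can be sketched as below. The looped form, the window and negative sampling details, and all names are our own illustrative assumptions; an actual implementation would be vectorized.

```python
import numpy as np

def tc_loss(H, E, W=2, K=4, tau=0.07, seed=0):
    """Sketch of the token-alignment contrastive loss. H and E are the
    (N, L, d) last hidden states and static embeddings of a batch of N
    sequences of length L; the global bias of each token is g = h - e.
    For every token we sample one positive within a window of size W in
    the same sequence and K negatives from the other sequences."""
    rng = np.random.default_rng(seed)
    N, L, d = H.shape
    G = H - E                                   # global semantics per token

    def s(a, b):                                # temperature-scaled cosine score
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return np.exp((a @ b) / tau)

    losses = []
    for i in range(N):
        for j in range(L):
            # positive: a different token within the window around position j
            cands = [t for t in range(max(0, j - W), min(L, j + W + 1)) if t != j]
            jc = int(rng.choice(cands))
            # negatives: K tokens drawn from the other sequences in the batch
            others = [n for n in range(N) if n != i]
            pos = s(G[i, j], G[i, jc])
            neg = sum(s(G[i, j], G[int(rng.choice(others)), int(rng.integers(L))])
                      for _ in range(K))
            losses.append(-np.log(pos / (pos + neg)))
    return float(np.mean(losses))

rng = np.random.default_rng(1)
H = rng.normal(size=(2, 6, 8))    # batch of 2 sequences, length 6, hidden dim 8
E = rng.normal(size=(2, 6, 8))
loss = tc_loss(H, E)
```

Because the loss is defined for every token rather than only the 15% masked positions, it yields a much denser supervision signal than MLM.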

Training Objective
As described above, the token-alignment contrastive loss L_TC is designed to model global dependencies, while MLM captures local dependencies. Therefore, we can better model contextualized representations by combining the token-alignment contrastive loss L_TC and the MLM loss into our overall objective L_TACO:

L_TACO = L_MLM + L_TC.

Baselines. We mainly compare TACO with MLM on BERT-small and BERT-base models. In addition, we compare TACO with related contrastive methods: BERT-NCE, a sentence-level contrastive method, and INFOWORD, a span-based contrastive learning method, both from Kong et al. (2020). We directly compare TACO with the results reported in their paper.

Results on BERT-Small

Table 1 and Figure 5 show the results of TACO on BERT-small. As we can see, compared with MLM trained for 250k steps (its convergence point), TACO achieves comparable performance with only 1/5 of the computation budget. By modeling global dependencies, TACO significantly improves the efficiency of contextualized representation learning. In addition, when pre-trained for the same number of steps, TACO outperforms MLM by 1.2 average score on the validation set. Beyond convergence behavior, we also compare TACO and MLM on less training data. The results are shown in Table 2. We sample the 4 tasks with the largest amounts of training data for evaluation. As we can see, TACO trained on 25% of the data achieves results competitive with MLM trained on the full data. These results also verify the data efficiency of our method, TACO.

Results on BERT-Base
We also compare TACO with MLM on base-sized models, which are the most commonly used models according to download statistics from HuggingFace. In addition, as shown in Table 4, TACO achieves competitive results compared to BERT-NCE and INFOWORD, the two similar contrastive methods.

TACO and MLM
To better understand how TACO works, we conduct a quantitative comparison of the learning dynamics of BERT and TACO. Similar to Section 2.2, we plot the cosine similarity among contextualized representations of tokens in the same context (intra-context) and different contexts (inter-context) in Figure 6. We find that the learning dynamics of TACO differ significantly from those of MLM. Specifically, for TACO, the intra-context representation similarity remains high and the gap between intra-context similarity and inter-context similarity remains large at the later stage of training. This confirms that TACO better captures global semantics, which may contribute to its superior downstream performance.

Ablation Study
TACO is implemented as a token-level contrastive (TC) loss alongside the MLM loss. Therefore, the improvement of TACO might come from two aspects: 1) denser supervision signals from the all-token objective, and 2) the benefit of the contrastive loss in strengthening global dependencies. It is helpful to figure out which factor matters more. To this end, we design two variants for ablation. One is a concentrated TACO, where the contrastive loss is built on the 15% masked positions only, keeping the same density of supervision signal as MLM. The other is an extended MLM, where not only the 15% masked positions but also the remaining 85% unmasked positions are asked to predict the original tokens. The extended MLM has the same dense supervision as TACO but loses the benefit of modeling global dependencies. The results on small models are shown in Figure 6.
As we can see, the performance of TACO decreases if we apply the TC objective to only a sampled subset of token positions, which shows that more supervision signals benefit the final performance of TACO.

Related Work

Language Representation Learning
Classic language representation learning methods (Mikolov et al., 2013; Pennington et al., 2014) aim to learn context-independent representations of words, i.e., word embeddings. They generally follow the distributional hypothesis (Harris, 1954). Recently, the pre-train-then-fine-tune paradigm has become common practice in NLP due to the success of pre-trained language models like BERT (Devlin et al., 2019). Context-dependent (or contextualized) representations are the basic characteristic of these methods. Many existing contextualized models are based on the masked language modeling objective, which randomly masks a portion of tokens in a text sequence and trains the model to recover the masked tokens. Previous studies show that pre-training with the MLM objective helps models learn syntactic and semantic knowledge (Clark et al., 2019). There have been numerous extensions to MLM. For example, XLNet (Yang et al., 2019) introduced the permuted language modeling objective, which predicts the words one by one in a permuted order. BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) investigated several denoising objectives and pre-trained encoder-decoder architectures with the masked span infilling objective. In this work, we focus on the key MLM objective and aim to explore how it helps learn contextualized representations.

Contrastive-based SSL
Apart from denoising-based objectives, contrastive learning is another promising way to obtain self-supervision. In contrastive-based self-supervised learning, models are asked to distinguish positive samples from negative ones for a given anchor. Contrastive-based SSL was first introduced in NLP for efficient learning of word representations via negative sampling, i.e., SGNS (Word2Vec (Mikolov et al., 2013)). Later, similar ideas were brought into the CV field for learning image representations and became prevalent, e.g., MoCo (He et al., 2020), SimCLR (Chen et al., 2020), and BYOL (Grill et al., 2020).
In the last two years, there have been many studies applying contrastive learning to contextual representation learning in NLP. For instance, CERT (Fang et al., 2020) utilized back-translation to generate positive pairs. CAPT (Luo et al., 2020) applied masks to the original sentence and treated the masked sentence and its original version as a positive pair. DeCLUTR (Giorgi et al., 2020) samples nearby, even overlapping, spans as positive pairs. INFOWORD (Kong et al., 2020) treated two complementary parts of a sentence as a positive pair. However, the aforementioned methods mainly focus on sentence-level or span-level contrast and may not provide dense self-supervision to improve efficiency. Unlike these approaches, TACO regards the global semantics hidden in the contextualized token representations of the same context as positive pairs. The token-level contrastive loss can be built on all input tokens, which provides a dense self-supervised signal.
Another related work is ELECTRA (Clark et al., 2020). ELECTRA samples machine-generated tokens from a separate generator and trains the main model to discriminate between machine-generated tokens and original tokens. ELECTRA implicitly treats the fake tokens as negative samples of the context, and the unchanged tokens as positive samples. Unlike this method, TACO does not require architectural modifications and can serve as a plug-and-play auxiliary objective, largely improving pre-training efficiency.

Conclusion
In this paper, we propose a simple yet effective objective for learning contextualized representations. Taking MLM as an example, we investigate whether and how current language model pre-training objectives learn contextualized representations. We find that the MLM objective mainly relies on local anchors to align contextualized representations, which harms the modeling of global dependencies due to an "embedding bias" problem. Motivated by these problems, we propose TACO to directly model global semantics; it can be easily combined with existing LM objectives. By combining local and global anchors, TACO achieves up to a 5× speedup and up to a 1.2-point improvement in GLUE score. This demonstrates the potential of TACO as a plug-and-play approach to improve contextualized representation learning.

A.1 Pre-training Hyper-parameters
All pre-training approaches involved in the experiments use the same pre-training hyper-parameters, except for BERT-NCE and INFOWORD, whose results are cited directly from the original paper (Kong et al., 2020). Following prior work, we do not use the next sentence prediction (NSP) objective and use dynamic masking for MLM with a 15% mask ratio, where the masked positions are decided on the fly.
TACO introduces three extra hyper-parameters: the negative sample size K, the positive sample window size W, and the temperature τ. We set the temperature τ to a small value, 0.07, following Fang et al. (2020). By searching for the best K in {10, 50} and W in {3, 5, 10, 50} on the small TACO model, we found that TACO with K=50 and W=5 performs best, so we also apply these hyper-parameter choices to base-sized TACO. The full set of pre-training hyper-parameters is listed in Table 5. TACO outperforms MLM in most cases in our preliminary experiments; however, we also find some extreme cases that can harm its effectiveness. If the negative sample size K is too small, e.g., smaller than 10, the performance of TACO degenerates nearly to the level of the BERT baseline. Similar observations are reported in related work (He et al., 2020; Chen et al., 2020). Likewise, if the positive window size W is too large, e.g., larger than 50, the performance of TACO also degrades. We suspect that an over-large positive window brings more false-positive samples, which makes the sequence meaning ambiguous and thus harms performance.

A.2 Fine-tuning Details
For small-sized models, we fine-tune all saved checkpoints (5k, 10k, 20k, 30k, 40k, 50k, 100k, 150k, 200k, and 250k steps) of the different pre-trained models (TACO and its ablations) with the same hyper-parameters on each task. Considering the large number of pre-training checkpoints, we adopt the default fine-tuning hyper-parameters and repeat fine-tuning 4 times with different random seeds. The best-performing fine-tuned models on the validation sets are then used for testing. This setting ensures a fair comparison among models and avoids a large number of grid-search runs. The task-specific hyper-parameters for small-sized models are listed in Table 7, and the general fine-tuning hyper-parameters in Table 6. For base-sized models, we save checkpoints at 100k, 250k, and 500k steps. During fine-tuning, we conduct multiple runs with different task-specific hyper-parameter combinations, as shown in Table 8. Concretely, we randomly sample 6 different hyper-parameter combinations and report the average score for validation results. We then select the best-performing run of the 500k-step (converged) checkpoint for testing.

A.3 Statistic Details
Embedding Similarity We calculate the cosine similarity of 20 randomly sampled pairs of frequently co-occurrent words from the WordSim353 dataset (Agirre et al., 2009), labeled by human annotators, to plot the average similarity curve in Figure 3(a). The corresponding embeddings are obtained from the embedding layer of the BERT model and the variant models mentioned in Section 2.2.
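This metric can be sketched as follows, assuming the embeddings are available as a NumPy matrix; `embedding_similarity` and the index pairs are illustrative names, not code from the paper.

```python
import numpy as np

def embedding_similarity(emb, pair_ids):
    """Average cosine similarity over sampled pairs of co-occurrent words,
    computed from the model's embedding layer. `emb` is a (V, d) embedding
    matrix and `pair_ids` holds (i, j) vocabulary indices."""
    sims = []
    for i, j in pair_ids:
        a, b = emb[i], emb[j]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```

For example, pairs of parallel embedding rows score 1.0 regardless of magnitude, while orthogonal rows score 0.0, so the curve tracks directional agreement rather than vector norms.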
Intra-/Inter-context Similarity For every token w_i in the corpus, we randomly sample a positive token w_j (j ≠ i) from the same context (sentence) and another token w_k from other sentences. As mentioned in Section 2.2, we take BERT (Devlin et al., 2019) as our encoder and obtain contextualized representations from the last hidden states h. We mainly adopt cosine similarity as the measurement and calculate the average intra-context similarity (between h_i and h_j) and the average inter-context similarity (between h_i and h_k) over all tokens in the corpus. Note that we do not use any masks when generating a token's contextualized representation for these statistics.
Other Measurements Although the statistics above are mainly based on cosine similarity, we observe the same findings for MLM under other measurements. We tried other similarities and distances, e.g., L1 distance, L2 distance, and L10 distance, to evaluate the discrepancy between contextualized representations from the same context and from different contexts. Specifically, we compute the intra-context and inter-context statistics under each measurement at different pre-training checkpoints, then calculate the ratio of the intra-context measurement over the inter-context one. Table 9 shows the results. As we can see, when the ratio of L1 distance decreases, the ratios of cosine similarity and dot-product similarity increase, and vice versa.
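The ratio statistic can be sketched as follows; the `measure` callables and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def intra_inter_ratio(intra_pairs, inter_pairs, measure):
    """Ratio of the average intra-context measurement over the average
    inter-context one, for an arbitrary similarity or distance callable
    `measure` applied to (vector, vector) pairs."""
    intra = np.mean([measure(a, b) for a, b in intra_pairs])
    inter = np.mean([measure(a, b) for a, b in inter_pairs])
    return float(intra / inter)

l1 = lambda a, b: float(np.abs(a - b).sum())
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that the two families move in opposite directions: context-clustered representations push the similarity ratio above 1 while pulling the distance ratio below 1, which is the pattern described above.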

B Extra Experiments
In the standard implementation of BERT, the parameters of the input embeddings are shared with the output embeddings. All experiments and analyses in this paper are based on this assumption. To further confirm the effectiveness of TACO, we conduct extra experiments without embedding sharing on BERT-small. The results are shown in Table 10. Unexpectedly, the variants without embedding sharing perform worse than their counterparts, due to the lack of regularization from weight sharing. From the results, we can see that TACO without embedding sharing performs slightly worse than TACO with embedding sharing. However, it still outperforms MLM by 0.9 average GLUE score at convergence. These results prove the effectiveness of TACO even when embeddings are not shared.

Table 9: The ratio of intra-context measurement over inter-context measurement during pre-training. We list two distance measurements and three similarity measurements here.

Table 10: Results on GLUE validation set with small-sized models. For models without embedding sharing, we run 3 experiments with different random seeds for each task and report the average score.