Language Model Pre-Training with Sparse Latent Typing

Modern large-scale Pre-trained Language Models (PLMs) have achieved tremendous success on a wide range of downstream tasks. However, most LM pre-training objectives focus only on text reconstruction and have not sought to learn latent-level interpretable representations of sentences. In this paper, we push language models to obtain a deeper understanding of sentences by proposing a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with such an objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings. Our code is publicly available at https://github.com/renll/SparseLT.


Introduction
Transformer-based Pre-trained Language Models (PLMs) have achieved significant success on a wide range of NLP tasks. However, typical pre-training objectives for PLMs only teach the model to directly reconstruct text-level words or sentences, and have not sought to obtain deeper sentence understanding by learning latent-level interpretable representations. For example, transformer-decoder models like the OpenAI GPT series (Radford et al., 2018, 2019; Brown et al., 2020) use the task of language modeling for pre-training, and transformer-encoder models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are trained by predicting the masked tokens within a sentence.

Figure 1: A general illustration of our approach to teach a pre-trained language model to extract sentence-level keywords with latent type representations in a completely self-supervised manner.
Both of these training objectives merely train the models to recover masked tokens or predict the next words or sentences, while failing to learn latent-level representations of sentences that could be useful both for better language understanding and for downstream tasks. Pre-training a language model to learn latent representations is extremely hard. First, there are no ground-truth labels for the latent representations that could be used for reliable supervised learning. During pre-training, the model is only given an unlabeled text corpus over which to identify latent representations such as sentence-level keywords and structures. This means the training process must be strictly self-supervised (Rush et al., 2018). Furthermore, to be interpretable, the latent representations for natural language texts should be discrete, which further complicates the design of a completely differentiable training framework.
To push language models to learn a deeper understanding of sentences, in this paper we propose a novel pre-training framework, Sparse Latent Typing, that enables the language model to sparsely extract sentence-level keywords with meaningful latent types. We tackle all the above-mentioned challenges: our framework is fully differentiable and completely self-supervised. As shown in Figure 1, given an input sentence from the pre-training corpus, we introduce a latent typing mechanism that jointly selects and classifies keywords from the sentence into a category of randomly initialized latent types. We implement this latent classification model based on Gumbel sampling (Jang et al., 2017) to make sure the overall pre-training framework is differentiable. Since there are no ground-truth labels available for the selected keywords and latent types, we incorporate a one-layer transformer decoder into the training pipeline to map the fused token and latent type representations back to the original sentence, and use the sentence reconstruction loss to ensure the latent representations remain adequately informative.
Our approach provides the decoder model with a shortcut to directly access the encoded token representations, so that the latent representation for each input token can be learned as an auxiliary type representation. For the pre-training objectives, in addition to minimizing the sentence reconstruction error, we introduce a novel typing sparsity loss to minimize the number of token representations selected for latent typing. A KL-divergence based diversity loss is also proposed to encourage a diverse selection of the latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with this objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings.
In summary, our contributions are three-fold: • We propose a fully differentiable language model pre-training framework that enables the model to sparsely extract sentence-level keywords with latent types in a completely self-supervised manner.
• We provide comprehensive analysis and interpretation for our experimental results showing that the pre-trained model is able to extract meaningful latent type representations.
• Extensive experiments on IE-related downstream tasks demonstrate that our proposed pre-training framework can significantly advance the state of the art.

Related Work
Knowledge-Enhanced Language Models As pre-trained language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Brown et al., 2020; Lewis et al., 2020a; Raffel et al., 2020) are achieving great success on downstream NLP tasks, many research studies focus on how to make these PLMs more knowledgeable. Previous studies (Peters et al., 2019; Zhang et al., 2019; Xiong et al., 2020; He et al., 2020; Yamada et al., 2020; Qin et al., 2021; Wang et al., 2021) either design entity- and relation-aware pre-training objectives, or modify the model architecture to make it capable of fusing both text and entity information. However, all of these previous approaches rely on large-scale, human-annotated, semi-structured external resources (e.g., Wikipedia). In comparison, our method is completely self-supervised and only needs a text corpus for pre-training, focusing instead on encouraging the model to learn knowledge clusters at a latent level.
Latent Structure Learning There are also several studies (Liu et al., 2021; Subramani et al., 2022) that explore learning latent-level representations of text jointly with language models.

Problem Formulation

Given an input sentence s from an unlabeled corpus, a text encoder f first maps s into a sequence of token representations. A latent type classifier h then sparsely selects a subset of token representations and assigns each selected token a latent type, yielding the typing pairs (Z, X). Each of the token types z_i is selected from a latent embedding space C consisting of V_c = |C| different latent vectors. The text decoder g : (Z, X) → S then reconstructs the original sentence s through the pair of latent types and selected token representations (Z, X).
The objective of sparse latent typing is to find pairs of latent types and token representations that are as compact as possible while still containing the information necessary for reconstructing the original input sentences. Formally, we want to minimize the following joint objective:

$$\min_{f,g,h}\; \mathcal{L}_{rec}\big(g(\mathcal{Z}, \mathcal{X}),\, s\big) + T, \quad \text{with } T = |(\mathcal{Z}, \mathcal{X})|, \qquad (1)$$

where T is the number of the selected token representations. While the reconstruction loss can be optimized in a fully differentiable way with Gumbel-Softmax (Jang et al., 2017), the remaining optimization problem is how to minimize the non-differentiable term T to encourage the sparse selection of the token representations.

Learning Sparse Latent Types
To tackle the non-differentiability of the number of selected typing pairs T = |(Z, X)|, we first take a closer look at the latent type classifier h, which decides the latent type z_i of each token representation x_i. Our insight is that we can regard the action of not selecting a token representation as assigning it a frozen zero type vector c_1 = 0 ∈ C. We then perform an element-wise multiplication between z_i and x_i to obtain the representations x̃_i = x_i ⊗ z_i that are fed into the text decoder g. The advantages of this approach are that (1) the element-wise multiplication naturally prevents the gradient from being propagated to the token representations that are classified as the zero type vector c_1, and (2) the element-wise multiplication directly modulates the gradients of the token representations with the latent type vectors. This can in principle provide better guidance to the text encoder with the information of the latent vectors than other vector fusion operators such as element-wise addition or vector concatenation. Based on this framework, we develop a novel typing sparsity loss in Section 4.2 to approximately minimize the typing pair count T. While our approach is generally applicable to any text encoder and decoder, the specific neural architectures used in this work are discussed in Section 5.1.
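To make the gradient-blocking effect of the zero type concrete, the short sketch below (illustrative PyTorch code, not the authors' released implementation) shows that fusing a token representation with the frozen zero type vector c_1 = 0 by element-wise multiplication leaves that token with zero gradient:

```python
import torch

x = torch.randn(4, requires_grad=True)  # a token representation x_i
c1 = torch.zeros(4)                      # frozen zero type vector c_1 = 0
fused = x * c1                           # element-wise fusion x_i ⊗ z_i
fused.sum().backward()                   # backpropagate through the fusion
print(x.grad)                            # tensor([0., 0., 0., 0.]): gradient blocked
```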
In our framework, the latent type classifier h is simplified as a mapping h′ : X → Z that only outputs the latent types z_i for each token representation. The simplified text decoder g′ : X̃ → S then only needs to model the fused representation space X̃ = Z ⊗ X for sentence reconstruction, where ⊗ is the vector fusion operator and denotes element-wise multiplication in this work. The proposed architecture for sparse latent typing is illustrated in Figure 2 and further explained in the following subsections.

Gumbel Latent Typing
Given the token representations X ∈ R^{N×d_m} generated from the text encoder, where N is the number of input tokens and d_m is the length of the token representation vectors, our Gumbel latent type classifier first maps X into logits L ∈ R^{N×V_c} with a weight matrix W ∈ R^{d_m×V_c}, and then outputs the probabilities P_{i,v} of choosing the v-th latent type for each token representation x_i:

$$P_{i,v} = \frac{\exp\big((L_{i,v} + G_{i,v})/\tau\big)}{\sum_{v'=1}^{V_c} \exp\big((L_{i,v'} + G_{i,v'})/\tau\big)},$$

where G_{i,v} ∼ Gumbel(0, 1) is Gumbel noise sampled from a standard Gumbel distribution and τ is a non-negative temperature, following previous efforts on the Gumbel softmax operation (Jang et al., 2017; Maddison et al., 2016). We use Gumbel softmax for our latent type classifier because it enables choosing a latent type representation in a fully differentiable way, which further facilitates our design of a sparsity loss that approximately minimizes the number of typing pairs T. With the Gumbel decision probabilities P ∈ R^{N×V_c}, the latent type representations Z ∈ R^{N×d_m} are obtained through a marginalization of P over the latent type embeddings C ∈ R^{V_c×d_m}, i.e., Z = PC, where c_1 is the zero type vector. The final fused representation is obtained by an element-wise multiplication, X̃ = X ⊗ Z. Intuitively, if P_i is entirely concentrated on c_1 as τ → 0, i.e., P_{i,1} = 1, we effectively eliminate token w_i (or its representation x_i).
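The following sketch summarizes the Gumbel latent typing computation in PyTorch. It is a minimal illustration under our notation (the module name and implementation details are ours, not the released code): logits L = XW, Gumbel-softmax decision probabilities P, marginalized type representations Z = PC with a frozen zero type c_1 = 0, and the fused output X̃ = X ⊗ Z.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelLatentTyper(nn.Module):
    """Minimal sketch of the Gumbel latent type classifier (illustrative)."""

    def __init__(self, d_model: int, num_types: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_types, bias=False)      # W: d_m x V_c
        # Latent type embeddings; index 0 is reserved for the frozen zero type c_1.
        self.type_emb = nn.Parameter(torch.randn(num_types - 1, d_model))

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        # x: (N, d_m) token representations from the text encoder
        logits = self.proj(x)                                      # L: (N, V_c)
        p = F.gumbel_softmax(logits, tau=tau, dim=-1)              # P: (N, V_c)
        zero = torch.zeros(1, x.size(-1), device=x.device)         # c_1 = 0, frozen
        c = torch.cat([zero, self.type_emb], dim=0)                # C: (V_c, d_m)
        z = p @ c                                                  # Z = PC: (N, d_m)
        return x * z, p                                            # fused X ⊗ Z, and P
```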
During the evaluation stage of our latent type classifier, the latent type embedding with the largest logit score is selected as the latent type representation for each token vector x_i. To alleviate the discrepancy between training and evaluation, we adopt the temperature annealing trick (Jang et al., 2017) for the Gumbel-Softmax to obtain a better differentiable approximation of the argmax operator.

Training Objectives
Based on our problem formulation, we adopt three types of training losses for the end-to-end training of our model: (1) a Typing Sparsity Loss that encourages the latent type classifier to choose more zero types, (2) a KL-Divergence term with respect to a uniform prior distribution that encourages a diverse selection of the latent types, and (3) a Reconstruction Loss that ensures the latent representations maintain the essential information of the input text.
Typing Sparsity Loss An important property of Gumbel-Softmax is that as the temperature τ → 0, the decision probability P_i ∈ R^{V_c} tends to a one-hot index vector sampled from the underlying categorical distribution

$$\tilde{P}_{i,v} = \frac{\exp(L_{i,v})}{\sum_{v'=1}^{V_c} \exp(L_{i,v'})},$$

where L ∈ R^{N×V_c} are the logits before the Gumbel-Softmax. This means that we can control the decision behavior of the model by modulating the shape of this categorical distribution. Therefore, our typing sparsity loss is designed as the negative log-likelihood of the globally averaged probability of choosing the zero type c_1:

$$\mathcal{L}_{sparsity} = -\log\Big(\frac{1}{N}\sum_{i=1}^{N} \tilde{P}_{i,1}\Big),$$

where N is the number of tokens in the input sentence. Intuitively, if P̃_i converges to a one-hot vector, then P̃_i(C = 1) ∈ {0, 1}, and Σ_{i=1}^N P̃_i(C = 1) = N − T becomes the number of tokens that are not selected for typing, which is equivalent to what we want to maximize in the problem formulation of Equation (1).
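A minimal sketch of this loss in PyTorch (illustrative code; the epsilon for numerical stability is our addition):

```python
import torch

def typing_sparsity_loss(logits: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # logits: (N, V_c) pre-Gumbel logits L; index 0 is the zero type c_1
    p_tilde = logits.softmax(dim=-1)   # underlying categorical distribution
    p_zero = p_tilde[:, 0].mean()      # globally averaged prob. of the zero type
    return -torch.log(p_zero + eps)    # negative log-likelihood
```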

KL-Divergence
To encourage a diverse selection of the latent types, we assume a uniform prior distribution over the latent type representations, p(z) = 1/V_c. The KL-divergence term is calculated between the globally averaged probability P̄_v and the uniform prior, i.e.,

$$\mathcal{L}_{KL} = \mathrm{KL}\big(\bar{P}\,\|\,p(z)\big) = \sum_{v=1}^{V_c} \bar{P}_v \log\big(V_c\, \bar{P}_v\big),$$

where V_c is the number of latent types.
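A corresponding sketch of the diversity term (again illustrative, not the authors' code; we average the Gumbel decision probabilities over tokens before comparing against the uniform prior):

```python
import torch

def type_diversity_kl(p: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # p: (N, V_c) Gumbel decision probabilities
    num_types = p.size(-1)                 # V_c
    p_bar = p.mean(dim=0)                  # globally averaged probability P̄_v
    # KL(P̄ || uniform) = sum_v P̄_v * log(V_c * P̄_v)
    return (p_bar * torch.log(num_types * p_bar + eps)).sum()
```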
Reconstruction Loss Our reconstruction loss directly follows the problem formulation, i.e.,

$$\mathcal{L}_{rec} = \frac{1}{B}\sum_{b=1}^{B} \mathcal{L}_{CE}\big(g'(\tilde{X}_b),\, s_b\big),$$

where B is the batch size of the sampled text data and L_CE is the token-level cross-entropy.
When pre-training a masked language model, we also include a Masked Language Modeling loss following BERT (Devlin et al., 2019),

$$\mathcal{L}_{MLM} = \mathcal{L}_{CE}\big(f(\tilde{s}),\, s\big), \qquad (2)$$

computed over the masked positions, where f is the text encoder and s̃ is the corrupted sequence.
The total loss function is a weighted sum of the above four losses,

$$\mathcal{L} = \mathcal{L}_{MLM} + \alpha \mathcal{L}_{rec} + \beta \mathcal{L}_{sparsity} + \gamma \mathcal{L}_{KL}, \qquad (3)$$

where α, β, γ ∈ R≥0 are weighting factors.
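Putting the pieces together, training reduces to a weighted sum of the four terms. A sketch is given below; the pairing of β with the sparsity term and γ with the KL term follows the ablations later in the paper, while the pairing of α with the reconstruction term is our reading of the loss definitions, and the default weights are the values used in our main experiments:

```python
def total_loss(mlm_loss, rec_loss, sparsity_loss, kl_loss,
               alpha: float = 0.05, beta: float = 0.05, gamma: float = 0.1):
    # Weighted sum of the four training objectives, cf. Equation (3).
    return mlm_loss + alpha * rec_loss + beta * sparsity_loss + gamma * kl_loss
```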

Experiments
In our experiments, we first conduct an intrinsic evaluation to investigate whether the model can successfully learn word selections with meaningful latent types during pre-training. Then, we apply our model to both supervised and few-shot IE tasks to evaluate the effectiveness of our pre-training framework on downstream tasks.

Sparse Latent Type Learning
Pre-training Setup We adopt the VOA corpus constructed by Li et al. (2020) for sparse latent type pre-training, which was extracted from 108,693 multimedia news articles openly available on the Voice of America website between 2006 and 2017. We use the bert-base-uncased version of the BERT (Devlin et al., 2019) model as our encoder, and a single transformer decoder layer to reconstruct the sentence, following Kasai et al. (2020) and Montero et al. (2021). While our approach is generally applicable to both encoder-only Masked Language Models (MLM) and encoder-decoder denoising language models (e.g., BART (Lewis et al., 2020b)), we focus on MLM because it is more widely used in downstream information extraction tasks. Implementation details can be found in Appendix A.

Downstream Evaluation
To evaluate the validity of our latent typing approach, we fine-tune our pre-trained model on downstream tasks. We focus on Information Extraction specifically, and adopt Supervised Joint Information Extraction (Lin et al., 2020) and Few-shot Named Entity Recognition (Ding et al., 2021) as two typical IE tasks to evaluate our model in both supervised and few-shot settings. We initialize the BERT model with bert-base-uncased weights and continue pre-training for 100,000 steps using the combined loss defined in (3) on the VOA corpus. Since we only focus on evaluating our model on IE tasks, only the pre-trained text encoder is used for fine-tuning on the downstream tasks. More details can be found in Appendix A.

Supervised Information Extraction
Datasets We evaluate our pre-trained model on the English subset of the ACE-2005 dataset and the ERE dataset, which are the most widely used event-centric IE datasets containing annotations for extracting entities, events, and relations. Following the preprocessing steps and dataset splits in Lin et al. (2020), we keep 7 entity types, 6 relation types, 33 event types, and 22 argument roles for the ACE-2005 dataset, and 7 entity types, 5 relation types, 38 event types, and 20 argument roles for the ERE dataset. More detailed dataset statistics are shown in Table 1.

Baselines We compare the performance of fine-tuning BERT on supervised IE with the following pre-training approaches: 1) BERT-Vanilla: we directly use the bert-base-uncased checkpoint to fine-tune on the supervised IE tasks, which is also what the baseline models do in OneIE (Lin et al., 2020). 2) BERT-MLM: we initialize the BERT model with the bert-base-uncased checkpoint and then continue pre-training on the VOA corpus for 100,000 steps using only the masked language modeling loss L_MLM defined in (2). 3) BERT-SparseLT: our proposed approach. We pre-train the BERT model from the bert-base-uncased checkpoint on the VOA corpus for 100,000 steps, encouraging the model to learn sparse latent types using the loss function defined in (3), with the hyperparameters α = 0.05, β = 0.05, γ = 0.1. We did not compare our model with knowledge-enhanced pre-trained language models like ERNIE (Zhang et al., 2019) and ERICA (Qin et al., 2021), since those models rely on large-scale external knowledge resources, whereas our approach is completely self-supervised.

Results We report the F1 scores for four different IE subtasks on both the ACE-2005 and ERE datasets: Entity Extraction, Relation Extraction, Event Detection, and Event Argument Role Labeling; the results are shown in Table 2. In general, our BERT-SparseLT model has the best performance among all competitors, even better than the OneIE model, which uses global features.

Analysis
In this section, we address the following research questions on Sparse Latent Typing.
How are the latent types distributed over the encoded token representations? We draw a t-SNE (van der Maaten and Hinton, 2008) plot of the encoded token representations x of 1,000 sentences (30,792 tokens in total) randomly sampled from the pre-training corpus in Figure 3. The token representations are colored with their corresponding latent type indices. From the figure, we can observe that the token representations begin to be distributed as individual islands of the same colors after 300k steps of pre-training from scratch. This implies that our sparse latent typing objective can effectively encourage the clustering of the token representations in a small latent space defined by 64 randomly initialized type embeddings. We can also observe a similar trend of latent clustering for the BERT-SparseLT model, which is illustrated in Appendix B, Figure 4.
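Such a visualization can be produced with a standard t-SNE projection; the snippet below is a sketch under the assumption that x is a NumPy array of encoded token representations and types holds the predicted latent type index of each token (variable names are ours):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# x: (num_tokens, d_m) array of encoded token representations
# types: (num_tokens,) integer latent type index per token
emb_2d = TSNE(n_components=2, init="pca").fit_transform(x)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=types, cmap="tab20", s=2)
plt.title("Token representations colored by latent type")
plt.show()
```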
Can the Typing Sparsity Loss effectively control the sparsity of token selections? We pre-trained three BERT-SparseLT models with different weighting factors β of the typing sparsity loss to investigate its influence on latent type selections and sentence reconstruction (additional samples for β = 0.05 are included in Appendix B, Table 14). From Table 4, we can observe that as β increases, the model selects fewer tokens with non-zero types. The corresponding sentence reconstruction also degenerates significantly as fewer tokens are selected by our Gumbel latent type classifier. This means that our proposed typing sparsity loss can effectively control the number of typed tokens and thus affect the quality of reconstructed sentences. We also ran the same experiments for models pre-trained from scratch (Appendix B, Table 11) to show that this sparse selection behavior is independent of the initialization of the model parameters.
To what extent are the learned latent types interpretable? In Table 5, we report the frequencies of the top-10 most selected latent types v and the probability P(x|z = v) of the top-5 most frequent tokens tagged by type v. The statistics are computed over 100 randomly sampled sentences from the pre-training VOA corpus with the BERT-SparseLT model. We can observe that the zero type (index 0) is mostly associated with less meaningful tokens such as ",", "[CLS]", and "[SEP]", which are used for delimiting the semantics. This means that the model can effectively learn to select more meaningful tokens through our sparse latent typing objective. We can also observe that type 61 seems more correlated with physical locations, while type 50 is mostly related to functional words. We also run the same analysis for the model pre-trained from scratch in Appendix B, Table 12; those results appear to be more interpretable, due to more consistent training dynamics compared with continual pre-training from a checkpoint.

How could different loss combinations affect the model performance?
We conduct an ablation study of the loss weighting factors of the BERT-SparseLT model to illustrate the influence of different loss combinations on few-shot NER performance in Table 7. Including all four losses produces the best test performance on the 5-way 1∼2 shot evaluation of the Intra setting of the Few-NERD benchmark. This indicates that all the training objectives are necessary for improving the generalizability of the token representations learned by the encoder. We also include the reconstruction and latent typing results of the models trained with α = 0.05 in Appendix B, Table 13 for further qualitative analysis.
Can sparse latent typing improve sentence-level Natural Language Understanding (NLU) ability? We evaluate the BERT-Vanilla, BERT-MLM, and BERT-SparseLT models on the General Language Understanding Evaluation (GLUE) benchmark to measure the influence of sparse latent typing on NLU; the results are shown in Table 6. The fine-tuning hyperparameters are shown in Appendix B, Table 9. Our BERT-SparseLT model obtains slightly worse results than the vanilla BERT, but still shows a marginal improvement over the MLM baseline that excludes the sparse latent typing objectives. The inferior results are expected for two reasons: 1) The evaluation on the GLUE benchmark relies heavily on the [CLS] token, which is always assigned the zero type by the BERT-SparseLT model and thus lacks sufficient training signal for fine-grained clustering in the latent space.
2) The VOA corpus for continual pre-training is in the specific news domain, which may not be beneficial for general NLU. We hypothesize that large-scale pre-training of an encoder-decoder model from scratch should overcome these limitations, and we leave this as future work due to the limitation of our computational resources.

Conclusion
In this paper, we propose a novel language model pre-training framework that encourages the model to sparsely extract sentence-level keywords with meaningful latent types in a completely self-supervised manner. Experimental results and analysis demonstrate that incorporating sparse latent type learning early in the pre-training stage not only helps the model learn sentence-level keyword selections with interpretable latent types, but also improves downstream Information Extraction tasks in both supervised and few-shot settings.

Limitations
One primary limitation of our framework is that a language model pre-trained with sparse latent typing might only improve performance on Information Extraction tasks. Although this is intuitive since IE shares essential similarities with latent typing, it would be exciting to see whether our model can improve other downstream tasks such as natural language generation and abstractive summarization. Another limitation of our work is that, due to the lack of computational resources, we did not conduct experiments with large-scale pre-training from scratch for a comprehensive examination of our framework's ability to improve general NLU performance. For future work, it is also worth exploring whether the sparse latent typing objective can improve machine translation performance by regularizing a sparse and unified latent space for cross-lingual meaning representations. Finally, our model is only capable of extracting sentence-level keywords with latent types, and is not designed to learn a comprehensive graph structure for each input sentence. Although it is debatable whether a more complex latent graph representation is better than concise latent types, this is still worth adding to future work plans.
We also thank Suyu Ge for early discussions on this project.
Appendix A: Implementation Details

A learning rate schedule with warm-up is adopted. For Gumbel latent typing, we adopt the following temperature annealing schedule: τ = max(5 × 0.99997^T, 0.5), where T is the number of training steps. We continually pre-train a BERT-base model from the bert-base-uncased checkpoint for 100k steps. For the experiment of pre-training from scratch, we only train a RoBERTa-large (Liu et al., 2019) model on the VOA corpus for 300k steps, given the limitation of our computational resources. The detailed hyper-parameters are summarized in Table 8 and Table 10.
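In code, the annealing schedule is a one-liner (a direct transcription of the formula above; the function name is ours):

```python
def gumbel_temperature(step: int) -> float:
    # tau = max(5 * 0.99997^T, 0.5), where T is the training step
    return max(5.0 * (0.99997 ** step), 0.5)
```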
For fine-tuning on downstream tasks, we replace the BERT model used in the state-of-the-art systems with our pre-trained BERT-SparseLT model and follow the same hyper-parameter settings. Specifically, for supervised IE, we use the codebase from OneIE (Lin et al., 2020), and for few-shot IE, the codebase from CONTaiNER (Das et al., 2022) is adopted.
Table 13: The latent typing and reconstruction results of various BERT-SparseLT models continually pre-trained with different α, β, and γ values. The input sentence is "She was murdered in her New York office, just days after learning that Waitress had been accepted into the Sundance Film Festival." From the table, we can see that removing the typing sparsity loss (β = 0) causes the model to select all the input tokens, and removing the KL-divergence term (γ = 0) causes the model to either not select any tokens (assigning all tokens the zero type) or assign almost the same latent type (e.g., type 4) to all the input tokens. The corresponding latent type indices for each token are noted in parentheses.

Table 14: Sample latent typing and sentence reconstruction results (table columns: Input Tokens, Latent Typing, Reconstructed Sentence) of the continually pre-trained BERT-SparseLT model for both in-domain and out-of-domain sentences. The in-domain sentences are sampled from the VOA corpus and Wikipedia, while the out-of-domain sentences are from the main content of this paper. A sample in-domain input: "A 45-year-old man who was tackled down and arrested by police after allegedly attacking officers with a knife in South Roxana, Illinois, has been charged by the local prosecutors."

Figure 2: The proposed architecture for pre-training a language model with Gumbel Latent Typing, where d_m is the length of the token representation vectors, V_c is the pre-defined number of latent types, and ⊙ denotes matrix multiplication. The white block in the latent type embedding is the zero type vector.

Figure 3: The t-SNE visualization of the encoded token representations x with the corresponding latent type indices after 300k steps of pre-training with the sparse latent typing objective. The pre-training process is conducted on the VOA corpus with randomly initialized parameters.

Table 1: Dataset statistics for supervised IE.

Table 2: Overall test F1-scores (%) of Supervised Joint Information Extraction.

Table 5: The frequencies of the top-10 most selected latent types v and the corresponding top-5 most frequent tokens tagged by type v. The frequencies are computed over 100 randomly sampled sentences from the pre-training corpus and are noted in parentheses after the tokens. The model is continually pre-trained from a bert-base-uncased checkpoint.

Table 4: The latent typing and reconstruction results for the input sentence "She was murdered in her New York office, just days after learning that Waitress had been accepted into the Sundance Film Festival.", with BERT-SparseLT models continually pre-trained with different values of the weighting factor β. The corresponding latent type indices for each token are noted in parentheses.

Table 6: The evaluation results on the development sets of the GLUE benchmark. Following Devlin et al. (2019), we report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. A fixed random seed is applied in all experiments for fair comparison.

Table 7: The test F1 scores on the Intra 5-way 1∼2 shot setting of the Few-NERD dataset for different BERT-SparseLT models continually pre-trained with different loss weighting factors α, β, γ.

Table 8: Detailed settings for model hyper-parameters when pre-training from a bert-base-uncased checkpoint.