Pre-training Universal Language Representation

Despite well-developed cutting-edge representation learning for language, most language representation models focus on a specific level of linguistic unit. This work introduces universal language representation learning, i.e., embeddings of different levels of linguistic units or text with quite diverse lengths in a uniform vector space. We propose the training objective MiSAD, which utilizes meaningful n-grams extracted from a large unlabeled corpus by a simple but effective algorithm for pre-trained language models. We then empirically verify that a well-designed pre-training scheme may effectively yield universal language representation, which brings great convenience when handling multiple layers of linguistic objects in a unified way. In particular, our model achieves the highest accuracy on analogy tasks at different language levels and significantly improves the performance on downstream tasks in the GLUE benchmark and on a question answering dataset.


Introduction
In this paper, we propose universal language representation (ULR), which uniformly embeds linguistic units of different hierarchies in the same vector space. A universal language representation model encodes linguistic units such as words, phrases or sentences into fixed-sized vectors and handles multiple layers of linguistic objects in a unified way. ULR learning offers great convenience when confronted with sequences of different lengths, especially in tasks such as Natural Language Understanding (NLU) and Question Answering (QA); hence it is of great importance in both scientific research and industrial applications.
As is well known, embedding representations for a certain linguistic unit (i.e., words) enable linguistically meaningful arithmetic among vectors, also known as word analogy (Mikolov et al., 2013). Manipulating embeddings in the vector space reveals syntactic and semantic relations between the original symbol sequences, and this feature is indeed useful in real applications. For example, "London is the capital of England" can be formalized as:

England + capital ≈ London

Then, given two documents, one containing "England" and "capital" and the other containing "London", we consider them relevant. A ULR model may generalize such analogy behavior onto free text with all language levels involved together. For example, Eat an onion : Vegetable :: Eat a pear : Fruit.
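As an illustration, the analogy relation can be checked directly with vector arithmetic. The following sketch uses made-up 3-dimensional vectors (purely hypothetical values, not real embeddings):

```python
import numpy as np

# Hypothetical toy embeddings; real models use hundreds of dimensions.
emb = {
    "England": np.array([0.9, 0.1, 0.0]),
    "capital": np.array([0.0, 0.8, 0.3]),
    "London":  np.array([0.85, 0.9, 0.25]),
    "Paris":   np.array([0.1, 0.9, 0.6]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "England" + "capital" should land closer to "London" than to "Paris".
query = emb["England"] + emb["capital"]
best = max(["London", "Paris"], key=lambda w: cos(query, emb[w]))
print(best)  # London
```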
ULR has practical values in dialogue systems, by which human-computer communication will go far beyond executing instructions. One of the main challenges of dialogue systems is Dialogue State Tracking (DST). It can be formulated as a semantic parsing task (Cheng et al., 2020), namely, converting natural language utterances with any length into unified representations. Thus this is essentially a problem that can be conveniently solved by mapping sequences with similar semantic meanings into similar representations in the same vector space according to a ULR model.
Another use of ULR is in the Frequently Asked Questions (FAQ) retrieval task, where the goal is to answer a user's question by retrieving question paraphrases that already have an answer in the database. Such a task can be accurately performed purely by manipulating vectors, e.g., computing and ranking vector distances (e.g., cosine similarity). The core is to embed sequences of different lengths in the same vector space; a ULR model then retrieves the correct question-answer pair for the user query according to vector distance.
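A minimal sketch of such FAQ retrieval, assuming sequence embeddings have already been produced by some encoder (the questions and vectors below are made up for illustration):

```python
import numpy as np

# Hypothetical fixed-size embeddings for stored FAQ questions.
faq = {
    "How do I reset my password?":      np.array([0.9, 0.1, 0.2]),
    "What is the refund policy?":       np.array([0.1, 0.9, 0.1]),
    "How can I change my login email?": np.array([0.7, 0.2, 0.4]),
}
# Embedding of a user query such as "I forgot my password" (made up).
query_vec = np.array([0.85, 0.15, 0.25])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank stored questions by cosine similarity to the query.
ranked = sorted(faq, key=lambda q: cosine(query_vec, faq[q]), reverse=True)
print(ranked[0])  # How do I reset my password?
```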
In this paper, we propose a universal language representation learning method that generates fixed-sized vectors for sequences of different lengths based on pre-trained language models (Devlin et al., 2019; Lan et al., 2019; Clark et al., 2020). We first introduce an efficient approach to extracting and pruning meaningful n-grams from an unlabeled corpus. Then we present a new pre-training objective, Minimizing Symbol-vector Algorithmic Difference (MiSAD), that explicitly applies a penalty over different levels of linguistic units if their representations tend not to lie in the same vector space.
To investigate our model's ability to capture different levels of language information, we introduce an original universal analogy task derived from Google's word analogy dataset, on which our model significantly improves over previous pre-trained language models. Evaluation on a wide range of downstream tasks also demonstrates the effectiveness of our ULR model. Overall, our ULR-BERT reaches the highest average accuracy on the universal analogy dataset and obtains a 1.1% gain over Google BERT on the GLUE benchmark. Extensive experimental results on a question answering task verify that our model can be easily applied to real-world applications in an extremely convenient way.

Related Work
Previous language representation learning methods such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), LASER (Artetxe and Schwenk, 2019), InferSent (Conneau et al., 2017) and USE (Cer et al., 2018) focus on linguistic units of a specific granularity, e.g., words or sentences. The later proposed ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) learn a contextualized representation for each input token. Although such pre-trained language models (PrLMs) are more or less capable of offering universal language representation through their general-purpose training objectives, all of these PrLMs are devoted to contextualized representations from a generic text background and pay little attention to our concerned universal language representation.
As a typical PrLM, BERT is trained on a large amount of unlabeled data with two training objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP). ALBERT (Lan et al., 2019) is trained with Sentence-Order Prediction (SOP) as a replacement for NSP. StructBERT (Wang et al., 2020) combines NSP and SOP to learn inter-sentence structural information. Nevertheless, RoBERTa (Liu et al., 2019) and SpanBERT (Joshi et al., 2020) show that single-sequence training is better than the sentence-pair scenario. Besides, BERT-wwm (Cui et al., 2019), StructBERT (Wang et al., 2020) and SpanBERT (Joshi et al., 2020) perform MLM at higher linguistic levels, augmenting the MLM objective by masking whole words, trigrams or spans, respectively. ELECTRA (Clark et al., 2020) further improves pre-training through a generator-discriminator architecture. The aforementioned models may seemingly handle input sequences of different sizes, but all of them still focus on sentence-level representations built for each word, which may cause unsatisfactory performance in real-world situations.
A series of downstream NLP tasks, especially question answering, may be conveniently and effectively solved through a ULR-like solution. In fact, though in different forms, such tasks increasingly tend to be solvable by our suggested ULR model, including dialogue utterance regularization (Cao et al., 2020), question paraphrasing (Bonadiman et al., 2019), and measuring QA similarities in FAQ tasks (Damani et al., 2020; Sakata et al., 2019).

Model
As pre-trained contextualized language models have shown their power in generic language representation for various downstream NLP tasks, we present a BERT-style ULR model that is especially designed to effectively learn universal, fixed-sized representations for input sequences of any granularity, i.e., words, phrases, and sentences. Our proposed pre-training method is strengthened in three ways. First, we extract a large number of meaningful n-grams from a monolingual corpus based on point-wise mutual information to leverage multi-granular structural information. Second, inspired by word and phrase representations and their compositionality, we introduce a novel pre-training objective that directly models the input sequences and the extracted n-grams by manipulating their representations. Finally, we implement a normalized score for each n-gram to guide its sampling for training.

n-gram Extraction
Given a sentence, Joshi et al. (2020) utilize span-level information by randomly masking and predicting contiguous segments. Different from such a random sampling strategy, our method is based on point-wise mutual information (PMI) (Church and Hanks, 1989), which makes efficient use of corpus statistics and automatically extracts meaningful n-grams from an unlabeled corpus.
Mutual information (MI) describes the association between two tokens by comparing the probability of observing them together with the probabilities of observing them independently. Higher mutual information indicates stronger association between the tokens. To be specific, an n-gram is denoted as w = (x_1, . . . , x_|w|), where |w| > 1 is the number of tokens in w. We present an extended PMI formula as follows:

PMI(w) = (1/|w|) log [ p(w) / (p(x_1) · · · p(x_|w|)) ]

where the probabilities are estimated by counting the number of observations of each token and n-gram in the corpus and normalizing by the corpus size. The factor 1/|w| is an additional normalization that avoids extremely low scores for long n-grams.
We first collect all n-grams with lengths up to N using the SRILM toolkit 1 (Stolcke, 2002), and compute PMI scores for all the n-grams based on their occurrences. Then, only n-grams with PMI scores higher than the chosen threshold are selected and input sequences are marked with the corresponding n-grams.
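The length-normalized PMI score described above can be sketched as follows; the toy counts are hypothetical, and a real implementation would read counts produced by SRILM:

```python
import math
from collections import Counter

def pmi(ngram, unigram_counts, ngram_counts, total):
    """Length-normalized PMI: (1/|w|) * log(p(w) / prod_i p(x_i)),
    with probabilities estimated from raw corpus counts."""
    p_w = ngram_counts[ngram] / total
    p_indep = 1.0
    for tok in ngram:
        p_indep *= unigram_counts[tok] / total
    return math.log(p_w / p_indep) / len(ngram)

# Toy counts: "new york" co-occurs far more often than chance predicts.
unigrams = Counter({"new": 100, "york": 50, "the": 1000, "cat": 40})
bigrams = Counter({("new", "york"): 40, ("the", "cat"): 5})
total = 10_000

scores = {ng: pmi(ng, unigrams, bigrams, total) for ng in bigrams}
print(scores)  # ("new", "york") scores much higher than ("the", "cat")
```

Only n-grams scoring above a chosen threshold would then survive pruning.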

Training Objective
While the MLM training objective as in BERT (Devlin et al., 2019) and its extensions (Cui et al., 2019; Joshi et al., 2020; Wang et al., 2020) is widely used for pre-trained contextualized language modeling, it does not address our concerned ULR, which demands an arithmetic correspondence between a symbol sequence and its representing vector. In order to directly model such a demand, we propose a novel training objective, Minimizing Symbol-vector Algorithmic Difference (MiSAD), that leverages the vector space regularity of different granular linguistic units. For example, the symbol sequence equation

"London is" + "the capital of England" = "London is the capital of England" (1)

indicates, according to our ULR goal, the vector algorithmic equation

vector("London is") + vector("the capital of England") ≈ vector("London is the capital of England") (2)

Thus, if the symbol equation (1) does not imply the respective vector equation (2), we may set a training objective to let the ULR model forcedly learn such a relationship. Formally, we denote the input sequence by S = {x_1, . . . , x_m}, where m is the number of tokens in S. After n-gram extraction and pruning by means of PMI, each sequence is marked with several n-grams. During pre-training, only one of them, w = {x_i, . . . , x_j}, a sub-sequence of S, is selected by the n-gram scoring function, which is introduced in detail in Section 3.3. Then we split S into two independent parts, the n-gram w and the rest of the tokens R = {x_1, . . . , x_{i−1}, x_{j+1}, . . . , x_m}, which are fed into the model separately along with the original complete sequence.

1 http://www.speech.sri.com/projects/srilm/download.html
The Transformer encoder generates a contextualized representation for each token in the sequence. To derive fixed-sized vectors for sequences of different lengths, we use the pooled output of the [CLS] token as the sequence embedding. The model is trained to minimize the following Mean Square Error (MSE) loss:

L_MiSAD = ||E_w + E_R − E_S||^2

where E_w, E_R and E_S are the representations of w, R and S, respectively, all normalized to unit length. To enhance the robustness of the model, we jointly train MiSAD and the MLM objective L_MLM as in BERT with equal weights. Since the input sentence S is split into w + R, we must avoid masking out the n-gram w in the original sentence so as not to affect the semantics after vector space combination. However, tokens in n-grams other than w have the same probability of being replaced with [MASK] as any other tokens. The final loss function is as follows:

L = L_MLM + L_MiSAD
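A numpy sketch of the MiSAD penalty under our reading of the text (the MSE between the sum of the unit-normalized part embeddings and the unit-normalized whole-sequence embedding; in actual training these vectors would be the pooled [CLS] outputs of the Transformer):

```python
import numpy as np

def misad_loss(e_w, e_r, e_s):
    """MSE between E_w + E_R and E_S after unit-length normalization."""
    e_w, e_r, e_s = (v / np.linalg.norm(v) for v in (e_w, e_r, e_s))
    diff = e_w + e_r - e_s
    return float(np.mean(diff ** 2))

# The loss is small when the whole-sequence vector lies near the sum of
# its parts, and large when it points elsewhere (toy 2-d vectors).
compositional = misad_loss(np.array([1., 0.]), np.array([0., 1.]),
                           np.array([1., 1.]))
unrelated     = misad_loss(np.array([1., 0.]), np.array([0., 1.]),
                           np.array([1., -1.]))
print(compositional < unrelated)  # True
```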

n-gram Sampling
For a given sequence, the importance of different n-grams and the degree to which the model understands their semantics are different. Instead of sampling n-grams at random, we let the model decide which n-gram to choose based on the knowledge learned in the pre-training stage. Following Tamborrino et al. (2020), we employ a normalized score for each n-gram in the input sequence using the masked language modeling head.
We mask one n-gram at a time and the model outputs probabilities of the masked tokens given their surrounding context. The score of an n-gram w is calculated as the average probability of all tokens in it:

score(w) = (1/|w|) Σ_{x∈w} P(x | S\w)

where |w| is the length of w and S\w denotes the input sequence S with all tokens within w replaced by the special token [MASK]. Finally, we choose the n-gram with the lowest score as our training target.
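The selection step can be sketched as follows, with hypothetical masked-LM probabilities standing in for the model's actual outputs:

```python
# Hypothetical per-token masked-LM probabilities for two candidate n-grams
# appearing in one input sequence (a real model would produce these).
candidates = {
    ("new", "york"): [0.91, 0.88],           # the model predicts these easily
    ("quantum", "annealing"): [0.12, 0.05],  # the model struggles here
}

def ngram_score(token_probs):
    """Average probability of the n-gram's masked tokens."""
    return sum(token_probs) / len(token_probs)

# Train on the n-gram the model understands least, i.e. the lowest score.
chosen = min(candidates, key=lambda ng: ngram_score(candidates[ng]))
print(chosen)  # ('quantum', 'annealing')
```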

Implementation of ULR Pre-training
This section introduces our ULR pre-training details.
As for the pre-training corpus, we download the English Wikipedia Corpus 2 and pre-process it with process_wiki.py 3, which extracts text from XML files. When processing paragraphs from Wikipedia, we find that a large number of entities are annotated with special marks, which may be useful for our task. Therefore, we identify all the entities and treat them as high-quality n-grams. Then, we remove punctuation marks and characters in other languages based on regular expressions, and finally obtain a corpus of 2,266M words.
As for n-gram pruning, PMI scores of all n-grams with a maximum length of N = 6 are calculated for each document. We manually evaluate the extracted n-grams and find that more than 50% of the top 2000 n-grams contain 2 ∼ 3 words, and less than 3% are longer than 4 words. Although a larger n-gram vocabulary can cover longer n-grams, it also introduces too many meaningless n-grams. Therefore, we empirically retain the top 3000 n-grams for each document. Finally, we randomly sample 10M sentences from the entire corpus to reduce training time.
During pre-training, BERT packs sentence pairs into a single sequence and uses the special [CLS] token as the sentence-pair representation. However, our MiSAD training objective requires single-sentence inputs. Thus in our experiments, each input is an n-gram or a single sequence with a maximum length of 128. Special tokens [CLS] and [SEP] are added at the beginning and end of each input, respectively. Instead of training from scratch, we initialize our model with the officially released checkpoints of BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019) and ELECTRA (Clark et al., 2020). We use the Adam optimizer (Kingma and Ba, 2017) with an initial learning rate of 5e-5 and linear warmup over the first 10% of the training steps. The batch size is 64 and the dropout rate is 0.1. Each model is trained for one epoch over the 10M training examples on four Nvidia Tesla P40 GPUs.

Tasks
We construct a universal analogy dataset in terms of words, phrases and sentences and experiment with multiple representation models to examine their ability to represent different levels of linguistic units through a task-independent evaluation 4. Furthermore, we conduct experiments on a wide range of downstream tasks from the GLUE benchmark and on a question answering task.

Universal Analogy
Our universal analogy dataset is based on Google's word analogy dataset and contains three levels of tasks: words, phrases and sentences.

Word-level Recall that in a word analogy task (Mikolov et al., 2013), two pairs of words that share the same type of relationship, denoted as A : B :: C : D, are involved. The goal is to retrieve the last word from the vocabulary given the first three words. To facilitate comparison between models with different vocabularies, we construct a closed-vocabulary analogy task based on Google's word analogy dataset through negative sampling. Concretely, for each original question, we use GloVe to rank every word in the vocabulary, and the top 5 results are considered candidate words. If GloVe fails to retrieve the correct answer, we manually add it to make sure it is included in the candidates. During evaluation, the model is expected to select the correct answer from the 5 candidate words. Table 1 shows examples from our word analogy dataset.

Phrase-/Sentence-level To derive higher-level analogy datasets, we put word pairs from the word-level dataset into contexts so that the resulting phrase and sentence pairs also have linear relationships. Phrase and sentence templates are extracted from the English Wikipedia Corpus. Both phrase and sentence datasets have four types of semantic analogy and three kinds of syntactic analogy. Please refer to Appendix A for details of our approach to constructing the universal analogy dataset.
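Evaluation on the closed-vocabulary task reduces to nearest-neighbor search over the candidate set; a sketch with toy 2-d vectors (hypothetical values chosen so the classic offset holds):

```python
import numpy as np

def solve_analogy(a, b, c, candidates, emb):
    """A : B :: C : ?  --  pick the candidate whose embedding is closest
    (by cosine similarity) to emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(candidates, key=lambda w: cos(emb[w], target))

# Toy embeddings (made up for illustration only).
emb = {
    "man": np.array([1., 0.]), "woman": np.array([0., 1.]),
    "king": np.array([1., 1.]), "queen": np.array([0., 2.]),
    "prince": np.array([2., 1.]),
}
print(solve_analogy("man", "woman", "king", ["queen", "prince"], emb))
```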

GLUE
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of tasks that are widely used to evaluate the performance of a model in language understanding. We divide NLU tasks from the GLUE benchmark into three main categories.
Single-Sentence Classification Single-sentence classification tasks include SST-2 (Socher et al., 2013), a sentiment classification task, and CoLA (Warstadt et al., 2019), a task of determining whether a sentence is grammatically acceptable.

Natural Language Inference GLUE contains four NLI tasks: MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Bentivogli et al., 2009) and WNLI (Levesque et al., 2012). However, we exclude the problematic WNLI in accordance with Devlin et al. (2019).

Semantic Similarity MRPC (Dolan and Brockett, 2005), QQP (Chen et al., 2018) and STS-B (Cer et al., 2017) are semantic similarity tasks, where the model is required to either determine whether two sentences are equivalent or assign them a similarity score.
In the fine-tuning stage, pairs of sentences are concatenated into a single sequence with a special token [SEP] in between. For both single-sentence and sentence-pair tasks, the hidden state of the first token [CLS] is used for softmax classification. We use the same sets of hyperparameters for all the evaluated models. Experiments are run with batch sizes in {8, 16, 32, 64} and a learning rate of 3e-5 for 3 epochs.
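The input packing described above can be sketched as follows (token lists stand in for WordPiece output; truncation handling is simplified):

```python
def pack_inputs(tokens_a, tokens_b=None, max_len=128):
    """Build a BERT-style input: [CLS] A [SEP] for single sentences,
    [CLS] A [SEP] B [SEP] for pairs, with 0/1 segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)
    return tokens[:max_len], segments[:max_len]

tokens, segments = pack_inputs(["the", "cat"], ["a", "dog"])
print(tokens)    # ['[CLS]', 'the', 'cat', '[SEP]', 'a', 'dog', '[SEP]']
print(segments)  # [0, 0, 0, 0, 1, 1, 1]
```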

GEOGRANNO
GEOGRANNO (Herzig and Berant, 2019) contains natural language paraphrases paired with logical forms. The dataset is manually annotated: For each natural language utterance, a correct canonical utterance paraphrase is selected. The train/dev sets have 487 and 59 paraphrase pairs, respectively. In our experiments, we focus on question paraphrase retrieval, whose task is to retrieve the correct paraphrase from all 158 different sentences when given a question. Most of the queries have only one correct answer while some have two or more matches. Evaluation metrics are Top-1/5/10 accuracy.
For GEOGRANNO and the universal analogy task, we apply three pooling strategies on top of the PrLM: using the vector of the [CLS] token, mean-pooling of all token embeddings, and max-pooling over time of all embeddings. The default setting is mean-pooling.
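The three pooling strategies can be sketched as below, with a toy [seq_len, hidden] matrix standing in for the encoder's token outputs:

```python
import numpy as np

def pool(token_embs, strategy="mean"):
    """Fixed-size sequence vector from per-token embeddings.
    token_embs: array of shape [seq_len, hidden]; [CLS] assumed at index 0."""
    if strategy == "cls":
        return token_embs[0]
    if strategy == "mean":
        return token_embs.mean(axis=0)
    if strategy == "max":
        return token_embs.max(axis=0)  # max-pooling over time
    raise ValueError(f"unknown strategy: {strategy}")

h = np.array([[1., 2.], [3., 4.], [5., 0.]])
print(pool(h, "cls"))   # [1. 2.]
print(pool(h, "mean"))  # [3. 2.]
print(pool(h, "max"))   # [5. 4.]
```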
MLM-BERT BERT models trained with the same additional steps as our model on Wikipedia using only the MLM objective.
ULR-BERT Our universal language representation model trained on Wikipedia with MLM and MiSAD.

Universal Analogy
Results on our universal analogy dataset are reported in Table 2. Generally, semantic analogies are more challenging than syntactic ones, and higher-level relationships between sequences are more difficult to capture, which is observed in almost all the evaluated models. On the word analogy task, GloVe achieves the highest accuracy (80.3%), while its performance drops sharply on higher-level tasks. Well-trained PrLMs like BERT, ALBERT and ELECTRA hardly exhibit arithmetic characteristics, and increasing the model size usually leads to a decrease in accuracy.

5 https://gluebenchmark.com
However, training models with our properly designed MiSAD objective greatly improves the performance. In particular, ULR-BERT obtains 15% ∼ 25% absolute gains on word-level analogy, results comparable to GloVe, whose training scheme especially focuses on the linear word analogy feature, while GloVe performs far worse than our model on higher-level analogies. Overall, ULR-BERT achieves the highest average accuracy (45.8%), an absolute gain of 8.1% over BERT, indicating that it has indeed learned universal language representations across different linguistic units more effectively. This demonstrates that our pre-training method is effective and can be adapted to different PrLMs.

Table 3 shows the performance on the GLUE benchmark. Our model improves BERT BASE and BERT LARGE by 1.1% and 0.7% on average, respectively. Since our model is established on the released checkpoints of Google BERT, we make an additional comparison with MLM-BERT, which is trained under the same procedure as our model except for the pre-training objective. While the model trained with more MLM updates may improve the performance on some tasks, it underperforms BERT on datasets such as MRPC, RTE and SST-2. Our model exceeds MLM-BERT BASE and MLM-BERT LARGE by 0.9% and 0.7% on average, respectively. The main gains from the base model are in CoLA (+4.6%) and RTE (+1.4%), which are entirely contributed by our MiSAD training objective. Overall, our model improves the performance of its baseline on every dataset in the GLUE benchmark, demonstrating its effectiveness in real applications of natural language understanding.

Table 3: Test results on the GLUE benchmark scored by the evaluation server 5. We exclude the problematic WNLI dataset and recalculate the "Avg." score. Results for BERT BASE and BERT LARGE are obtained from Devlin et al. (2019). "mc" and "pc" are the Matthews correlation coefficient (Matthews, 1975) and the Pearson correlation coefficient, respectively.
GEOGRANNO

Table 4 shows the performance on GEOGRANNO. As we can see, 4 out of 6 evaluated pre-trained language models significantly outperform BM25 in Top-1 accuracy, indicating the superiority of contextualized embedding-based models over the statistical method. Among all the evaluated models, ULR-BERT yields the highest accuracies (39.7%/68.8%/77.3%). To be specific, our ULR models exceed BERT BASE and BERT LARGE by 10.1% and 19.2%, and obtain 2.7% and 10.6% improvements over MLM-BERT BASE and MLM-BERT LARGE in terms of Top-1 accuracy, respectively, which is consistent with the results on the GLUE benchmark. Since n-grams and sentences of different lengths are involved in the pre-training of our model, it is especially good at understanding the semantics of input sequences and mapping queries to their paraphrases according to the learned sense of semantic equivalence.

Ablation Study
In this section, we explore to what extent our model benefits from the MiSAD objective and the sampling strategy, and further confirm that our pre-training procedure improves the model's ability to encode variable-length sequences.

Effect of Training Objectives
To make a fair comparison, we train BERT with the same additional updates using different combinations of training tasks: NSP-BERT is trained with MLM and NSP, whose goal is to distinguish whether two input sentences are consecutive. For each sentence, we choose its following sentence 50% of the time and randomly sample a sentence 50% of the time.
SOP-BERT is trained with MLM and SOP, a substitute for the NSP task that aims at better modeling the coherence between sentences. Consistent with Lan et al. (2019), we sample two consecutive sentences in the same document as a positive sample, and reverse their order 50% of the time to create a negative sample.
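The two pair-construction schemes can be sketched as follows; `sentences` is a hypothetical in-order list of sentences from one document:

```python
import random

def make_nsp_example(sentences, i, rng):
    """NSP: 50% true next sentence (label 1), 50% random sentence (label 0)."""
    if rng.random() < 0.5:
        return sentences[i], sentences[i + 1], 1
    return sentences[i], rng.choice(sentences), 0

def make_sop_example(sentences, i, rng):
    """SOP: two consecutive sentences, order reversed 50% of the time."""
    a, b = sentences[i], sentences[i + 1]
    if rng.random() < 0.5:
        return a, b, 1   # in order  -> positive
    return b, a, 0       # swapped   -> negative

rng = random.Random(0)
doc = ["s0", "s1", "s2", "s3"]
print(make_sop_example(doc, 0, rng))
```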
For both the baselines and ULR, we use the same set of parameters for 5 runs, and average scores on the GLUE test set are reported in Table 5. Although we expect NSP and SOP to help the model better understand the relationship between sentences and benefit tasks like natural language inference, they hardly improve the performance on GLUE according to our strict implementation. Specifically, NSP-BERT outperforms MLM-BERT on datasets such as CoLA, QNLI and QQP while being less satisfactory on other tasks. SOP-BERT is on a par with MLM-BERT on the three NLI tasks but sharply decreases the scores on other datasets. In general, single-sentence training with only the MLM objective accounts for better performance, as described by Liu et al. (2019) and Joshi et al. (2020). Besides, our training strategy, which combines MLM and MiSAD, yields the most considerable gains compared with the other training objectives.

Table 6 shows the standard deviation, mean and maximum performance on the CoLA/RTE/MRPC dev sets when fine-tuning BERT and ULR-BERT over 5 random seeds, which clearly shows that our model is generally more stable and yields better results compared with BERT.

Effect of Sampling Strategies
We compare our PMI-based n-gram sampling scheme with two alternatives. Specifically, we train the following two baseline models under the same model settings except for the sampling strategy.

Random Spans We replace our n-gram module with the masking strategy proposed by Joshi et al. (2020), where the sampling probability of span length l is based on a geometric distribution l ∼ Geo(p). The parameter p is set to 0.2 and the maximum span length l max = 6.

Named Entities We only retain named entities that are annotated in the Wikipedia Corpus.

Table 7 shows the effect of different sampling schemes on the GLUE dev set. As we can see, our PMI-based n-gram sampling is preferable to the other strategies on 6 out of 8 tasks. CoLA and RTE are more sensitive to sampling strategies than the other tasks. On average, using named entities or meaningful n-grams is better than randomly sampled spans. We attribute this to the fact that random span sampling ignores the important semantic and syntactic structure of a sequence, resulting in a large number of meaningless segments. Compared with using only named entities, our PMI-based approach automatically discovers structures within any sequence and is not limited to any granularity, which is critical to pre-training universal language representation.
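For reference, the geometric span-length sampling used by the Random Spans baseline can be sketched as below (following the l ∼ Geo(p) description; rejecting lengths above l max is our own reading of how the cap is enforced):

```python
import random

def sample_span_length(p=0.2, l_max=6, rng=None):
    """Draw l ~ Geo(p) (number of trials until first success),
    rejecting draws longer than l_max."""
    rng = rng or random.Random()
    while True:
        l = 1
        while rng.random() > p:
            l += 1
        if l <= l_max:
            return l

rng = random.Random(0)
lengths = [sample_span_length(rng=rng) for _ in range(1000)]
print(min(lengths), max(lengths))  # always within [1, 6]
```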

Application to Different Models
Experiments on the universal analogy task reveal that our proposed training scheme can be adapted to various pre-trained language models. In this subsection, we compare our model with BERT, ALBERT and ELECTRA on GEOGRANNO and the GLUE benchmark. Table 8 shows the results on GEOGRANNO and the GLUE dev set, where our approach enhances the performance of all three pre-trained models. Among all the evaluated models, ULR-BERT achieves the largest gains on GLUE, while ULR-ELECTRA obtains the most significant improvement on GEOGRANNO. This further verifies the effectiveness and universality of our model.

Effect of Sequence Length
In previous experiments on GEOGRANNO, our model has shown considerable improvement over all three evaluated PrLMs. The task involves text matching between linguistic units at different levels, where queries are sentences and labels are often phrases. Thus the performance on such a task highly depends on the model's ability to uniformly deal with linguistic units of different granularities. In the following, we explore in more detail how our proposed objective acts at different levels of linguistic units. Specifically, we show the consistency of the representations learned by ULR-BERT by grouping the dataset according to query length |q| and the absolute difference between query length and Question length abs(|q| − |Q|), respectively.
Results are shown in Table 9, which clearly shows that as the length of the query increases, the performance of BERT drops sharply. Similarly, BERT is more sensitive to the difference between query length and Question length. In contrast, ULR-BERT is more stable when dealing with sequences of different lengths and is superior to BERT in terms of representation consistency, which we speculate is due to the interaction between different levels of linguistic units in the pre-training procedure.

Conclusion
This work formally introduces universal language representation learning to enable unified vector operations among different language hierarchies. For this purpose, we propose three highlighted ULR learning enhancements, including the newly designed training objective, Minimizing Symbol-vector Algorithmic Difference (MiSAD). In the detailed model implementation, we extend BERT's pre-training objective to a more general level, which leverages information from sequences of different lengths in a comprehensive way. In addition, we provide a universal analogy dataset as a task-independent evaluation benchmark. Overall experimental results show that our proposed ULR model is generally effective in a broad range of NLP tasks, including natural language understanding and question answering.