Rethinking Denoised Auto-Encoding in Language Pre-Training

Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to a discrepancy between the pre-training and fine-tuning stages. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise invariant sequence representations. The proposed CAPT encourages consistency between the representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing the global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable to both language and vision-language tasks, and obtains surprisingly consistent improvements, including a 0.6% absolute gain on the GLUE benchmark and a 0.8% absolute increment on NLVR2.


Introduction
Recently, pre-trained self-supervised models such as BERT (Devlin et al., 2019) have attracted an increasing amount of attention in natural language processing and vision-language processing. Benefiting from the common knowledge contained in massive unlabeled data, the pretraining-finetuning framework has become a representative paradigm for advancing various language-related downstream tasks.

[Table 1: The type of noise used in current natural language (upper) and vision-language (lower) sequence representation models, e.g. masked tokens/regions in UNITER and LXMERT (Tan and Bansal, 2019).]
Most endeavors on pre-trained representation models rely on elaborately designed self-supervised tasks, which typically corrupt the given sequence with certain types of noise (e.g., masking in BERT, as in Table 1) and then train the model to recover the original sequence. As a consequence, the learned representations tend to be covariant with the input noise of pre-training in this paradigm. However, when transferred to downstream tasks, the pre-trained model is responsible for encoding the original sequence without noise, and is expected to produce noise invariant representations. Such pretrain-finetune discrepancy not only impedes fast fine-tuning, but may also result in suboptimal sequence representations, thus affecting performance on downstream tasks.
To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise invariant (or denoised) sequence representations. The core idea of CAPT is to enhance the consistency between the semantic representation of the original sequence and that of its corrupted version (e.g. the masked sequence) via unsupervised instance-wise training signals. As shown in Figure 1, our approach strives to pull the representation of the corrupted sequence towards that of the original instance in the semantic space, while pushing it away from the representations of other instances. This training objective is formulated as a multi-class classification task, which aims at classifying the original sequence into the class of its corrupted version and vice versa, while classifying different instances into different classes. Moreover, in order to enable the model to learn from more "difficult" and "diverse" instances, two effective methods are proposed to further enhance the capability of the model to extract noise-invariant and instance-diffused features. With such a training objective, the pre-trained model is encouraged to learn noise invariant representations, thereby alleviating the pretrain-finetune discrepancy to some extent.
As an additional benefit, our approach also helps the pre-trained model capture the global semantics of the input more effectively. Most prior work focuses only on token-level pre-training tasks (e.g. masked language modeling), which lack the modeling of the global semantics of the input. Some other efforts alleviate this problem by introducing sentence-level pre-training tasks (e.g. next sentence prediction) that rely on the relative position of segments in the document. However, the semantic connection between these segments tends to be excessively loose, which may result in confusing gradient signals. By contrast, our CAPT offers incentives for the representations of inputs sharing the same semantics (the original instance and its corrupted version) to be similar, while the representations of inputs expressing different semantics (different instances) are pushed to be distinct from each other. Such more reasonable sentence-level supervision enables our approach to look beyond the local structures of input sequences and become more aware of the global semantics.
We perform the evaluation on a comprehensive suite of benchmarks, covering 8 natural language understanding and 3 cross-modal tasks. Extensive empirical evidence demonstrates that our approach achieves consistent improvements over the baselines in both the language and vision-language domains. To be more specific, our CAPT raises the performance of RoBERTa from 88.9% to 89.5% on the GLUE dev set, and also surpasses LXMERT (Tan and Bansal, 2019) by 0.5%, 0.6% and 0.8% on VQA, GQA and NLVR2, respectively.

Contrastive Pre-training
The proposed CAPT is highly versatile and can be built on various pre-trained models in either the language or vision-language domain. Therefore, we use the symbol E to represent a series of generalized pre-trained models. Starting from the property of semantic representations that inputs sharing semantics should exhibit similar representations, our CAPT strives to capture the global semantics of the input more effectively. Different from prior work that tends to learn representations covariant with the noise of pre-training, our CAPT aims at aiding E in learning noise invariant sequence representations by enhancing the consistency between the representations of the original sequence and its corrupted version.
Specifically, for a pre-trained model E and an input sequence x, model-specific noise (e.g. masking in BERT) can be added to x to construct its corrupted version x̂. Then, the pre-trained model E encodes x or x̂ with the self-attention mechanism (Vaswani et al., 2017) to obtain hidden representations h(x) = E(x) or h(x̂) = E(x̂). Both h(x) and h(x̂) belong to the representation space R^{m×d}, where m denotes the length of the input sequence and d is the dimension of the hidden representations.
Different from prior work (Devlin et al., 2019), we apply an extra aggregation layer A to obtain the global semantic representation of the input. Here A can be implemented as a multi-layer perceptron with the representation of the special classification token or the mean-pooling of all token representations as input. The final global semantic representations s(x) ∈ R^d and s(x̂) ∈ R^d of x and x̂ are computed as:

    s(x) = norm_2((A ∘ E)(x)),  s(x̂) = norm_2((A ∘ E)(x̂)),  (1)

where norm_2(·) represents ℓ2-normalization and ∘ denotes the composition of operations. In order to obtain noise invariant sequence representations, we expect s(x) and s(x̂) to be as similar as possible, which can also be derived from the characteristic that x and x̂ share semantics. At the same time, the global semantic representations of different instances should be distinguished from each other to extract the high-level instance-specific signals of the input. Motivated by this, we employ a contrastive loss (Hadsell et al., 2006) to model such training objectives, which can be formalized as a multi-class classification task. We represent a training batch of original sequences as {x_1, · · · , x_n}, where n is the batch size, and the corresponding corrupted data is denoted as {x̂_1, · · · , x̂_n}. Intuitively, the loss should be low when s_i is similar to its corrupted version ŝ_i (positive example) and dissimilar to all other inputs (negative examples). Thus, the training loss for the original sequence x_i is defined as:

    L(x_i) = −log [ exp(s_i · ŝ_i / τ) / Σ_{j=1}^{n} ( exp(s_i · ŝ_j / τ) + 1[j ≠ i] · exp(s_i · s_j / τ) ) ],  (2)

where s_i = s(x_i), ŝ_i = s(x̂_i), and τ is the temperature presented in Section 2.2. Similarly, the training loss for the corrupted sequence x̂_i can be defined as:

    L(x̂_i) = −log [ exp(ŝ_i · s_i / τ) / Σ_{j=1}^{n} ( exp(ŝ_i · s_j / τ) + 1[j ≠ i] · exp(ŝ_i · ŝ_j / τ) ) ].  (3)

[Figure 2: CAPT classifies the original sequence (e.g. x_1) to the class of its corrupted version (e.g. x̂_1) and vice versa to encourage them to be similar in the semantic space, while classifying different instances into other classes to encourage them to be distant (see Eq. (2)).]

Eq. (2) and Eq. (3) essentially correspond to the log loss of a softmax-based classifier measuring semantic similarity by dot product. The classifier treats each instance as a distinct class, and aims to classify x_i to the class of x̂_i and vice versa. More vividly, as shown in Figure 2, Eq. (2) and Eq. (3) strive to pull the original representation s_i towards the representation ŝ_i of the corrupted sequence x̂_i, and push it away from the global semantic representations of other sequences. By maximizing the semantic similarity of the global representations of x_i and x̂_i, the model is encouraged to learn noise-invariant and instance-diffused representations. On this account, the self-supervised representation model is pre-trained in a manner that is more applicable to the noise-free data distribution. This alleviates the pretrain-finetune discrepancy induced by the noise of pre-training to some extent, leading to improved performance in downstream scenarios. Besides, by introducing more reasonable sentence-level supervision, our approach can also capture the global semantics of the input more effectively. For the original training batch {x_1, · · · , x_n} and the constructed corrupted inputs {x̂_1, · · · , x̂_n}, the final contrastive loss is the sum of the losses of all instances:

    L = Σ_{i=1}^{n} ( L(x_i) + L(x̂_i) ).  (4)
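To make the objective concrete, the following is a minimal pure-Python sketch of Eq. (2)-(4) on toy vectors. The function names and the 2-dimensional toy representations are our own illustrative choices; a real implementation would operate on the batched outputs of the aggregation layer A.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # norm_2(.) in Eq. (1); guard against the zero vector
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def capt_loss(S, S_hat, tau=0.1):
    """Contrastive loss of Eq. (2)-(4): for each anchor s_i, the positive is
    its corrupted counterpart s_hat_i; the negatives are all other
    representations in the batch (both the original and corrupted views)."""
    S = [l2_normalize(s) for s in S]
    S_hat = [l2_normalize(s) for s in S_hat]
    n = len(S)
    total = 0.0
    # First pass computes Eq. (2) over all i, second pass Eq. (3); Eq. (4) sums both.
    for anchors, others in ((S, S_hat), (S_hat, S)):
        for i in range(n):
            pos = math.exp(dot(anchors[i], others[i]) / tau)
            denom = sum(math.exp(dot(anchors[i], others[j]) / tau) for j in range(n))
            denom += sum(math.exp(dot(anchors[i], anchors[j]) / tau)
                         for j in range(n) if j != i)
            total += -math.log(pos / denom)
    return total
```

When the pairs are semantically aligned the loss is driven down, while mismatched pairs raise it, which is exactly the pull/push behavior depicted in Figure 2.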

Model Extensions
We further improve the proposed CAPT with two extensions: adaptive temperature and a memory queue.
Adaptive Temperature. Prior work (Chen et al., 2020) has illustrated that the temperature τ, which controls the concentration level of the sample distribution in Eq. (2) and Eq. (3), exhibits a significant impact on model performance. A suitable temperature can help the model learn from hard negatives via the gradient (Chen et al., 2020). Thus, it needs to be tuned elaborately to obtain a satisfactory fixed value. However, the optimal τ is constantly evolving as training proceeds. In fact, the gradient of the loss function Eq. (2) with respect to the representation s_i can be derived as:

    ∂L(x_i)/∂s_i = −(1/τ) ( ŝ_i − Σ_{z ∈ C_i} p(z | s_i) · z ),  (5)

where C_i = {ŝ_1, · · · , ŝ_n} ∪ {s_j}_{j≠i} is the candidate set of Eq. (2) and p(z | s_i) is the softmax probability assigned to candidate z. We found that the norm of the gradient in Eq. (5) tends to be inversely proportional to τ. Ideally, the model update should be carefully controlled not only in the early stage of contrastive pre-training to stabilize the training, but also in the later stage to avoid ending up bouncing around the minimum or getting stuck in local optima. Therefore, we implement the temperature τ as the following inverted triangle schedule instead of a predetermined fixed value:

    τ(t) = τ_min + (τ_max − τ_min) · |2t/T − 1|,  (6)

where t denotes the learning step, T refers to the preset total number of updates, and [τ_min, τ_max] bounds the temperature range.
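As a sketch, the schedule in Eq. (6) takes only a few lines. The concrete bounds τ_min = 0.05 and τ_max = 0.2 below are illustrative placeholders, not values prescribed by the text:

```python
def adaptive_tau(t, T, tau_min=0.05, tau_max=0.2):
    """Inverted-triangle temperature schedule of Eq. (6): tau starts at
    tau_max (damping early updates, since the gradient norm scales as 1/tau),
    decays linearly to tau_min at mid-training (sharpening the distribution
    so hard negatives dominate), then rises back to tau_max to stabilize the
    later stage."""
    return tau_min + (tau_max - tau_min) * abs(2 * t / T - 1)
```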
Memory Queue. As depicted in Figure 2, we treat different instances from the same batch as negative samples. Several recent studies (Chen et al., 2020; Dai and Lin, 2017) have illustrated that contrastive learning can benefit from a larger set of negative representations. However, due to massive model parameters and limited machine memory, implementations with large batch sizes tend to be infeasible in many circumstances. To remedy this, we employ a dynamic memory queue Q to store the desired negative representations (He et al., 2019). At each learning step, the aggregated representations of the current batch of original inputs and the corresponding corrupted inputs are enqueued into Q. Once the preset capacity of Q is reached, the oldest representations are dequeued. Different from (Wu et al., 2018), the negative representations stored in Q are updated along with the training process to provide competitive confounders and informative signals for the positive instances.
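A minimal sketch of such a FIFO queue follows; the class and method names are our own, and a practical implementation would store detached tensors rather than plain Python objects:

```python
from collections import deque

class MemoryQueue:
    """Dynamic queue of negative representations (in the spirit of
    He et al., 2019): new batch representations are enqueued and, once
    the preset capacity is reached, the oldest entries are dequeued."""
    def __init__(self, capacity=8192):
        # deque with maxlen evicts the oldest items automatically
        self.buf = deque(maxlen=capacity)

    def enqueue(self, batch_reprs):
        # both original and corrupted representations of the batch go in
        self.buf.extend(batch_reprs)

    def negatives(self):
        # snapshot of the stored negatives for the current loss computation
        return list(self.buf)
```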

Experiments on Language Tasks
This section presents experiments on language tasks, including specific implementation and detailed results.

Implementation
For learning language representations, the noise corrupting the input sentence x can be implemented as masking as in BERT or shuffling as in BART. In our implementation, the main experiments follow the same corruption as BERT. That is, we randomly mask 15% of the tokens of x to construct its corrupted version x̂. Then, both x and x̂ in the training batch are fed into the encoder to compute the CAPT loss and the masked language model (MLM) loss simultaneously. The final training loss is the sum of the above two. More analysis of the influence of other corruption approaches can be found in Section 5.2. During pre-training, we train a small model (Section 5.1) to validate the influence of the key components of CAPT, and a large model (Section 3.3 and Section 5.2) to demonstrate the effectiveness of CAPT for learning denoised text representations at a large scale. The small model, which is designed as a 6-layer Transformer with 256 hidden size and 4 attention heads, is trained on the BookCorpus and English Wikipedia datasets. For the large CAPT model, we adopt the RoBERTa-Large architecture and training settings, i.e., a 24-layer Transformer with 1024 hidden size and 16 attention heads, trained on larger datasets. Readers can refer to the RoBERTa paper for the statistics of the datasets and processing details. The aggregation layer A, which takes the representation of the special classification token as input, is implemented as a nonlinear projection with one hidden layer. The inner hidden size of A is set to the same as the FFN inner hidden size, and the output hidden size is set to the same as the Transformer hidden size. The queue size is set to 8192 and we use the Adam optimizer. The peak learning rate with linear warmup and decay is set to 5e-4 and 6e-4 for the small and large models, respectively.

[Table 2: GLUE test and dev results of large models (24-layer Transformer). We list the results on each set that are available in the published papers. "Avg" denotes the average score in terms of the reported metrics, which is slightly different from that in the GLUE leaderboard.]
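The 15% masking corruption described above can be sketched as follows. The `[MASK]` string and single-pass replacement are simplifications of our own (BERT's 80/10/10 replacement scheme and subword handling are omitted):

```python
import random

def corrupt_with_masking(tokens, mask_token="[MASK]", ratio=0.15, seed=0):
    """Construct the corrupted version x-hat by replacing a random 15% of
    the tokens with the mask token (a simplified illustrative sketch)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok
            for i, tok in enumerate(tokens)]
```

Both the original token list and its corrupted copy would then be fed to the encoder to compute the CAPT and MLM losses.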

Evaluation Tasks
We perform the evaluation on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019a). Following previous work (Devlin et al., 2019), we experiment on 8 natural language understanding tasks, including linguistic acceptability (CoLA), sentiment analysis (SST), text paraphrase (MRPC and QQP), sentence similarity (STS-B), and natural language inference (MNLI, QNLI, and RTE). For fine-tuning, we adopt the same settings and hyper-parameters as those of RoBERTa. All GLUE tasks are framed as single-sentence or sentence-pair classification tasks, except for STS-B, which is a regression task. Extra multi-layer perceptrons (MLPs) are added to perform classification or regression with the representation of the special classification token as input. The evaluation metrics include Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks.

Results
Following prior work (Yang et al., 2019), we report results on both the dev and test sets. For the dev set, we report the median of multiple random fine-tuning runs, as in RoBERTa, to show reliable results. For the test set, since the ground-truth labels are not obtainable, we only made a single-model submission to the GLUE evaluation server. Note that most systems on the GLUE leaderboard adopt different ensemble methods and task-specific fine-tuning techniques (e.g. formulating QNLI as a ranking task or using multi-task fine-tuning), increasing the difficulty of a direct comparison. Thus, following (Devlin et al., 2019), we only report non-ensemble single-task results. Table 2 presents the performance of representative models on the GLUE benchmark. We can see that the proposed CAPT obtains the best results on most datasets. In more detail, CAPT outperforms RoBERTa, a very strong baseline, on most language understanding datasets, with 0.3% and 0.6% improvements in the average test and dev scores, respectively. Note that CAPT and RoBERTa are nearly identical in terms of model architecture and fine-tuning hyper-parameters, the only difference being the incorporation of contrastive pre-training in CAPT. Thus we can attribute the performance improvements on downstream tasks to the role contrastive pre-training plays in learning noise invariant sequence representations. We believe the reason behind its success is the capability of CAPT to alleviate the pretrain-finetune discrepancy induced by the noise of pre-training.
In particular, we find that CAPT performs extremely well on natural language inference (RTE, MNLI), which requires a deep understanding of sentence semantics, with a 1.0% absolute improvement in average accuracy over RoBERTa on the dev set. This phenomenon can possibly be explained by the fact that CAPT better captures the global semantics of the input sequence thanks to the more effective sentence-level supervision provided by contrastive training with negative sampling from a memory queue. More analysis of the influence of the memory queue can be found in Section 5.1.

[Table 3: Comparison to the state-of-the-art systems with a single model on VQA, GQA and NLVR2. The results of both VQA and GQA are reported on the "test-dev" split (used for validation on the official server) and the "test-std" split (used for maintaining the public leaderboard). The NLVR2 results are reported on the local dev set ("dev") and the public test set ("test-p"). The results of baselines except LXMERT are obtained from prior work.]

Experiments on Vision-Language Tasks

Implementation
Different from language tasks, the input sequence in the vision-language domain consists of visual region features paired with textual words. In this scenario, we build CAPT on LXMERT (Tan and Bansal, 2019), a representative cross-modal representation model that separately encodes visual and textual features and then introduces cross-modal layers to integrate them. As in Section 3.1, we construct the corrupted version x̂ of the original input x by masking part of the visual features or textual words. In addition to the proposed CAPT, which learns sequence-level representations, following (Tan and Bansal, 2019), we also adopt three other pre-training tasks to learn more fine-grained word/region-level representations: masked language modeling (MLM), which predicts the masked words based on the corrupted input x̂; masked region modeling (MRM), which predicts the masked visual region objects based on the corrupted input x̂; and image-text matching (ITM), which predicts whether the input word sequence semantically matches the visual features. Due to space limitations, we do not elaborate on the detailed model architecture and these pre-training tasks here; we refer readers to (Tan and Bansal, 2019) for the details. We also set the queue size to 8192, and the final training loss for visual-linguistic CAPT is defined as the sum of all the above training objectives. We use the preprocessed data provided by (Tan and Bansal, 2019), which mainly includes COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017). Only the 36 objects detected by Faster R-CNN are kept for each image. The model architecture is the same as (Tan and Bansal, 2019), consisting of 9 language layers, 5 vision layers, and 5 cross-attention layers, with 768 hidden size. We employ the Adam optimizer with a peak learning rate of 1e-4 paired with linear warmup and decay. The batch size and dropout are set to 512 and 0.1, respectively.

Evaluation Tasks
We perform the evaluation on three benchmark tasks: VQA, GQA, and NLVR2. VQA (Goyal et al., 2017) aims to select the correct answer based on both the question and its paired image, while GQA (Hudson and Manning, 2019) shares the same task setting but requires more reasoning. The goal of NLVR2 (Suhr et al., 2019) is to predict whether a statement correctly describes two images. All three tasks use accuracy (Acc) as the evaluation metric.
For VQA and GQA, we add extra multi-layer perceptrons (MLPs) that take the representation of [CLS] as input to perform classification. Since each instance in NLVR2 is composed of two images (v_1, v_2) and a sentence s, we use the representation model to encode (v_1, s) and (v_2, s), respectively. Then, a similar MLP takes the concatenation of the [CLS] representations of both (v_1, s) and (v_2, s) as input to perform classification. We adopt the same fine-tuning hyper-parameters as LXMERT to make a fair comparison.

Results

Table 3 presents the comparison between our approach and several representative systems on the vision-language tasks. We observe a consistent performance boost for CAPT on all three tasks. For instance, it yields a 0.6% gain over the base architecture LXMERT on GQA and also surpasses various baselines on the two other tasks. Such improvements indicate the enhanced capability of CAPT to learn noise invariant representations as well as to capture the joint semantic representation of the input image-text pair. It is also worth noting that the performance gain of our approach on GQA is greater than that on VQA. The reason may be that GQA pays more attention to visual reasoning, which imposes higher demands on the modeling of the joint semantics of visual-textual information. Correspondingly, our approach excels at modeling the global semantics of the input by introducing more effective sentence-level supervision, thereby attaining better results. Different from VQA and GQA, the goal of NLVR2 is to determine whether a language caption is true about a pair of images. The increased accuracy on this task demonstrates the universal efficacy of CAPT under a variety of task settings.
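The NLVR2 head described above can be sketched as follows, with `encode` and `mlp` as hypothetical stand-ins for the pre-trained encoder (returning the [CLS] representation) and the task-specific MLP:

```python
def nlvr2_logits(encode, mlp, v1, v2, s):
    """NLVR2 head sketch: encode each (image, sentence) pair separately,
    concatenate the two [CLS] representations, and classify with an MLP.
    `encode` and `mlp` are illustrative stand-ins, not the real modules."""
    cls1 = encode(v1, s)      # [CLS] representation of (v1, s)
    cls2 = encode(v2, s)      # [CLS] representation of (v2, s)
    return mlp(cls1 + cls2)   # '+' here concatenates the two vectors
```

A usage example with trivial stub functions: `nlvr2_logits(lambda v, s: [float(len(v)), float(len(s))], sum, [1, 2], [3], "ab")` reduces the concatenated representation with `sum`.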

Ablation Study
We conduct an ablation study to verify the effectiveness of the adaptive temperature and memory queue proposed in Section 2.2. As illustrated in Figure 3 (Left), with the number of negative samples fixed, the adaptive temperature exhibits consistent superiority over a manually tuned constant temperature. Our elaborately designed inverted triangle schedule for the temperature allows the self-adjustment of the gradients at different stages of contrastive pre-training, leading to a significant gain in model performance. Figure 3 (Left) also demonstrates that the GLUE score of the model consistently increases with the number of stored negative representations, and that the absence of the memory queue (corresponding to the size of 128 in the figure) results in a considerable degradation in performance. As depicted in prior work (Chen et al., 2020), large-scale negative samples can help the model capture distinctive high-level information about the inputs, thereby enhancing its capacity for feature extraction.

Analysis of Corruption Noise
To gain further insight into the influence of different corruption methods, we conduct a deeper analysis by constructing x̂ by means of shuffling noise. For implementation, we apply a random permutation to the input sequence x within a fixed window k to construct its corrupted version x̂. Following previous work (Lample et al.; Wang et al., 2019b), we set the window k to 3. Figure 3 (Right) presents the GLUE average score on the dev set when the corrupted sequence x̂ is constructed by shuffling or masking part of the tokens. It suggests that both masking and shuffling contribute to improved performance over the baseline, which does not apply any corruption noise when computing the contrastive loss. The results also reveal the superiority of the masking operation over the shuffling operation. We speculate that the reason behind this phenomenon may be twofold. In the first place, since the baseline model is pre-trained with masked language modeling alone, the learned representations are covariant only with the masking noise. Thus, using the masking noise when applying CAPT to the baseline model makes more sense than using the shuffling noise.
The other reason may lie in the observation that masking endows the model with the ability to "associate" words, while shuffling only allows the model to learn to "reorder" words. In other words, masking is likely to bring more informative supervision signals than shuffling when learning language representations. Based on the analysis above, we opt for masking as the noise when implementing the final version of CAPT.
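One simple reading of the windowed shuffling noise is to permute tokens inside consecutive windows of size k. Note this is our own sketch of one plausible implementation; formulations such as Lample et al.'s instead bound how far each token may move:

```python
import random

def shuffle_within_windows(tokens, k=3, seed=0):
    """Construct x-hat by randomly permuting tokens inside consecutive
    windows of size k (k = 3 following the setting in the text)."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(tokens), k):
        window = tokens[start:start + k]  # last window may be shorter
        rng.shuffle(window)
        out.extend(window)
    return out
```

The corruption preserves the token multiset and length, only perturbing local order, which matches the intuition that shuffling teaches the model to "reorder" rather than "associate" words.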

Alleviation of Pretrain-finetune Discrepancy
In order to verify that the proposed CAPT can effectively alleviate the pretrain-finetune discrepancy induced by the noise of pre-training, we plot validation curves of the accuracy of both CAPT and the base architecture LXMERT on GQA. Figure 3 (Center) presents the corresponding results, illustrating that the self-supervised model pre-trained by CAPT not only exhibits increased fine-tuning speed but also obtains better performance. As shown in Figure 3 (Center), in the early stage of fine-tuning (the first 1K steps), the proposed CAPT consistently maintains an absolute lead of 5%-10% over LXMERT. This demonstrates that the representations learned by CAPT are more applicable to the data distribution in downstream tasks, rendering model transfer more effective. By narrowing the differences between the representations of the original sequence and its corrupted version, the model is encouraged to learn noise invariant sequence representations. In this way, the pre-trained representation model is congruous with the noiseless inputs of downstream scenarios, leading to better model performance.

Related Work
Pre-trained Language Representations. This line of research strives to build linguistic representations benefiting various downstream tasks. One branch focuses on autoregressive (AR) pre-training, while the other centers on denoising autoencoding (DAE). Representative work on AR pre-training includes ELMo (Peters et al., 2018) and GPT (Radford, 2018), which aim to predict the next word based on previous tokens but lack the modeling of bidirectional context. The other branch is built upon DAE, which strives to reconstruct the original sequence from the corrupted input by jointly attending to both the left and right context; main efforts focus on token-level pre-training tasks. However, DAE introduces noise during pre-training that is discarded on downstream tasks, and is thus prone to learning representations covariant with the input noise, leading to the pretrain-finetune discrepancy. XLNet takes a step towards solving this problem via permutation language modeling (PLM), but leaves its own limitation: each token does not know the "position" of future tokens in the permuted sentence. Therefore, XLNet also brings a discrepancy between pre-training and fine-tuning. Since CAPT can be adapted to models with arbitrary noise transformations of the input, it can also be built on the shuffling noise of PLM. Besides, most AR- and DAE-based pre-training tasks neglect the modeling of the global semantics of the input. Some DAE-based approaches address this problem by incorporating supervision regarding the entire segment through sentence-level tasks (e.g. next or adjacent sentence prediction (Devlin et al., 2019; Wang et al., 2019b)). However, such training relies heavily on the relative position of segments, which suffers from excessively loose semantic connections and thus tends to result in confusing gradient signals. In comparison, CAPT encourages the semantic consistency of the original sequence and its corrupted version via an unsupervised contrastive loss.
This not only alleviates the pretrain-finetune discrepancy, but also better captures the global semantics of the input.
Pre-trained Vision-language Representations. This direction attempts to build generic representation models for vision-language tasks. In terms of model architecture, one research line focuses on one-stream BERT-based architectures, which strive to learn generic image-text representations with a unified model; representative work includes VideoBERT (Sun et al., 2019), VisualBERT (Li et al., 2019b), UNITER, Unicoder-VL (Li et al., 2019a), etc. In contrast, the other line, such as ViLBERT and LXMERT (Tan and Bansal, 2019), focuses on two-stream architectures, which first separately encode visual and textual features and then let them interact in co-attention layers. As for pre-training tasks, different work exhibits commonalities, all focusing on MRM, MLM, and several specific tasks (e.g. ITM). However, most of these tasks are prone to learning noise covariant representations in the pre-training stage. Compared with these endeavors, our CAPT helps the pre-trained model learn noise invariant vision-language representations via an elaborately designed semantic contrastive loss, thereby bringing better model performance.
Contrastive Learning. It serves as an unsupervised objective whose main idea is to construct or collect pairs of related (similar) data as positive samples and pairs of unrelated data as negative samples, and then learn to distinguish them via a contrastive loss. For example, the positive samples can be nodes connected by the same edge in graph representation learning (Bordes et al., 2013; Grover and Leskovec, 2016; Velickovic et al., 2019), images processed by pretext tasks in image representation learning (Wu et al., 2018; Ye et al., 2019), and bilingual sentence pairs in cross-lingual pre-training (Chi et al., 2020). The contrastive loss can come in several forms, including noise contrastive estimation (Gutmann and Hyvärinen, 2010; Oord et al., 2018), instance-wise classification (Wu et al., 2018), etc. Inspired by these works, we adapt contrastive learning to the natural language and vision-language domains to learn noise invariant sequence representations, demonstrating its effectiveness in improving various pre-trained models.

Conclusion
This work presents contrastive pre-training for learning denoised sequence representations in a self-supervised manner. By enhancing the consistency between the representations of the original sequence and its corrupted version, the pre-trained model is encouraged to learn noise invariant sequence representations. On this account, the proposed approach not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also better captures the global semantics of the input via more effective sentence-level supervision. Extensive experiments demonstrate the effectiveness and versatility of our approach, which achieves consistent improvements over baselines in both the language and vision-language domains.

Broader Impact
This section highlights the potential impact of this work, detailed as follows.
Beneficiaries of this work and detailed benefits. Both academic researchers and industrial engineers engaged in language-related tasks can benefit from our work. Pre-trained self-supervised representation models have become a research mainstream in the field of language, and plenty of effort has been devoted to promoting relevant research. However, existing systems are prone to learning noise-covariant representations and cannot effectively capture the global semantic representation of the input, which results in sub-optimal performance on downstream tasks. Our proposed CAPT not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing the global semantics of the input via more effective sentence-level supervision. This not only provides helpful insight for advancing related research in this field, but also offers an effective means for engineers to improve industrial systems and user experience.
Consequences of system failure. Existing deep learning systems, especially pre-trained representation models, require large-scale computation. We will spare no effort to provide the community with detailed instructions on this work. However, pre-training may suffer from the risk of divergence due to unexpected reasons such as incorrect hyper-parameter settings. In such unfortunate situations, the wasted computing resources may result in unnecessary environmental costs.
Potential adverse effects. As an academic research paper, our work aims at providing the community with more effective tools and helpful insights. However, every technology is double-edged, and the key factor determining the outcome lies with the individual who wields the tool, not the tool itself. As a representative deep learning paradigm, our approach shares the risks of all deep learning models: when employed improperly, it may cause adverse effects on society. For instance, almost all pre-trained models (including our approach) can be used to generate fake text. Nevertheless, we still want to emphasize that the technology itself is neutral.