We introduce Self-CRItic Pretraining Transformers (SCRIPT) for representation learning of text. The popular masked language modeling (MLM) pretraining methods like BERT replace some tokens with [MASK] and an encoder is trained to recover them, while ELECTRA trains a discriminator to detect replaced tokens proposed by a generator. In contrast, we train a language model as in MLM and further derive a discriminator or critic on top of the encoder without using any additional parameters. That is, the model itself is a critic. SCRIPT combines MLM training and discriminative training for learning rich representations and compute- and sample-efficiency. We demonstrate improved sample-efficiency in pretraining and enhanced representations evidenced by improved downstream task performance on GLUE and SQuAD over strong baselines. Also, the self-critic scores can be directly used as pseudo-log-likelihood for efficient scoring.
Latent variable models for text, when trained successfully, accurately model the data distribution and capture global semantic and syntactic features of sentences. The prominent approach to train such models is variational autoencoders (VAE). It is nevertheless challenging to train and often results in a trivial local optimum where the latent variable is ignored and its posterior collapses into the prior, an issue known as posterior collapse. Various techniques have been proposed to mitigate this issue. Most of them focus on improving the inference model to yield latent codes of higher quality. The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number (e.g., 20) of Langevin dynamics steps guided by its posterior distribution. The major advantage of our method is that it does not require a separate inference model or assume simple geometry of the posterior distribution, thus rendering an automatic, natural and flexible inference engine. We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse. Analyses of the latent space show that interpolation in the latent space is able to generate coherent sentences with smooth transition and demonstrate improved classification over strong baselines with latent features from unsupervised pretraining. These results together expose a well-structured latent space of our generative model.
Inferring social relations from dialogues is vital for building emotionally intelligent robots to interpret human language better and act accordingly. We model the social network as an And-or Graph, named SocAoG, for the consistency of relations among a group and leveraging attributes as inference cues. Moreover, we formulate a sequential structure prediction task, and propose an 𝛼-𝛽-𝛾 strategy to incrementally parse SocAoG for the dynamic inference upon any incoming utterance: (i) an 𝛼 process predicting attributes and relations conditioned on the semantics of dialogues, (ii) a 𝛽 process updating the social relations based on related attributes, and (iii) a 𝛾 process updating individual’s attributes based on interpersonal social relations. Empirical results on DialogRE and MovieGraph show that our model infers social relations more accurately than the state-of-the-art methods. Moreover, the ablation study shows the three processes complement each other, and the case study demonstrates the dynamic relational inference.
Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific pretraining is often not robust. In particular, the performance considerably varies as the random seed changes or the number of pretraining and/or fine-tuning iterations varies, and the fine-tuned model is vulnerable to adversarial attack. We propose a simple yet effective adapter-based approach to mitigate these issues. Specifically, we insert small bottleneck layers (i.e., adapter) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data, with (1) task-specific unsupervised pretraining and then (2) task-specific supervised training (e.g., classification, sequence labeling). Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.