Knowledge Distillation (KD) is one of the most effective approaches to deploying large-scale pre-trained language models in low-latency environments by transferring the knowledge contained in the large-scale models to smaller student models. Prior KD approaches use the soft labels and intermediate activations generated by the teacher to transfer knowledge to the student model parameters alone. In this paper, we show that having access to non-parametric memory in the form of a knowledge base with the teacher’s soft labels and predictions can further improve student generalization. To enable the student to retrieve from the knowledge base effectively, we propose a new framework and loss function that preserves the semantic similarities of teacher and student training examples. We show through extensive experiments that our retrieval mechanism can achieve state-of-the-art performance for task-specific knowledge distillation on the GLUE benchmark.
Contrastive learning has been demonstrated effective in unsupervised sentence representation learning. Given one sentence, positive pairs are obtained by passing the sentence to the encoder twice using the different dropout masks, and negative pairs are obtained by taking another sentence in the same mini-batch. However, the method suffers from the surface structure bias, i.e., sentences with similar surface structures will be regarded as close in semantics while sentences with dissimilar surface structures will be viewed as distinct in semantics. This leads to the result that paraphrasing a sentence that is dissimilar in surface structure will receive a lower semantic similarity score than inserting a negative word into the sentence. In this paper, we first verify the bias by collecting a sentence transformation testset. Then we systematically probe the existing models by proposing novel splits based on benchmark datasets in accordance with semantic and surface structure similarity. We tackle the bias in two aspects: balancing the learning target by augmenting with data that counters the bias, and meanwhile preserving word semantics by leveraging recall loss to prevent catastrophic forgetting. We evaluate our model on standard semantic textual similarity (STS) tasks using different pre-trained backbones and achieve state-of-the-art averaged performance across the STS benchmarks. Particularly, our models that are fine-tuned with RoBERTabase and RoBERTalarge achieve significantly better performance on most benchmark datasets.
Content-planning is an essential part of data-to-text generation to determine the order of data mentioned in generated texts. Recent neural data-to-text generation models employ Pointer Networks to explicitly learn content-plan given a set of attributes as input. They use LSTM to encode the input, which assumes a sequential relationship in the input. This may be sub-optimal to encode a set of attributes, where the attributes have a composite structure: the attributes are disordered while each attribute value is an ordered list of tokens. We handle this problem by proposing a neural content-planner that can capture both local and global contexts of such a structure. Specifically, we propose a novel attention mechanism called GSC-attention. A key component of the GSC-attention is grouped-attention, which is token-level attention constrained within each input attribute that enables our proposed model captures both local and global context. Moreover, our content-planner explicitly learns content-selection, which is integrated into the content-planner to select the most important data to be included in the generated text via an attention masking procedure. Experimental results show that our model outperforms the competitors by 4.92%, 4.70%, and 16.56% in terms of Damerau-Levenshtein Distance scores on three real-world datasets.