Distilling Knowledge Learned in BERT for Text Generation
Yen-Chun Chen | Zhe Gan | Yu Cheng | Jingzhou Liu | Jingjing Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Large-scale pre-trained language model such as BERT has achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT’s idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on IWSLT German-English and English-Vietnamese MT datasets.
Stack-Pointer Networks for Dependency Parsing
Xuezhe Ma | Zecong Hu | Jingzhou Liu | Nanyun Peng | Graham Neubig | Eduard Hovy
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce a novel architecture for dependency parsing: stack-pointer networks (StackPtr). Combining pointer networks (Vinyals et al., 2015) with an internal stack, the proposed model first reads and encodes the whole sentence, then builds the dependency tree top-down (from root-to-leaf) in a depth-first fashion. The stack tracks the status of the depth-first search and the pointer networks select one child for the word at the top of the stack at each step. The StackPtr parser benefits from the information of whole sentence and all previously derived subtree structures, and removes the left-to-right restriction in classical transition-based parsers. Yet the number of steps for building any (non-projective) parse tree is linear in the length of the sentence just as other transition-based parsers, yielding an efficient decoding algorithm with O(n2) time complexity. We evaluate our model on 29 treebanks spanning 20 languages and different dependency annotation schemas, and achieve state-of-the-art performances on 21 of them
- Xuezhe Ma 1
- Zecong Hu 1
- Nanyun Peng 1
- Graham Neubig 1
- Eduard Hovy 1
- show all...