Huanru Henry Mao
2022
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
Huanru Henry Mao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention’s performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
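The abstract does not spell out the update rule or feature map, but a generic decaying fast-weight recurrence of the kind it alludes to (key-value associations written into a fixed-size state with a scalar decay, giving constant time and memory per generated token) might look like the following sketch. The function name, the scalar decay, and the dot-product read-out are illustrative assumptions, not the paper's exact formulation.

    import torch

    def decaying_fast_weight_step(S, q, k, v, decay=0.9):
        """One recurrent step of a decaying fast-weight layer (illustrative sketch).

        S     : (d_k, d_v) fast-weight state carried across time steps
        q, k  : (d_k,) query and key for the current token
        v     : (d_v,) value for the current token
        decay : scalar in (0, 1); older associations fade geometrically
        """
        S = decay * S + torch.outer(k, v)  # decay old associations, write the new one
        out = q @ S                        # read out with the current query -> (d_v,)
        return out, S

    # Per-token cost is O(d_k * d_v) regardless of sequence length,
    # unlike softmax attention, whose per-token cost grows with T.
    d_k, d_v = 64, 64
    S = torch.zeros(d_k, d_v)
    for _ in range(5):
        q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
        out, S = decaying_fast_weight_step(S, q, k, v)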
2019
Improving Neural Story Generation by Targeted Common Sense Grounding
Huanru Henry Mao | Bodhisattwa Prasad Majumder | Julian McAuley | Garrison Cottrell
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary training signals from datasets designed to provide common sense grounding. When combined with our two-stage fine-tuning pipeline, our method achieves improved common sense reasoning and state-of-the-art perplexity on the WritingPrompts (Fan et al., 2018) story generation dataset.
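As an illustration of the multi-task idea only (the abstract does not give the paper's exact loss formulation, auxiliary datasets, or weighting), a minimal sketch of combining a language-modeling loss with an auxiliary common sense grounding loss could look like this; aux_weight is a hypothetical hyperparameter.

    import torch

    def grounded_lm_loss(lm_loss, aux_loss, aux_weight=0.5):
        """Weighted sum of the language-modeling loss and an auxiliary
        common sense grounding loss; aux_weight is a hypothetical
        hyperparameter, not a value reported in the paper."""
        return lm_loss + aux_weight * aux_loss

    # Illustration with placeholder scalar losses:
    lm_loss = torch.tensor(3.2)    # e.g. cross-entropy on story generation data
    aux_loss = torch.tensor(0.8)   # e.g. loss on a common sense grounding task
    total = grounded_lm_loss(lm_loss, aux_loss)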