Attention Temperature Matters in Abstractive Summarization Distillation

Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, Furu Wei


Abstract
Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling based methods. Further empirical analysis shows that both the pseudo labels and the summaries produced by our student models are shorter and more abstractive.
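As a rough illustration of the idea described in the abstract, the sketch below shows scaled dot-product attention with an explicit temperature factor applied to the softmax logits. This is a minimal sketch assuming a PyTorch setting; the function name and the temperature argument are illustrative assumptions and are not taken from the authors' released code.

    import torch
    import torch.nn.functional as F

    def attention_with_temperature(query, key, value, temperature=1.0):
        """Scaled dot-product attention with an extra temperature factor.

        query, key, value: tensors of shape (batch, heads, seq_len, head_dim).
        temperature > 1 flattens the attention distribution (softer weights);
        temperature < 1 sharpens it; temperature = 1 recovers standard attention.
        """
        d_k = query.size(-1)
        # Standard Transformer attention divides the logits by sqrt(d_k);
        # here the logits are additionally divided by the temperature.
        scores = torch.matmul(query, key.transpose(-2, -1)) / (temperature * d_k ** 0.5)
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, value), weights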
Anthology ID:
2022.acl-long.11
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
127–141
URL:
https://aclanthology.org/2022.acl-long.11
DOI:
10.18653/v1/2022.acl-long.11
Cite (ACL):
Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, and Furu Wei. 2022. Attention Temperature Matters in Abstractive Summarization Distillation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 127–141, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Attention Temperature Matters in Abstractive Summarization Distillation (Zhang et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.11.pdf
Software:
2022.acl-long.11.software.zip
Code:
shengqiang-zhang/plate
Data:
CNN/Daily Mail, New York Times Annotated Corpus, XSum