TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation

Sajad Sotudeh, Nazli Goharian


Abstract
Many scientific papers, such as those in the arXiv and PubMed collections, have abstracts ranging from 50 to 1,000 words, with an average length of approximately 200 words; longer abstracts typically convey more information about the source paper. Until recently, scientific summarization research has focused on generating short, abstract-like summaries, following the existing datasets used for scientific summarization. In domains where the source text is relatively long-form, such as scientific documents, such summaries cannot go beyond a general, coarse overview to provide salient information from the source document. Recent interest in tackling this problem motivated the curation of two scientific datasets, arXiv-Long and PubMed-Long, containing human-written summaries of 400-600 words, thus providing a venue for research on generating long/extended summaries. Extended summaries facilitate a faster read while providing details beyond coarse information. In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information. Evaluations on two existing large-scale extended summarization datasets indicate statistically significant improvements in ROUGE and average ROUGE (F1) scores (except in one case) over strong baselines and the state of the art. Comprehensive human evaluations favor our generated extended summaries in terms of cohesion and completeness.
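To make the abstract's core idea concrete, the sketch below illustrates intro-guided extractive selection: scoring body sentences by their similarity to the introduction and extracting the best ones up to an extended-summary word budget. This is a minimal, hypothetical TF-IDF approximation, not the authors' TSTR model; all function names and parameters here are assumptions for illustration only.

    # Hypothetical sketch of intro-guided extractive summarization.
    # NOT the TSTR model from the paper; it only illustrates the idea of
    # using a paper's introduction as pointers to salient body sentences.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def intro_guided_extract(intro_sents, body_sents, budget=500):
        """Select body sentences most similar to the introduction, up to
        a rough word budget (extended summaries are ~400-600 words)."""
        vectorizer = TfidfVectorizer(stop_words="english")
        vecs = vectorizer.fit_transform(intro_sents + body_sents)
        intro_vecs = vecs[: len(intro_sents)]
        body_vecs = vecs[len(intro_sents):]
        # Score each body sentence by its max similarity to any intro sentence.
        scores = cosine_similarity(body_vecs, intro_vecs).max(axis=1)
        ranked = sorted(range(len(body_sents)),
                        key=lambda i: scores[i], reverse=True)
        chosen, words = [], 0
        for i in ranked:
            if words >= budget:
                break
            chosen.append(i)
            words += len(body_sents[i].split())
        # Restore document order so the extract reads coherently.
        return [body_sents[i] for i in sorted(chosen)]

The word budget reflects the 400-600 word target of the arXiv-Long and PubMed-Long datasets; a neural model such as TSTR would learn this selection rather than rely on fixed TF-IDF similarity.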
Anthology ID:
2022.naacl-main.25
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
325–335
URL:
https://aclanthology.org/2022.naacl-main.25
DOI:
10.18653/v1/2022.naacl-main.25
Cite (ACL):
Sajad Sotudeh and Nazli Goharian. 2022. TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 325–335, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation (Sotudeh & Goharian, NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.25.pdf
Video:
https://aclanthology.org/2022.naacl-main.25.mp4
Code:
georgetown-ir-lab/tstrsum (+ additional community code)