Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization

Alex Mei, Anisha Kabir, Rukmini Bapat, John Judge, Tony Sun, William Yang Wang


Abstract
Neural text summarization has shown great potential in recent years. However, current state-of-the-art summarization models are limited by their maximum input length, posing a challenge to summarizing longer texts comprehensively. As part of a layered summarization architecture, we introduce PureText, a simple yet effective pre-processing layer that removes low-quality sentences from articles to improve existing summarization models. When evaluated on popular datasets like WikiHow and Reddit TIFU, we show up to 3.84 and 8.57 point ROUGE-1 absolute improvement on the full test set and the long article subset, respectively, for state-of-the-art summarization models such as BertSum and BART. Our approach provides downstream models with higher-quality sentences for summarization, improving overall model performance, especially on long text articles.
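To make the idea concrete, the pre-processing layer described above can be sketched as a sentence filter that scores each sentence and keeps only the highest-scoring ones before the text reaches a length-limited summarizer. The frequency-based salience heuristic below is an illustrative stand-in only; it is not the precision-driven scoring model the paper actually trains, and the function names are hypothetical.

```python
# Illustrative sketch of a sentence-filtering pre-processing step.
# The scoring heuristic (content-word frequency) is a stand-in, NOT the
# precision-driven model from the paper.
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., !, ? boundaries."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def filter_sentences(text, keep_ratio=0.5):
    """Drop low-scoring sentences so a downstream summarizer sees a
    shorter, higher-quality input."""
    sentences = split_sentences(text)
    # Word-frequency salience: sentences whose tokens occur often in the
    # article score higher (illustrative heuristic only).
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)
    keep = max(1, int(len(sentences) * keep_ratio))
    kept = set(ranked[:keep])
    # Preserve the original sentence order of the survivors.
    return " ".join(s for s in sentences if s in kept)
```

In a layered architecture, the output of `filter_sentences` would then be passed to an off-the-shelf summarizer (e.g., BertSum or BART) whose input-length limit the filtered text is now more likely to fit.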
Anthology ID:
2022.lrec-1.33
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
313–318
URL:
https://aclanthology.org/2022.lrec-1.33
Cite (ACL):
Alex Mei, Anisha Kabir, Rukmini Bapat, John Judge, Tony Sun, and William Yang Wang. 2022. Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 313–318, Marseille, France. European Language Resources Association.
Cite (Informal):
Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization (Mei et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.33.pdf
Data
Reddit TIFU, WikiHow