LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization

Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano


Abstract
Text summarization is a popular task and an active area of research for the Natural Language Processing community. By definition, it requires accounting for long input texts, a characteristic that poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend popular existing English datasets (arXiv and PubMed) with layout information and propose four novel datasets, consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines merging layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.
Anthology ID:
2023.eacl-main.46
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
636–651
URL:
https://aclanthology.org/2023.eacl-main.46
DOI:
10.18653/v1/2023.eacl-main.46
Award:
 EACL Outstanding Paper
Cite (ACL):
Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 2023. LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 636–651, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization (Nguyen et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.46.pdf
Video:
 https://aclanthology.org/2023.eacl-main.46.mp4