A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports

Xinyu Wang, Lin Gui, Yulan He


Abstract
Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies, spanning 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for ToC extraction, consisting of three steps: (1) constructing an initial tree of text blocks based on reading order and font sizes; (2) modelling each tree node (or text block) independently by considering its contextual information captured in a node-centric subtree; (3) modifying the original tree by taking an appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both local and long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline in a fraction of the running time. Our framework proves its scalability by effectively handling documents of any length.
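The construction and modification steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `build_initial_tree`, `apply_action`, and `outline` helpers, and the rule that a block attaches to the nearest preceding block with a strictly larger font size are all assumptions made for illustration. The modelling step, which decides the action for each node, is omitted; actions are supplied by hand here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    text: str
    font_size: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def build_initial_tree(blocks: List[Tuple[str, float]]) -> Node:
    """Build a tree from (text, font_size) blocks given in reading order.

    Assumed heuristic: each block becomes a child of the nearest
    preceding block whose font size is strictly larger.
    """
    root = Node("<root>", float("inf"))  # sentinel, never popped
    stack = [root]
    for text, size in blocks:
        node = Node(text, size)
        while stack[-1].font_size <= size:
            stack.pop()
        node.parent = stack[-1]
        stack[-1].children.append(node)
        stack.append(node)
    return root


def apply_action(node: Node, action: str, new_parent: Optional[Node] = None) -> None:
    """Apply Keep, Delete, or Move to a non-root node (hypothetical semantics)."""
    if action == "keep":
        return
    parent = node.parent
    idx = parent.children.index(node)
    if action == "delete":
        # Assumption: the children of a deleted node are lifted to its parent.
        for child in node.children:
            child.parent = parent
        parent.children[idx:idx + 1] = node.children
        node.children = []
        node.parent = None
    elif action == "move":
        # Assumption: a moved node is appended as the last child of new_parent.
        parent.children.pop(idx)
        node.parent = new_parent
        new_parent.children.append(node)
    else:
        raise ValueError(f"unknown action: {action}")


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the subtree below `node` as indented lines."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines
```

Because each action touches only one node and its immediate neighbours, a single pass over the nodes suffices in this sketch, which is in the spirit of the abstract's point that no pairwise modelling of headings is required.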
Anthology ID:
2023.emnlp-main.816
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13215–13229
URL:
https://aclanthology.org/2023.emnlp-main.816
DOI:
10.18653/v1/2023.emnlp-main.816
Cite (ACL):
Xinyu Wang, Lin Gui, and Yulan He. 2023. A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13215–13229, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports (Wang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.816.pdf
Video:
https://aclanthology.org/2023.emnlp-main.816.mp4