DNLP@FinTOC’20: Table of Contents Detection in Financial Documents

Dijana Kosmajac, Stacey Taylor, Mozhgan Saeidi


Abstract
Title detection and table-of-contents generation are important components in detecting document structure. Together they provide the skeleton of a document, giving users an understanding of its organization, the relevance of its information, and where to find that information. Here, we show that using Tesseract with Levenshtein distance, together with a feature set inspired by Alk et al., we were able to classify titles with an F1 measure of 0.73 and 0.87, and the table of contents with a harmonic mean of 0.36 and 0.39, in English and French respectively. Our methodology works with both PDF and scanned documents, giving it a wide range of applicability within the document engineering and storage domains.
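The abstract names Levenshtein distance without describing how it is applied. As a rough illustration only, the sketch below computes the standard edit distance and a length-normalized similarity, which is one plausible way to compare a noisy OCR line against a candidate title. The function names and the normalization are assumptions for illustration, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: map edit distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Example: an OCR line with one misread character versus a clean title.
ocr_line = "Investment 0bjectives"
title = "Investment Objectives"
print(levenshtein(ocr_line, title))
print(round(normalized_similarity(ocr_line, title), 3))
```

A normalized score makes it possible to apply a single threshold to headings of different lengths; any such threshold would need to be tuned on the data.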
Anthology ID:
2020.fnp-1.29
Volume:
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Dr Mahmoud El-Haj, Dr Vasiliki Athanasakou, Dr Sira Ferradans, Dr Catherine Salzedo, Dr Ans Elhag, Dr Houda Bouamor, Dr Marina Litvak, Dr Paul Rayson, Dr George Giannakopoulos, Nikiforos Pittaras
Venue:
FNP
Publisher:
COLING
Pages:
169–173
URL:
https://aclanthology.org/2020.fnp-1.29
Cite (ACL):
Dijana Kosmajac, Stacey Taylor, and Mozhgan Saeidi. 2020. DNLP@FinTOC’20: Table of Contents Detection in Financial Documents. In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, pages 169–173, Barcelona, Spain (Online). COLING.
Cite (Informal):
DNLP@FinTOC’20: Table of Contents Detection in Financial Documents (Kosmajac et al., FNP 2020)
PDF:
https://aclanthology.org/2020.fnp-1.29.pdf