Taxy.io@FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning

Frederic Haase, Steffen Kirchhoff


Abstract
In this paper we describe our system submitted to the FinTOC-2020 shared task on financial document structure extraction. We propose a two-step approach to identify titles in financial documents and to extract their table of contents (TOC). First, we identify text blocks as candidates for titles using unsupervised learning based on character-level information of each document. Then, we apply supervised learning on a self-constructed regression task to predict the depth of each text block in the document structure hierarchy using transfer learning combined with document features and layout features. It is noteworthy that our single multilingual model performs well on both tasks and on different languages, which indicates the usefulness of transfer learning for title detection and TOC generation. Moreover, our approach is independent of the presence of actual TOC pages in the documents. It is also one of the few submissions to the FinTOC-2020 shared task addressing both subtasks in both languages, English and French, with one single model.
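The second step described in the abstract yields a real-valued depth for each title candidate. One way such predictions could be turned into a table of contents is sketched below. This is only an illustrative post-processing step, not the authors' implementation: the function name build_toc, the rounding rule, the max_depth parameter, and the constraint that a title may sit at most one level below its predecessor are all assumptions introduced here.

```python
from typing import List, Tuple


def build_toc(blocks: List[Tuple[str, float]], max_depth: int = 5) -> List[Tuple[int, str]]:
    """Convert (title_text, predicted_depth) pairs into (level, title) entries.

    Hypothetical post-processing, not taken from the paper: each regression
    output is rounded to an integer level, clamped to [1, max_depth], and
    limited to at most one level deeper than the preceding title.
    """
    toc: List[Tuple[int, str]] = []
    prev_level = 0  # no title seen yet, so the first entry is forced to level 1
    for text, depth in blocks:
        level = min(max(1, round(depth)), max_depth)  # clamp to a valid level
        level = min(level, prev_level + 1)            # forbid skipping levels downward
        toc.append((level, text))
        prev_level = level
    return toc


if __name__ == "__main__":
    # Made-up titles and depth predictions, for illustration only.
    candidates = [
        ("Introduction", 1.1),
        ("Fund Overview", 1.9),
        ("Fees", 3.8),
        ("Risks", 0.7),
    ]
    for level, title in build_toc(candidates):
        print("  " * (level - 1) + title)
```

Framing depth as a regression target keeps the ordering between levels, which a plain classification label would discard. The clamping step is one simple way to keep the resulting hierarchy well formed.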
Anthology ID:
2020.fnp-1.28
Volume:
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Dr Mahmoud El-Haj, Dr Vasiliki Athanasakou, Dr Sira Ferradans, Dr Catherine Salzedo, Dr Ans Elhag, Dr Houda Bouamor, Dr Marina Litvak, Dr Paul Rayson, Dr George Giannakopoulos, Nikiforos Pittaras
Venue:
FNP
Publisher:
COLING
Pages:
163–168
URL:
https://aclanthology.org/2020.fnp-1.28
Cite (ACL):
Frederic Haase and Steffen Kirchhoff. 2020. Taxy.io@FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning. In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, pages 163–168, Barcelona, Spain (Online). COLING.
Cite (Informal):
Taxy.io@FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning (Haase & Kirchhoff, FNP 2020)
PDF:
https://aclanthology.org/2020.fnp-1.28.pdf