Logical Layout Analysis Applied to Historical Newspapers

Nicolas Gutehrlé, Iana Atanassova


Abstract
In recent years, libraries and archives have led major digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available in XML ALTO format, they lack information about their logical structure. In this paper, we address the problem of logical layout analysis applied to historical documents. We propose a method based on the study of a dataset in order to identify rules that assign logical labels to both blocks and lines of text in XML ALTO documents. Our dataset contains French-language newspapers published in the first half of the 20th century. The evaluation shows that our methodology performs well for the identification of first lines of paragraphs and text lines, with F1 above 0.9. The identification of titles obtains an F1 of 0.64. This method can be applied to preprocess XML ALTO documents in preparation for downstream tasks, and also to annotate large-scale datasets for training machine learning and deep learning algorithms.
Anthology ID:
2021.nlp4dh-1.10
Volume:
Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Month:
December
Year:
2021
Address:
NIT Silchar, India
Editors:
Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
Venue:
NLP4DH
Publisher:
NLP Association of India (NLPAI)
Pages:
85–94
URL:
https://aclanthology.org/2021.nlp4dh-1.10
Cite (ACL):
Nicolas Gutehrlé and Iana Atanassova. 2021. Logical Layout Analysis Applied to Historical Newspapers. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, pages 85–94, NIT Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
Logical Layout Analysis Applied to Historical Newspapers (Gutehrlé & Atanassova, NLP4DH 2021)
PDF:
https://aclanthology.org/2021.nlp4dh-1.10.pdf