Language Resources for Historical Newspapers: the Impresso Collection

Maud Ehrmann, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, Raphaël Barman


Abstract
Following decades of massive digitization, an unprecedented number of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. While this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. Comprising corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the impresso resource collection aims to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.
Anthology ID:
2020.lrec-1.121
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
958–968
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.121
Cite (ACL):
Maud Ehrmann, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, and Raphaël Barman. 2020. Language Resources for Historical Newspapers: the Impresso Collection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 958–968, Marseille, France. European Language Resources Association.
Cite (Informal):
Language Resources for Historical Newspapers: the Impresso Collection (Ehrmann et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.121.pdf