A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection

Filip Dobranić, Bojan Evkoski, Nikola Ljubešić


Abstract
Preparing historical newspaper collections is a complicated endeavour consisting of multiple steps, including imaging, layout prediction, optical character recognition, and linguistic annotation, each of which has to be carefully adapted to the specific content in question. To address the high costs associated with the process, we present a lightweight approach to producing high-quality corpora and apply it to a massive collection of Slovenian historical newspapers from the 18th, 19th and 20th centuries, resulting in a billion-word giga-corpus. We start with noisy OCR-ed data produced by the National and University Library of Slovenia using different technologies in different periods. To address the inherent variability in the quality of textual data, a challenge commonly encountered in digital libraries globally, we perform a targeted post-digitisation correction procedure, coupled with a robust curation mechanism for noisy texts via language model inference. Subsequently, we subject the corrected and filtered output to comprehensive linguistic annotation, enriching the corpus with part-of-speech tags, lemmas, and named entity labels. Finally, we perform an analysis through topic modeling at the noun lemma level, along with a frequency analysis of the named entities, to confirm the viability of our corpus preparation method.
Anthology ID:
2024.lrec-main.61
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
695–703
URL:
https://aclanthology.org/2024.lrec-main.61
Cite (ACL):
Filip Dobranić, Bojan Evkoski, and Nikola Ljubešić. 2024. A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 695–703, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection (Dobranić et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.61.pdf