Challenging America: Modeling language in longer time scales

Jakub Pokrywka, Filip Graliński, Krzysztof Jassem, Karol Kaczmarek, Krzysztof Jurkiewicz, Piotr Wierzchon


Abstract
The aim of this paper is to apply to historical texts the methodology commonly used to solve NLP tasks defined for contemporary data, i.e., pre-training and fine-tuning large Transformer models. The paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin and paired with their textual contents retrieved by an OCR tool. Three publicly available ML tasks are defined in the challenge: determining the article date, detecting the location of the issue, and deducing a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch on the historical texts. We also discuss the issues of discrimination and hate speech present in the historical American texts.
Anthology ID:
2022.findings-naacl.56
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
737–749
URL:
https://aclanthology.org/2022.findings-naacl.56
DOI:
10.18653/v1/2022.findings-naacl.56
Cite (ACL):
Jakub Pokrywka, Filip Graliński, Krzysztof Jassem, Karol Kaczmarek, Krzysztof Jurkiewicz, and Piotr Wierzchon. 2022. Challenging America: Modeling language in longer time scales. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 737–749, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Challenging America: Modeling language in longer time scales (Pokrywka et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.56.pdf
Video:
https://aclanthology.org/2022.findings-naacl.56.mp4
Data:
GLUE, SuperGLUE