NLP for Digital Humanities: Processing Chronological Text Corpora

Adam Pawłowski, Tomasz Walkowiak


Abstract
This paper examines the use of Natural Language Processing (NLP) techniques to analyze large chronological text corpora. The research underscores the synergy between humanistic inquiry and computational methods, especially in the processing and analysis of sequential textual data known as lexical series. It introduces a reference workflow for chronological corpus analysis and outlines the methods applicable to the ChronoPress corpus, a dataset covering 22 years of Polish press from 1945 to 1966. The study demonstrates the potential of this approach for uncovering cultural and historical patterns through the analysis of lexical series. The findings highlight both the challenges and the opportunities of lexical series analysis within Digital Humanities, emphasizing the need for advanced data filtering and anomaly detection algorithms to manage the large and intricate datasets characteristic of this field.
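To make the notion of a lexical series concrete, the following minimal Python sketch builds a monthly relative-frequency series for one term from dated texts. It is an illustration only, not the authors' workflow: the function name `lexical_series`, the toy documents, the monthly granularity, and the whitespace tokenization are all assumptions made here, and a real pipeline for Polish would likely need lemmatization.

```python
from collections import Counter
from datetime import date


def lexical_series(docs, term):
    """Return {(year, month): relative frequency of `term`}, in chronological order.

    `docs` is an iterable of (publication_date, text) pairs.
    Tokenization is naive lowercase whitespace splitting (an assumption).
    """
    hits = Counter()    # occurrences of the term per month
    totals = Counter()  # total token count per month
    needle = term.lower()
    for published, text in docs:
        tokens = text.lower().split()
        key = (published.year, published.month)
        totals[key] += len(tokens)
        hits[key] += tokens.count(needle)
    # Skip months with no tokens to avoid division by zero.
    return {k: hits[k] / totals[k] for k in sorted(totals) if totals[k] > 0}


# Hypothetical toy data, not taken from the ChronoPress corpus.
docs = [
    (date(1945, 5, 1), "pokój pokój wojna"),
    (date(1945, 5, 20), "praca pokój"),
    (date(1945, 6, 3), "praca praca plan"),
]
series = lexical_series(docs, "pokój")
```

Normalizing by the total token count per period keeps the series comparable when the amount of text varies from month to month. Filtering or anomaly detection would then operate on the resulting sequence of values.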
Anthology ID:
2024.nlp4dh-1.10
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
105–112
URL:
https://aclanthology.org/2024.nlp4dh-1.10
Cite (ACL):
Adam Pawłowski and Tomasz Walkowiak. 2024. NLP for Digital Humanities: Processing Chronological Text Corpora. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 105–112, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
NLP for Digital Humanities: Processing Chronological Text Corpora (Pawłowski & Walkowiak, NLP4DH 2024)
PDF:
https://aclanthology.org/2024.nlp4dh-1.10.pdf