Emil.RuleZ! – An exploratory pilot study of handling a real-life longitudinal email archive

Balázs Indig, Luca Horváth, Dorottya Henrietta Szemigán, Mihály Nagy


Abstract
An entire generation that predominantly used email for official communication throughout their lives is about to leave behind a significant amount of preservable digital heritage. Memory institutions in the USA (e.g. Internet Archive, Stanford University Library) recognised the need for preservation early on. As a result, available solutions focus on English-language public archives and neglect two problems: multiple languages with different encodings within a single archive, and the heterogeneity of email standards, which have changed considerably since their first form in the 1970s. Since online services enable the convenient creation of email archives in MBOX format, it is important to evaluate how existing tools handle non-homogeneous longitudinal archives containing diverse states of email standards, as opposed to the monolingual public mailing lists that are often archived, and how such data can be made ready for research. We use distant reading methods on a real-life archive, the legacy of a deceased individual containing 11,245 emails from 2010 to 2023 in multiple languages and encodings, and demonstrate how existing available tools can be surpassed. Our goal is to enhance the homogeneity of the data and make it accessible to researchers in a queryable database format. We utilise rule-based methods and GPT-3.5 to extract the cleanest form of our data.
Anthology ID:
2023.nlp4dh-1.21
Volume:
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2023
Address:
Tokyo, Japan
Editors:
Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Venues:
NLP4DH | IWCLUL
Publisher:
Association for Computational Linguistics
Pages:
172–178
URL:
https://aclanthology.org/2023.nlp4dh-1.21
Cite (ACL):
Balázs Indig, Luca Horváth, Dorottya Henrietta Szemigán, and Mihály Nagy. 2023. Emil.RuleZ! – An exploratory pilot study of handling a real-life longitudinal email archive. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 172–178, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Emil.RuleZ! – An exploratory pilot study of handling a real-life longitudinal email archive (Indig et al., NLP4DH-IWCLUL 2023)
PDF:
https://aclanthology.org/2023.nlp4dh-1.21.pdf