CodE Alltag 2.0 — A Pseudonymized German-Language Email Corpus

Elisabeth Eder, Ulrike Krieg-Holz, Udo Hahn


Abstract
The vast amount of social communication distributed over various electronic media channels (tweets, blogs, emails, etc.), so-called user-generated content (UGC), creates entirely new opportunities for today’s NLP research. Yet, data privacy concerns implied by the unauthorized use of these text streams as a data resource are often neglected. In an attempt to reconcile the diverging needs of unconstrained raw data use and preservation of data privacy in digital communication, we here investigate the automatic recognition of privacy-sensitive stretches of text in UGC and provide an algorithmic solution for the protection of personal data via pseudonymization. Our focus is directed at the de-identification of emails, where personally identifying information refers not only to the sender but also to those people, locations, dates, and other identifiers mentioned in greetings, boilerplates and the content-carrying body of emails. We evaluate several de-identification procedures and systems on two hitherto non-anonymized German-language email corpora (CodE AlltagS+d and CodE AlltagXL), and generate fully pseudonymized versions of both (CodE Alltag 2.0) in which personally identifying information of all social actors addressed in these mails has been camouflaged (to the greatest extent possible).
Anthology ID:
2020.lrec-1.550
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4466–4477
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.550
Cite (ACL):
Elisabeth Eder, Ulrike Krieg-Holz, and Udo Hahn. 2020. CodE Alltag 2.0 — A Pseudonymized German-Language Email Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4466–4477, Marseille, France. European Language Resources Association.
Cite (Informal):
CodE Alltag 2.0 — A Pseudonymized German-Language Email Corpus (Eder et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.550.pdf