Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs

Natalie Hervieux, Peiran Yao, Susan Brown, Denilson Barbosa


Abstract
The digital humanities (DH) community fundamentally embraces the use of computerized tools for the study and creation of knowledge related to language, history, culture, and human values, in which natural language plays a prominent role. Many successful DH tools rely heavily on Natural Language Processing (NLP) methods, and several efforts exist within the DH community to promote the use of newer and better tools. Nevertheless, most NLP research is driven by web corpora that are noticeably different from texts commonly found in DH artifacts, which tend to use richer language and refer to rarer entities. Thus, the near-human performance achieved by state-of-the-art NLP tools on web texts might not be achievable on DH texts. We introduce a dataset, carefully created by computer scientists and digital humanists, that is intended to serve as a reference point for the development and evaluation of NLP tools. The dataset is a subset of a born-digital textbase resulting from a prominent and ongoing experiment in digital literary history, containing thousands of multi-sentence excerpts that are suited for information extraction tasks. We fully describe the dataset and show that its language is demonstrably different from that of the corpora normally used to train language resources in the NLP community.
Anthology ID:
2024.nlp4dh-1.9
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
85–104
URL:
https://aclanthology.org/2024.nlp4dh-1.9
Cite (ACL):
Natalie Hervieux, Peiran Yao, Susan Brown, and Denilson Barbosa. 2024. Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 85–104, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs (Hervieux et al., NLP4DH 2024)
PDF:
https://aclanthology.org/2024.nlp4dh-1.9.pdf