Measuring the Effects of Bias in Training Data for Literary Classification

Sunyam Bagga, Andrew Piper


Abstract
Downstream effects of biased training data have become a major concern of the NLP community. How such bias may affect the automated curation and annotation of cultural heritage material is currently not well known. In this work, we create an experimental framework to measure the effects of different types of stylistic and social bias within training data for the purposes of literary classification, one important subclass of cultural material. Because historical collections are often sparsely annotated, just as our knowledge of history is incomplete, researchers often cannot know the underlying distributions of different document types and their various sub-classes. This means that bias is likely to be an intrinsic feature of training data when it comes to cultural heritage material. Our aim in this study is to investigate which classification methods may help mitigate the effects of different types of bias within curated samples of training data. We find that machine learning techniques such as BERT or SVM are robust against reproducing the different kinds of bias within our test data, except in the most extreme cases. We hope that this work will spur further research into the potential effects of bias within training data for other cultural heritage material beyond the study of literature.
Anthology ID:
2020.latechclfl-1.9
Volume:
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
December
Year:
2020
Address:
Online
Venues:
CLFL | COLING | LaTeCH | LaTeCHCLfL
Publisher:
International Committee on Computational Linguistics
Pages:
74–84
URL:
https://aclanthology.org/2020.latechclfl-1.9
Cite (ACL):
Sunyam Bagga and Andrew Piper. 2020. Measuring the Effects of Bias in Training Data for Literary Classification. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 74–84, Online. International Committee on Computational Linguistics.
Cite (Informal):
Measuring the Effects of Bias in Training Data for Literary Classification (Bagga & Piper, LaTeCHCLfL 2020)
PDF:
https://aclanthology.org/2020.latechclfl-1.9.pdf
Code:
sunyam/bias-literary-classification