Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

Amanda Myntti, Liina Repo, Elian Freyermuth, Antti Kanner, Veronika Laippala, Erik Henriksson


Abstract
Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics & Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
Anthology ID:
2024.nlp4dh-1.38
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
386–397
URL:
https://aclanthology.org/2024.nlp4dh-1.38
Cite (ACL):
Amanda Myntti, Liina Repo, Elian Freyermuth, Antti Kanner, Veronika Laippala, and Erik Henriksson. 2024. Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 386–397, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora (Myntti et al., NLP4DH 2024)
PDF:
https://aclanthology.org/2024.nlp4dh-1.38.pdf