Antti Kanner


2024

pdf bib
Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
Amanda Myntti | Liina Repo | Elian Freyermuth | Antti Kanner | Veronika Laippala | Erik Henriksson
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.

2023

pdf bib
Detection and attribution of quotes in Finnish news media: BERT vs. rule-based approach
Maciej Janicki | Antti Kanner | Eetu Mäkelä
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We approach the problem of recognition and attribution of quotes in Finnish news media. Solving this task would create possibilities for large-scale analysis of media wrt. the presence and styles of presentation of different voices and opinions. We describe the annotation of a corpus of media texts, numbering around 1500 articles, with quote attribution and coreference information. Further, we compare two methods for automatic quote recognition: a rule-based one operating on dependency trees and a machine learning one built on top of the BERT language model. We conclude that BERT provides more promising results even with little training data, achieving 95% F-score on direct quote recognition and 84% for indirect quotes. Finally, we discuss open problems and further associated tasks, especially the necessity of resolving speaker mentions to entity references.