Merlin Humml


2024

pdf bib
Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models
Nathan Dykes | Stephanie Evert | Philipp Heinrich | Merlin Humml | Lutz Schröder
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024

We use query results from manually designed corpus queries for fine-tuning an LLM to identify argumentative fragments as a text mining task. The resulting model outperforms both an LLM fine-tuned on a relatively large manually annotated gold standard of tweets as well as a rule-based approach. This proof-of-concept study demonstrates the usefulness of corpus queries to generate training data for complex text categorisation tasks, especially if the targeted category has low prevalence (so that a manually annotated gold standard contains only a small number of positive examples).