Merlin Humml
2024
Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models
Nathan Dykes | Stephanie Evert | Philipp Heinrich | Merlin Humml | Lutz Schröder
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024
We fine-tune an LLM on the results of manually designed corpus queries to identify argumentative fragments, framed as a text mining task. The resulting model outperforms both an LLM fine-tuned on a relatively large manually annotated gold standard of tweets and a rule-based approach. This proof-of-concept study demonstrates the usefulness of corpus queries for generating training data for complex text categorisation tasks, especially when the targeted category has low prevalence, so that a manually annotated gold standard contains only a small number of positive examples.