Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models

Nathan Dykes, Stephanie Evert, Philipp Heinrich, Merlin Humml, Lutz Schröder


Abstract
We use query results from manually designed corpus queries for fine-tuning an LLM to identify argumentative fragments as a text mining task. The resulting model outperforms both an LLM fine-tuned on a relatively large manually annotated gold standard of tweets and a rule-based approach. This proof-of-concept study demonstrates the usefulness of corpus queries for generating training data for complex text categorisation tasks, especially if the targeted category has low prevalence (so that a manually annotated gold standard contains only a small number of positive examples).
Anthology ID:
2024.delite-1.7
Volume:
Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Annette Hautli-Janisz, Gabriella Lapesa, Lucas Anastasiou, Valentin Gold, Anna De Liddo, Chris Reed
Venue:
DELITE
Publisher:
ELRA and ICCL
Pages:
52–57
URL:
https://aclanthology.org/2024.delite-1.7
Cite (ACL):
Nathan Dykes, Stephanie Evert, Philipp Heinrich, Merlin Humml, and Lutz Schröder. 2024. Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models. In Proceedings of the First Workshop on Language-driven Deliberation Technology (DELITE) @ LREC-COLING 2024, pages 52–57, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Leveraging High-Precision Corpus Queries for Text Classification via Large Language Models (Dykes et al., DELITE 2024)
PDF:
https://aclanthology.org/2024.delite-1.7.pdf