What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

Andrew Halterman; Katherine A. Keith

Abstract

Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting: conceptualizing the categories to classify and using LLM predictions in downstream statistical inference. We argue that these steps have been overlooked in much of LLM-era CSS and that LLMs can tempt analysts to skip conceptualization altogether. For example, a political scientist classifying "protest" with LLMs may never be forced to craft a definition: unlike human annotators, who would ask clarifying questions, an LLM can silently accept an underspecified concept and return plausible-looking labels. Using simulations, we show that conceptualization failures induce downstream inferential bias that cannot be corrected solely by a more accurate LLM or by post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice for pursuing low-cost, unbiased, low-variance downstream estimates.

Anthology ID: 2026.acl-long.92
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2043–2059
URL: https://aclanthology.org/2026.acl-long.92/
Cite (ACL): Andrew Halterman and Katherine A. Keith. 2026. What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2043–2059, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification (Halterman & Keith, ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.92.pdf
Checklist: 2026.acl-long.92.checklist.pdf
