From Documents to Segments: A Contextual Reformulation for Topic Assignment

Hoonsang Yoon; Takyoung Kim; Wonkee Lee; Ilmin Cho; Dilek Hakkani-Tür; Stanley Jungkyu Choi

Abstract

Traditional topic modeling treats each document as a single, topically coherent unit, which can cause topic contamination when a document covers multiple topics. This is especially problematic when stakeholders want to identify documents that focus on a specific topic. We introduce segment-based topic allocation (SBTA), a novel paradigm that redefines topic assignment at the level of segments: coherent textual spans that convey distinct topical content. This finer granularity improves topic purity, interpretability, and applicability to multi-theme corpora such as reviews or survey responses. To support this paradigm, we construct SemEval-STM, a benchmark derived from aspect-based sentiment datasets, in which segments are automatically extracted using large language models (LLMs) and post-processed under human supervision. We further propose the segment intrusion task (SIT), a novel evaluation method that extends word intrusion to the span level and enables human-centric assessment of topical coherence. Empirical results across diverse metrics and models demonstrate that SBTA significantly outperforms traditional document-based methods in clustering and interpretability. Our framework provides a practical and scalable solution for fine-grained topic analysis in heterogeneous text corpora.

Anthology ID: 2026.findings-acl.1278
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25586–25624
URL: https://aclanthology.org/2026.findings-acl.1278/
Cite (ACL): Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, and Stanley Jungkyu Choi. 2026. From Documents to Segments: A Contextual Reformulation for Topic Assignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25586–25624, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): From Documents to Segments: A Contextual Reformulation for Topic Assignment (Yoon et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1278.pdf
Checklist: 2026.findings-acl.1278.checklist.pdf
