A Reference-free Segmentation Quality Index (SegReFree)

Evan Lucas, Dylan Kangas, Timothy Havens


Abstract
Topic segmentation, in the context of natural language processing, is the process of finding boundaries in a sequence of sentences that separate groups of adjacent sentences at shifts in semantic meaning. Currently, the quality of a segmentation is assessed by comparing the boundaries selected by a human or algorithm to those in a known-good reference. As a result, segmentation quality cannot be quantified without a human annotator, which can be costly and time-consuming. This work seeks to improve the assessment of segmentation by proposing a reference-free segmentation quality index (SegReFree). The metric takes advantage of the fact that sentence-level segmentation generally seeks to place segment boundaries at semantic boundaries within the text. The proposed metric applies a modified cluster validity metric to semantic embeddings of the sentences to determine the quality of the segmentation. Multiple segmentation data sets are used to compare the proposed metric with existing reference-based segmentation metrics: the reference segmentation is progressively degraded while all metrics are computed, and through this process a strong correlation with existing segmentation metrics is shown. A Python library implementing the metric is released under the GNU General Public License, and the repository is available at https://github.com/evan-person/reference_free_segmentation_metric.
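To make the idea of scoring a segmentation with a cluster validity metric concrete, the following is a minimal sketch rather than the released library's implementation. The abstract does not name the cluster validity metric or the modification, so this sketch assumes a Davies-Bouldin-style ratio (within-segment scatter over between-centroid distance) restricted to neighbouring segments. The function names `split_segments` and `adjacent_db_index`, the `eps` parameter, and the convention that boundaries are start indices of new segments are all hypothetical. Sentence embeddings are assumed to be precomputed by any sentence encoder and passed in as an array.

```python
import numpy as np


def split_segments(embeddings, boundaries):
    """Split an (n_sentences, dim) array at the given boundary indices.

    Assumption: each boundary is the index of the first sentence of a
    new segment, so valid values lie in 1..n_sentences-1.
    """
    boundaries = sorted(set(boundaries))
    n = len(embeddings)
    if any(b <= 0 or b >= n for b in boundaries):
        raise ValueError("boundaries must lie strictly between 0 and n_sentences")
    return np.split(embeddings, boundaries)


def adjacent_db_index(embeddings, boundaries, eps=1e-12):
    """Davies-Bouldin-style score using only adjacent segments (assumed variant).

    Lower values indicate tighter segments whose centroids are farther
    from those of their neighbours.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    segments = split_segments(embeddings, boundaries)
    if len(segments) < 2:
        raise ValueError("at least one boundary is required")

    centroids = [seg.mean(axis=0) for seg in segments]
    # Scatter: mean distance of each sentence to its segment centroid.
    scatter = [
        np.linalg.norm(seg - c, axis=1).mean()
        for seg, c in zip(segments, centroids)
    ]

    worst_ratios = []
    for i in range(len(segments)):
        ratios = []
        for j in (i - 1, i + 1):  # only the immediate neighbours
            if 0 <= j < len(segments):
                dist = np.linalg.norm(centroids[i] - centroids[j])
                ratios.append((scatter[i] + scatter[j]) / (dist + eps))
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))
```

Under these assumptions, a segmentation whose boundaries coincide with semantic shifts should receive a lower score than one whose boundaries are displaced, which is the behaviour a progressive-degradation experiment would probe.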
Anthology ID:
2023.findings-emnlp.195
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2957–2968
URL:
https://aclanthology.org/2023.findings-emnlp.195
DOI:
10.18653/v1/2023.findings-emnlp.195
Cite (ACL):
Evan Lucas, Dylan Kangas, and Timothy Havens. 2023. A Reference-free Segmentation Quality Index (SegReFree). In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2957–2968, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Reference-free Segmentation Quality Index (SegReFree) (Lucas et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.195.pdf