Seeded Hierarchical Clustering for Expert-Crafted Taxonomies

Anish Saha, Amith Ananthram, Emily Allaway, Heng Ji, Kathleen McKeown


Abstract
Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that uses only a small set of labeled seed examples in a computation- and data-efficient manner. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms unsupervised and supervised baselines for the SHC task on three real-world datasets.
Anthology ID:
2022.findings-emnlp.115
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1595–1609
URL:
https://aclanthology.org/2022.findings-emnlp.115
DOI:
10.18653/v1/2022.findings-emnlp.115
Cite (ACL):
Anish Saha, Amith Ananthram, Emily Allaway, Heng Ji, and Kathleen McKeown. 2022. Seeded Hierarchical Clustering for Expert-Crafted Taxonomies. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1595–1609, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Seeded Hierarchical Clustering for Expert-Crafted Taxonomies (Saha et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.115.pdf