Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models

Chia-Hsuan Chang, Tien Yuan Huang, Yi-Hang Tsai, Chia-Ming Chang, San-Yih Hwang


Abstract
Recent clustering-based topic models perform well at monolingual topic identification by introducing a pipeline that clusters contextualized representations. However, this pipeline is suboptimal for identifying topics across languages because multilingual language models produce language-dependent dimensions (LDDs). To address this issue, we introduce a novel SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.
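The abstract does not spell out the refinement procedure, so the sketch below is only illustrative and is not the authors' implementation. It assumes the refinement removes the dominant SVD directions of the mean-centered multilingual embedding matrix (directions that often correlate with language identity) before clustering documents into topics. The function name refine_dimensions, the number of removed components k, and the KMeans setup are all assumptions introduced here for illustration.

import numpy as np
from sklearn.cluster import KMeans

def refine_dimensions(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """Project embeddings onto the complement of their top-k right singular directions."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Economy-size SVD: rows are documents, columns are embedding dimensions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]  # (k, dim) dominant right singular vectors
    # Subtract each embedding's projection onto the dominant subspace.
    return centered - centered @ top.T @ top

# Hypothetical usage: doc_embeddings stands in for multilingual sentence
# embeddings (e.g., English and Chinese documents encoded by the same model).
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(200, 384))  # placeholder embeddings
refined = refine_dimensions(doc_embeddings, k=2)
topic_assignments = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(refined)
print(topic_assignments[:10])

Under these assumptions, removing the leading singular directions discards the variance most shared within each language, so the clustering step groups documents by topic rather than by language; the paper should be consulted for the actual refinement procedure and hyperparameters.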
Anthology ID:
2025.bucc-1.6
Volume:
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Serge Sharoff, Ayla Rigouts Terryn, Pierre Zweigenbaum, Reinhard Rapp
Venues:
BUCC | WS
Publisher:
Association for Computational Linguistics
Pages:
46–56
URL:
https://aclanthology.org/2025.bucc-1.6/
Cite (ACL):
Chia-Hsuan Chang, Tien Yuan Huang, Yi-Hang Tsai, Chia-Ming Chang, and San-Yih Hwang. 2025. Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models. In Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC), pages 46–56, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models (Chang et al., BUCC 2025)
PDF:
https://aclanthology.org/2025.bucc-1.6.pdf