Unsupervised Concept Representation Learning for Length-Varying Text Similarity

Xuchao Zhang, Bo Zong, Wei Cheng, Jingchao Ni, Yanchi Liu, Haifeng Chen


Abstract
Measuring document similarity plays an important role in natural language processing tasks. Most existing document similarity approaches suffer from the information gap caused by context and vocabulary mismatches when comparing texts of varying lengths. In this paper, we propose an unsupervised concept representation learning approach to address these issues. Specifically, we propose a novel Concept Generation Network (CGNet) to learn concept representations from the perspective of the entire text corpus. Moreover, a concept-based document matching method is proposed to leverage advances in the recognition of local phrase features and corpus-level concept features. Extensive experiments on real-world datasets demonstrate that the proposed method achieves a considerable improvement when comparing length-varying texts. In particular, our model achieves an F1 score 6.5% higher than the best baseline on the Concept-Project benchmark dataset.
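The abstract describes the approach only at a high level. As a rough illustration (not the paper's actual CGNet or matching model), the following Python sketch shows the general idea of comparing a short text with a long document through a shared set of corpus-level concept vectors rather than raw word overlap; the function names, the soft-assignment pooling, and the random stand-in embeddings are all assumptions made for this example.

import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize rows (or a single vector) to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def concept_profile(token_embeddings, concept_embeddings):
    # Soft-assign each token to corpus-level concepts (softmax over cosine
    # similarities), then average over tokens to get one profile per text.
    sims = l2_normalize(token_embeddings) @ l2_normalize(concept_embeddings).T
    weights = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    return l2_normalize(weights.mean(axis=0))

def concept_similarity(text_a_tokens, text_b_tokens, concepts):
    # Cosine similarity in concept space, so a short text and a long document
    # are compared on shared concepts rather than on raw vocabulary overlap.
    return float(concept_profile(text_a_tokens, concepts)
                 @ concept_profile(text_b_tokens, concepts))

# Toy usage with random vectors standing in for real token embeddings.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(16, 64))    # 16 hypothetical corpus-level concept vectors
short_text = rng.normal(size=(8, 64))   # an 8-token text
long_doc = rng.normal(size=(300, 64))   # a 300-token document
print(concept_similarity(short_text, long_doc, concepts))

In this hypothetical setup, pooling both texts into the same concept space is what makes the short text and the long document directly comparable despite the length mismatch; the paper learns such concept representations without supervision, which the fixed random concept matrix above does not attempt to model.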
Anthology ID:
2021.naacl-main.445
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5611–5620
URL:
https://aclanthology.org/2021.naacl-main.445
DOI:
10.18653/v1/2021.naacl-main.445
Cite (ACL):
Xuchao Zhang, Bo Zong, Wei Cheng, Jingchao Ni, Yanchi Liu, and Haifeng Chen. 2021. Unsupervised Concept Representation Learning for Length-Varying Text Similarity. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5611–5620, Online. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Concept Representation Learning for Length-Varying Text Similarity (Zhang et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.445.pdf
Video:
https://aclanthology.org/2021.naacl-main.445.mp4