An Intent-based and Annotation-free Method for Duplicate Question Detection in CQA Forums

Yubo Shu; Hansu Gu; Peng Zhang; Tun Lu; Ning Gu

doi:10.18653/v1/2023.findings-emnlp.596

Abstract

With the advent of large language models (LLMs), Community Question Answering (CQA) forums offer well-curated questions and answers that can be utilized for instruction-tuning, effectively training LLMs to be aligned with human intents. However, the issue of duplicate questions arises as the volume of content within CQA continues to grow, posing a threat to content quality. Recent research highlights the benefits of detecting and eliminating duplicate content: doing so not only enhances LLMs’ ability to generalize across diverse intents but also improves the efficiency of training data utilization while addressing concerns related to information leakage. However, existing methods for detecting duplicate questions in CQA typically rely on generic text-pair matching models, overlooking the intent behind the questions. In this paper, we propose a novel intent-based duplication detector named Intent-DQD that comprehensively leverages intent information to address the problem of duplicate question detection in CQA. Intent-DQD first leverages the characteristics of CQA forums to extract training labels for recognizing and matching intents without human annotation. Intent-DQD then aggregates intent-level relations and establishes question-level relations to enable intent-aware duplication detection. Experimental results on fifteen distinct domains from both the CQADupStack and Stack Overflow datasets demonstrate the effectiveness of Intent-DQD. Code and datasets for reproducing the results will be released upon publication of the paper.

Anthology ID: 2023.findings-emnlp.596
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8889–8899
URL: https://aclanthology.org/2023.findings-emnlp.596
DOI: 10.18653/v1/2023.findings-emnlp.596
Cite (ACL): Yubo Shu, Hansu Gu, Peng Zhang, Tun Lu, and Ning Gu. 2023. An Intent-based and Annotation-free Method for Duplicate Question Detection in CQA Forums. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8889–8899, Singapore. Association for Computational Linguistics.
Cite (Informal): An Intent-based and Annotation-free Method for Duplicate Question Detection in CQA Forums (Shu et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.596.pdf