2024
Breaking the Hourglass Phenomenon of Residual Quantization: Enhancing the Upper Bound of Generative Retrieval
Zhirui Kuai | Zuxu Chen | Huimu Wang | Mingming Li | Dadong Miao | Wang Binbin | Xusong Chen | Li Kuang | Yuxing Han | Jiaxing Wang | Guoyu Tang | Lin Liu | Songlin Wang | Jingwei Zhuo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Generative retrieval (GR) has emerged as a transformative paradigm in search and recommender systems, leveraging numeric identifier representations to enhance efficiency and generalization. Notably, methods like TIGER, which employ Residual Quantization-based Semantic Identifiers (RQ-SID), have shown significant promise in e-commerce scenarios by effectively managing item IDs. However, a critical issue, termed the "Hourglass" phenomenon, occurs in RQ-SID: intermediate codebook tokens become overly concentrated, hindering the full utilization of generative retrieval methods. This paper analyzes and addresses this problem, identifying data sparsity and long-tailed distribution as the primary causes. Through comprehensive experiments and detailed ablation studies, we analyze the impact of these factors on codebook utilization and data distribution. Our findings reveal that the "Hourglass" phenomenon substantially degrades the performance of RQ-SID in generative retrieval. We propose effective solutions to mitigate this issue, thereby significantly enhancing the effectiveness of generative retrieval in real-world e-commerce applications.
2014
Clustering tweets using Wikipedia concepts
Guoyu Tang | Yunqing Xia | Weizhi Wang | Raymond Lau | Fang Zheng
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Two challenging issues are notable in tweet clustering. First, the data sparsity problem is serious, since no tweet can be longer than 140 characters. Second, synonymy and polysemy are common, because users express the same meaning in many different ways in tweets. Motivated by recent research indicating that Wikipedia is promising for representing text, we exploit Wikipedia concepts to represent tweets as concept vectors. We address the polysemy issue with a Bayesian model, and the synonymy issue by exploiting Wikipedia redirections. To further alleviate the data sparsity problem, we also make use of three types of out-links in Wikipedia. Evaluation on a Twitter dataset shows that the concept model outperforms the traditional VSM model in tweet clustering.
2012
CLTC: A Chinese-English Cross-lingual Topic Corpus
Yunqing Xia | Guoyu Tang | Peng Jin | Xia Yang
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Cross-lingual topic detection within text is a feasible solution to overcoming the language barrier in information access. This paper presents a Chinese-English cross-lingual topic corpus (CLTC), in which 90,000 Chinese articles and 90,000 English articles are organized into 150 topics. Compared with the TDT corpora, CLTC has three advantages. First, CLTC is larger in size, which makes it possible to evaluate large-scale cross-lingual text clustering methods. Second, articles are evenly distributed across the topics, so it can be used to produce test datasets for different purposes. Third, CLTC can serve as a cross-lingual comparable corpus for developing cross-lingual information access methods. A preliminary evaluation with the CLTC corpus indicates that it is effective in evaluating cross-lingual topic detection methods.
2011
CLGVSM: Adapting Generalized Vector Space Model to Cross-lingual Document Clustering
Guoyu Tang | Yunqing Xia | Min Zhang | Haizhou Li | Fang Zheng
Proceedings of 5th International Joint Conference on Natural Language Processing