2024
pdf
bib
abs
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
Sunhao Dai
|
Weihao Liu
|
Yuqi Zhou
|
Liang Pang
|
Rongju Ruan
|
Gang Wang
|
Zhenhua Dong
|
Jun Xu
|
Ji-Rong Wen
Findings of the Association for Computational Linguistics: ACL 2024
The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail.
2023
pdf
bib
abs
An Effective Deployment of Contrastive Learning in Multi-label Text Classification
Nankai Lin
|
Guanqiu Qin
|
Gang Wang
|
Dong Zhou
|
Aimin Yang
Findings of the Association for Computational Linguistics: ACL 2023
The effectiveness of contrastive learning technology in natural language processing tasks is yet to be explored and analyzed. How to construct positive and negative samples correctly and reasonably is the core challenge of contrastive learning. It is even harder to discover contrastive objects in multi-label text classification tasks. There are very few contrastive losses proposed previously. In this paper, we investigate the problem from a different angle by proposing five novel contrastive losses for multi-label text classification tasks. These are Strict Contrastive Loss (SCL), Intra-label Contrastive Loss (ICL), Jaccard Similarity Contrastive Loss (JSCL), Jaccard Similarity Probability Contrastive Loss (JSPCL), and Stepwise Label Contrastive Loss (SLCL). We explore the effectiveness of contrastive learning for multi-label text classification tasks by the employment of these novel losses and provide a set of baseline models for deploying contrastive learning techniques on specific tasks. We further perform an interpretable analysis of our approach to show how different components of contrastive learning losses play their roles. The experimental results show that our proposed contrastive losses can bring improvement to multi-label text classification tasks. Our work also explores how contrastive learning should be adapted for multi-label text classification tasks.
2019
pdf
bib
abs
Easy First Relation Extraction with Information Redundancy
Shuai Ma
|
Gang Wang
|
Yansong Feng
|
Jinpeng Huai
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Many existing relation extraction (RE) models make decisions globally using integer linear programming (ILP). However, it is nontrivial to make use of integer linear programming as a blackbox solver for RE. Its cost of time and memory may become unacceptable with the increase of data scale, and redundant information needs to be encoded cautiously for ILP. In this paper, we propose an easy first approach for relation extraction with information redundancies, embedded in the results produced by local sentence level extractors, during which conflict decisions are resolved with domain and uniqueness constraints. Information redundancies are leveraged to support both easy first collective inference for easy decisions in the first stage and ILP for hard decisions in a subsequent stage. Experimental study shows that our approach improves the efficiency and accuracy of RE, and outperforms both ILP and neural network-based methods.
2010
pdf
bib
Automatic Generation of Semantic Fields for Annotating Web Images
Gang Wang
|
Tat Seng Chua
|
Chong-Wah Ngo
|
Yong Cheng Wang
Coling 2010: Posters
2003
pdf
bib
Extracting Key Semantic Terms from Chinese Speech Query for Web Searches
Gang Wang
|
Tat-Seng Chua
|
Yong-Cheng Wang
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics