Anlei Dong


2023

CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion
Xingwei He | Yeyun Gong | A-Long Jin | Hang Zhang | Anlei Dong | Jian Jiao | Siu Yiu | Nan Duan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, such methods expand the document with a real query, but during inference they replace the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, such a model performs even worse than the vanilla dense retrieval model, because its performance relies heavily on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.

2022

SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval
Kun Zhou | Yeyun Gong | Xiao Liu | Wayne Xin Zhao | Yelong Shen | Anlei Dong | Jingwen Lu | Rangan Majumder | Ji-rong Wen | Nan Duan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that, according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (possibly false negatives) nor too easy (uninformative). They are the ambiguous negatives and need more attention during training. Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. We have made the code and models publicly available at https://github.com/microsoft/SimXNS.
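
The idea of favoring negatives whose relevance score lies close to that of the positive can be sketched as follows. The abstract does not state the sampling distribution, so the Gaussian-shaped weighting and the names `a`, `b`, `ambiguous_negative_weights`, and `sample_ambiguous_negatives` are assumptions made for illustration only, not the released implementation.

```python
import math
import random


def ambiguous_negative_weights(neg_scores, pos_score, a=1.0, b=0.0):
    """Unnormalised sampling weights for candidate negatives.

    Assumed form (illustrative): the weight peaks when a negative's
    score equals pos_score + b and decays as the score moves away,
    at a rate controlled by a.
    """
    return [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]


def sample_ambiguous_negatives(neg_scores, pos_score, k, a=1.0, b=0.0, seed=None):
    """Draw k negative indices (with replacement) in proportion to the weights."""
    rng = random.Random(seed)
    weights = ambiguous_negative_weights(neg_scores, pos_score, a, b)
    return rng.choices(range(len(neg_scores)), weights=weights, k=k)


if __name__ == "__main__":
    # Hypothetical relevance scores for three candidate negatives.
    scores = [0.95, 0.8, 0.2]
    print(ambiguous_negative_weights(scores, pos_score=0.8))
    print(sample_ambiguous_negatives(scores, pos_score=0.8, k=5, seed=0))
```

Under this assumed weighting, a negative scored far below the positive (too easy) or far above it (likely a false negative) is drawn less often than one scored near the positive.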

2012

Iterative Viterbi A* Algorithm for K-Best Sequential Decoding
Zhiheng Huang | Yi Chang | Bo Long | Jean-Francois Crespo | Anlei Dong | Sathiya Keerthi | Su-Lin Wu
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2010

Learning Recurrent Event Queries for Web Search
Ruiqiang Zhang | Yuki Konda | Anlei Dong | Pranam Kolari | Yi Chang | Zhaohui Zheng
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Empirical Exploitation of Click Data for Task Specific Ranking
Anlei Dong | Yi Chang | Shihao Ji | Ciya Liao | Xin Li | Zhaohui Zheng
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing