Dan Li


2023

pdf bib
Enhancing Extreme Multi-Label Text Classification: Addressing Challenges in Model, Data, and Evaluation
Dan Li | Zi Long Zhu | Janneke van de Loo | Agnes Masip Gomez | Vikrant Yadav | Georgios Tsatsaronis | Zubair Afzal
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Extreme multi-label text classification is a prevalent task in industry, but it frequently encounters challenges in terms of machine learning perspectives, including model limitations, data scarcity, and time-consuming evaluation. This paper aims to mitigate these issues by introducing novel approaches. Firstly, we propose a label ranking model as an alternative to the conventional SciBERT-based classification model, enabling efficient handling of large-scale labels and accommodating new labels. Secondly, we present an active learning-based pipeline that addresses the data scarcity of new labels during the update of a classification system. Finally, we introduce ChatGPT to assist with model evaluation. Our experiments demonstrate the effectiveness of these techniques in enhancing the extreme multi-label text classification task.

2022

pdf bib
VIRT: Improving Representation-based Text Matching via Virtual Interaction
Dan Li | Yang Yang | Hongyin Tang | Jiahao Liu | Qifan Wang | Jingang Wang | Tong Xu | Wei Wu | Enhong Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Text matching is a fundamental research problem in natural language understanding. Interaction-based approaches treat the text pair as a single sequence and encode it through cross encoders, while representation-based models encode the text pair independently with siamese or dual encoders. Interaction-based models require dense computations and thus are impractical in real-world applications. Representation-based models have become the mainstream paradigm for efficient text matching. However, these models suffer from severe performance degradation due to the lack of interactions between the pair of texts. To remedy this, we propose a Virtual InteRacTion mechanism (VIRT) for improving representation-based text matching while maintaining its efficiency. In particular, we introduce an interactive knowledge distillation module that is only applied during training. It enables deep interaction between texts by effectively transferring knowledge from the interaction-based model. A light interaction strategy is designed to fully leverage the learned interactive knowledge. Experimental results on six text matching benchmarks demonstrate the superior performance of our method over several state-of-the-art representation-based models. We further show that VIRT can be integrated into existing methods as plugins to lift their performances.

pdf bib
Unsupervised Dense Retrieval for Scientific Articles
Dan Li | Vikrant Yadav | Zubair Afzal | George Tsatsaronis
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

In this work, we build a dense retrieval based semantic search engine on scientific articles from Elsevier. The major challenge is that there is no labeled data for training and testing. We apply a state-of-the-art unsupervised dense retrieval model called Generative Pseudo Labeling that generates high-quality pseudo training labels. Furthermore, since the articles are unbalanced across different domains, we select passages from multiple domains to form balanced training data. For the evaluation, we create two test sets: one manually annotated and one automatically created from the meta-information of our data. We compare the semantic search engine with the currently deployed lexical search engine on the two test sets. The results of the experiment show that the semantic search engine trained with pseudo training labels can significantly improve search performance.