2024
Generalized Category Discovery with Large Language Models in the Loop
Wenbin An | Wenkai Shi | Feng Tian | Haonan Lin | QianYing Wang | Yaqiang Wu | Mingxiang Cai | Luyan Wang | Yan Chen | Haiping Zhu | Ping Chen
Findings of the Association for Computational Linguistics: ACL 2024
Generalized Category Discovery (GCD) is a crucial task that aims to recognize both known and novel categories from a set of unlabeled data by utilizing a few labeled data with only known categories. Due to the lack of supervision and category information, current methods usually perform poorly on novel categories and struggle to reveal the semantic meanings of the discovered clusters, which limits their applications in the real world. To mitigate these issues, we propose Loop, an end-to-end active-learning framework that introduces Large Language Models (LLMs) into the training loop, which can boost model performance and generate category names without relying on any human effort. Specifically, we first propose Local Inconsistent Sampling (LIS) to select samples that have a higher probability of falling into the wrong clusters, based on neighborhood prediction consistency and the entropy of cluster assignment probabilities. Then we propose a Scalable Query strategy to allow LLMs to choose true neighbors of the selected samples from multiple candidate samples. Based on the feedback from LLMs, we perform Refined Neighborhood Contrastive Learning (RNCL) to pull samples and their neighbors closer to learn clustering-friendly representations. Finally, we select representative samples from clusters corresponding to novel categories to allow LLMs to generate category names for them. Extensive experiments on three benchmark datasets show that Loop outperforms SOTA models by a large margin and generates accurate category names for the discovered clusters. Code and data are available at https://github.com/Lackel/LOOP.
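The sampling idea in this abstract (neighborhood prediction consistency combined with the entropy of cluster assignment probabilities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ranking rule (neighbor disagreement first, entropy as tie-breaker), and the toy data are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of one cluster-assignment distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def neighbor_disagreement(preds, neighbors, i):
    """Fraction of sample i's neighbors whose predicted cluster differs from i's."""
    nbrs = neighbors[i]
    return sum(preds[j] != preds[i] for j in nbrs) / len(nbrs)

def select_uncertain(probs, neighbors, budget):
    """Hypothetical selection rule: rank by disagreement, then entropy, keep top `budget`."""
    # Predicted cluster = argmax of each sample's assignment probabilities.
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    scores = [
        (neighbor_disagreement(preds, neighbors, i), entropy(probs[i]), i)
        for i in range(len(probs))
    ]
    scores.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [i for _, _, i in scores[:budget]]

# Toy data (assumed): 5 samples, 2 clusters, 2 precomputed neighbors per sample.
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.55, 0.45],
    [0.1, 0.9],
    [0.2, 0.8],
]
neighbors = [[1, 2], [0, 2], [3, 4], [4, 2], [3, 2]]
selected = select_uncertain(probs, neighbors, budget=1)
print(selected)
```

In this toy setup, sample 2 is predicted as cluster 0 while both of its neighbors are predicted as cluster 1, so it is the first candidate that would be sent to an LLM for a neighbor query.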
2023
A Diffusion Weighted Graph Framework for New Intent Discovery
Wenkai Shi | Wenbin An | Feng Tian | Qinghua Zheng | QianYing Wang | Ping Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
New Intent Discovery (NID) aims to recognize both new and known intents from unlabeled data with the aid of limited labeled data containing only known intents. Without considering structural relationships between samples, previous methods generate noisy supervisory signals that cannot strike a balance between quantity and quality, hindering the formation of new intent clusters and the effective transfer of pre-training knowledge. To mitigate this limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to capture both the semantic similarities and the structural relationships inherent in data, enabling more sufficient and reliable supervisory signals. Specifically, for each sample, we diffuse neighborhood relationships along semantic paths guided by the nearest neighbors for multiple hops to characterize its local structure discriminately. Then, we sample its positive keys and weigh them based on semantic similarities and local structures for contrastive learning. During inference, we further propose a Graph Smoothing Filter (GSF) to explicitly utilize the structural relationships to filter high-frequency noise embodied in semantically ambiguous samples on the cluster boundary. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Code and data will be made public.
DNA: Denoised Neighborhood Aggregation for Fine-grained Category Discovery
Wenbin An | Feng Tian | Wenkai Shi | Yan Chen | Qinghua Zheng | QianYing Wang | Ping Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Discovering fine-grained categories from coarsely labeled data is a practical and challenging task, which can bridge the gap between the demand for fine-grained analysis and the high annotation cost. Previous works mainly focus on instance-level discrimination to learn low-level features, but ignore semantic similarities between data, which may prevent these models from learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes semantic structures of data into the embedding space. Specifically, we retrieve k-nearest neighbors of a query as its positive keys to capture semantic similarities between data and then aggregate information from the neighbors to learn compact cluster representations, which can make fine-grained categories more separable. However, the retrieved neighbors can be noisy and contain many false-positive keys, which can degrade the quality of learned embeddings. To cope with this challenge, we propose three principles to filter out these false neighbors for better representation learning. Furthermore, we theoretically justify that the learning objective of our framework is equivalent to a clustering loss, which can capture semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method can retrieve more accurate neighbors (21.31% accuracy improvement) and outperform state-of-the-art models by a large margin (average 9.96% improvement on three metrics). Our code and data are available at https://github.com/Lackel/DNA.
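The retrieval step described in this abstract (k-nearest neighbors of a query used as positive keys, followed by aggregation) can be sketched as follows. This is a simplified illustration only: cosine similarity, mean aggregation, and the toy embeddings are assumptions, and the three filtering principles from the paper are not reproduced here because the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(embeddings, query_idx, k):
    """Indices of the k embeddings most similar to the query (query itself excluded)."""
    q = embeddings[query_idx]
    sims = [(cosine(q, e), j) for j, e in enumerate(embeddings) if j != query_idx]
    sims.sort(reverse=True)
    return [j for _, j in sims[:k]]

def aggregate(embeddings, idxs):
    """Mean of the selected embeddings (an assumed, simplest-possible aggregation)."""
    dim = len(embeddings[0])
    return [sum(embeddings[j][d] for j in idxs) / len(idxs) for d in range(dim)]

# Toy 2-D embeddings (assumed): three points near the x-axis, two near the y-axis.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [-0.1, 0.9]]
positive_keys = knn(embeddings, query_idx=0, k=2)
center = aggregate(embeddings, positive_keys)
print(positive_keys, center)
```

A larger k would start pulling in the points near the y-axis, which is the kind of false-positive neighbor the abstract says must be filtered out.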
2022
Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning
Wenbin An | Feng Tian | Ping Chen | Siliang Tang | Qinghua Zheng | QianYing Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Novel category discovery aims at adapting models trained on known categories to novel categories. Previous works only focus on the scenario where known and novel categories are of the same granularity. In this paper, we investigate a new practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC). FCDC aims at discovering fine-grained categories with only coarse-grained labeled data, which can adapt models to categories of different granularity from known ones and significantly reduce labeling cost. It is also a challenging task since supervised training on coarse-grained categories tends to focus on inter-class distance (distance between coarse-grained classes) but ignore intra-class distance (distance between fine-grained sub-classes), which is essential for separating fine-grained categories. Considering that most current methods cannot transfer knowledge from the coarse-grained level to the fine-grained level, we propose a hierarchical weighted self-contrastive network by building a novel weighted self-contrastive module and combining it with supervised learning in a hierarchical manner. Extensive experiments on public datasets show both the effectiveness and efficiency of our model over compared methods.
2021
Contrastive Learning of Sentence Representations
Hefei Qiu | Wei Ding | Ping Chen
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Learning sentence representations that capture rich semantic meanings has been crucial for many NLP tasks. Pre-trained language models such as BERT have achieved great success in NLP, but sentence embeddings extracted directly from these models do not perform well without fine-tuning. We propose Contrastive Learning of Sentence Representations (CLSR), a novel approach which applies contrastive learning to learn universal sentence representations on top of pre-trained language models. CLSR utilizes the semantic similarity of two sentences to construct positive instances for contrastive learning. Semantic information that has been captured by the pre-trained models is kept by getting sentence embeddings from these models with a proper pooling strategy. An encoder followed by a linear projection takes these embeddings as inputs and is trained under a contrastive objective. To evaluate the performance of CLSR, we run experiments on a range of pre-trained language models and their variants on a series of Semantic Contextual Similarity tasks. Results show that CLSR gains significant performance improvements over existing SOTA language models.
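As a rough illustration of what "trained under a contrastive objective" can mean, the sketch below computes an InfoNCE-style loss for one anchor embedding. The abstract does not state which contrastive loss CLSR uses, so the loss form, the temperature value, and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy sentence embeddings (assumed): the positive is semantically close to the anchor.
anchor = [1.0, 0.0]
close, far, opposite = [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]

loss_good = info_nce(anchor, close, [far, opposite])  # positive is the similar sentence
loss_bad = info_nce(anchor, far, [close, opposite])   # positive is a dissimilar sentence
print(loss_good, loss_bad)
```

Minimizing such a loss pulls semantically similar sentence pairs together in the embedding space and pushes the other sentences away.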
2015
Extended Topic Model for Word Dependency
Tong Wang | Vish Viswanath | Ping Chen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
2010
TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific Domain
Andrew Tran | Chris Bowes | David Brown | Ping Chen | Max Choly | Wei Ding
Proceedings of the 5th International Workshop on Semantic Evaluation
2009
A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge
Ping Chen | Wei Ding | Chris Bowes | David Brown
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics