Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework

Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, Haizhou Li


Abstract
Most sentence embedding techniques rely heavily on expensive human-annotated sentence pairs as the supervised signals. Despite the use of large-scale unlabeled data, the performance of unsupervised methods typically lags far behind that of their supervised counterparts in most downstream tasks. In this work, we propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data. Our method includes three parts: 1) Generate: A generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: Noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: A prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data. Comprehensive experiments show that GenSE achieves an average correlation score of 85.19 on the STS datasets and consistent performance improvements on four domain adaptation tasks, significantly surpassing state-of-the-art methods and corroborating its effectiveness and generalization ability.
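The Discriminate and Contrast steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses a generic InfoNCE-style contrastive loss over cosine similarities and a plain confidence threshold for filtering. The function names, the threshold of 0.9, the temperature of 0.05, and the toy vectors are all assumptions made for illustration, and the prompt-based encoding from the paper is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, scores, threshold=0.9):
    """Keep synthesized pairs whose discriminator confidence meets the threshold.

    `scores` stands in for a discriminator's output; both it and the
    threshold value are hypothetical.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """Generic InfoNCE-style loss for a single anchor embedding.

    The loss is small when the anchor is closer to the positive than to
    the negatives, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy usage: filter two synthesized pairs, then score two embedding layouts.
kept = filter_pairs([("a", "b"), ("c", "d")], [0.95, 0.40])
low = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup `kept` retains only the first pair, and `low` is much smaller than `high` because the anchor coincides with its positive in the first call and with its negative in the second.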
Anthology ID:
2022.emnlp-main.558
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8150–8161
URL:
https://aclanthology.org/2022.emnlp-main.558
DOI:
10.18653/v1/2022.emnlp-main.558
Cite (ACL):
Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, and Haizhou Li. 2022. Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8150–8161, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework (Chen et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.558.pdf