Smoothed Contrastive Learning for Unsupervised Sentence Embedding

Xing Wu, Chaochen Gao, Yipeng Su, Jizhong Han, Zhongyuan Wang, Songlin Hu


Abstract
Unsupervised contrastive sentence embedding models, e.g., unsupervised SimCSE, are trained with the InfoNCE loss function. In theory, larger batches provide more comparisons among samples and help avoid overfitting; in practice, however, performance degrades once the batch size exceeds a threshold, which our statistical observations suggest is caused by the introduction of false-negative pairs. To alleviate this problem, we introduce a simple smoothing strategy on top of the InfoNCE loss function, termed Gaussian Smoothed InfoNCE (GS-InfoNCE). Specifically, we add random Gaussian noise vectors as extra negatives, extending the negative set without increasing the batch size. Though simple, the proposed smoothing strategy brings improvements over unsupervised SimCSE on the semantic textual similarity tasks.
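
The following is a minimal sketch of the GS-InfoNCE idea as described in the abstract: the usual in-batch InfoNCE negatives are extended with random Gaussian noise vectors, so the effective number of negatives grows without enlarging the batch. The function name, arguments (temperature, number of noise negatives), and the choice to normalize the noise vectors are illustrative assumptions, not the authors' released implementation (see the linked code for that).

```python
import torch
import torch.nn.functional as F


def gs_infonce_loss(anchor, positive, temperature=0.05, num_noise=64):
    """InfoNCE over in-batch negatives plus Gaussian-noise negatives (sketch).

    anchor, positive: [batch_size, dim] sentence embeddings, e.g., two dropout
    views of the same sentences as in unsupervised SimCSE.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)

    # Standard in-batch similarity matrix; diagonal entries are the positive pairs.
    sim = anchor @ positive.t() / temperature                  # [B, B]

    # Smoothing negatives: random Gaussian vectors (here normalized to the unit sphere;
    # this normalization is an assumption of the sketch).
    noise = torch.randn(num_noise, anchor.size(-1), device=anchor.device)
    noise = F.normalize(noise, dim=-1)
    noise_sim = anchor @ noise.t() / temperature               # [B, num_noise]

    # Append the noise similarities as extra negatives for every anchor.
    logits = torch.cat([sim, noise_sim], dim=1)                # [B, B + num_noise]
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```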
Anthology ID:
2022.coling-1.434
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4902–4906
URL:
https://aclanthology.org/2022.coling-1.434
Cite (ACL):
Xing Wu, Chaochen Gao, Yipeng Su, Jizhong Han, Zhongyuan Wang, and Songlin Hu. 2022. Smoothed Contrastive Learning for Unsupervised Sentence Embedding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4902–4906, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Smoothed Contrastive Learning for Unsupervised Sentence Embedding (Wu et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.434.pdf
Code:
caskcsg/sentemb (plus additional community code)