A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings

Haochen Tan; Wei Shao; Han Wu; Ke Yang; Linqi Song

doi:10.18653/v1/2022.findings-acl.22

A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings

Haochen Tan, Wei Shao, Han Wu, Ke Yang, Linqi Song

Abstract

Contrastive learning has shown great potential in unsupervised sentence embedding tasks, e.g., SimCSE (CITATION).However, these existing solutions are heavily affected by superficial features like the length of sentences or syntactic structures. In this paper, we propose a semantic-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PT-BERT), which is able to explore the pseudo-token space (i.e., latent semantic space) representation of a sentence while eliminating the impact of superficial features such as sentence length and syntax. Specifically, we introduce an additional pseudo token embedding layer independent of the BERT encoder to map each sentence into a sequence of pseudo tokens in a fixed length. Leveraging these pseudo sequences, we are able to construct same-length positive and negative pairs based on the attention mechanism to perform contrastive learning. In addition, we utilize both the gradient-updating and momentum-updating encoders to encode instances while dynamically maintaining an additional queue to store the representation of sentence embeddings, enhancing the encoder’s learning performance for negative examples. Experiments show that our model outperforms the state-of-the-art baselines on six standard semantic textual similarity (STS) tasks. Furthermore, experiments on alignments and uniformity losses, as well as hard examples with different sentence lengths and syntax, consistently verify the effectiveness of our method.

Anthology ID:: 2022.findings-acl.22
Volume:: Findings of the Association for Computational Linguistics: ACL 2022
Month:: May
Year:: 2022
Address:: Dublin, Ireland
Editors:: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 246–256
Language:
URL:: https://aclanthology.org/2022.findings-acl.22/
DOI:: 10.18653/v1/2022.findings-acl.22
Bibkey:
Cite (ACL):: Haochen Tan, Wei Shao, Han Wu, Ke Yang, and Linqi Song. 2022. A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings. In Findings of the Association for Computational Linguistics: ACL 2022, pages 246–256, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):: A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings (Tan et al., Findings 2022)
Copy Citation:
PDF:: https://aclanthology.org/2022.findings-acl.22.pdf

PDF Cite Search Fix data