VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE

Zhenyi Wang, Haiyan Ning, Qing Ling, Dan Wang


Abstract
Text embedding requires a highly efficient method for training domain-specific models on limited data, since general models trained on large corpora do not transfer well to highly specialized fields. We therefore introduce VAEGPT-Sim, an innovative synonym-generation model that combines a denoising variational autoencoder with a target-specific discriminator to produce synonymous sentences that closely resemble human language. Even when trained in a completely unsupervised setting, it maintains a harmonious balance between semantic similarity and lexical diversity, achieving the highest average scores among comparable generative models under a comprehensive evaluation metric system. When VAEGPT-Sim is used as a module for contrastive learning of text representations, it delivers state-of-the-art results for small-dataset training on STS benchmarks, surpassing ConSERT by 2.8 points. This approach maximizes the effectiveness of text representation despite a limited corpus, marking an advance in domain-specific embedding technology.
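This page carries no code; the sketch below is only an illustration of the contrastive-learning step the abstract mentions, not the authors' implementation. It assumes a hypothetical generate_synonym paraphraser standing in for VAEGPT-Sim and a generic sentence encoder, and shows an InfoNCE-style loss in which each sentence and its generated synonym form a positive pair while other sentences in the batch act as negatives.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over a batch: row i of anchor_emb pairs with row i of
    positive_emb; all other rows serve as in-batch negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # (B, B) matrix of scaled cosine similarities between all pairs
    logits = anchor @ positive.T / temperature
    # The matching synonym sits on the diagonal, so label i is i
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Hypothetical usage: `encode` is any sentence encoder (e.g., a BERT
# pooler) and `generate_synonym` stands in for the VAEGPT-Sim generator.
#   positives = [generate_synonym(s) for s in batch_sentences]
#   loss = info_nce_loss(encode(batch_sentences), encode(positives))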
Anthology ID:
2024.findings-acl.513
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8666–8681
URL:
https://aclanthology.org/2024.findings-acl.513
Cite (ACL):
Zhenyi Wang, Haiyan Ning, Qing Ling, and Dan Wang. 2024. VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE. In Findings of the Association for Computational Linguistics ACL 2024, pages 8666–8681, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE (Wang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.513.pdf