Zhilan Wang


2025

Unsupervised Sentence Representation Learning with Syntactically Aligned Negative Samples
Zhilan Wang | Zekai Zhi | Rize Jin | Kehui Song | He Wang | Da-Jung Cho
Findings of the Association for Computational Linguistics: NAACL 2025

Sentence representation learning benefits from data augmentation to improve model performance and generalization, yet existing approaches often suffer from semantic inconsistencies and feature suppression. To address these limitations, we propose generating Syntactically Aligned Negative (SAN) samples with a semantic importance-aware Masked Language Model (MLM): by quantifying the semantic contribution of individual words, our method produces negative samples that share substantial textual overlap with the original sentences while conveying different meanings. We further introduce Hierarchical-InfoNCE (HiNCE), a novel contrastive learning objective that applies differential temperature weighting to better exploit both in-batch and syntactically aligned negative samples. Extensive evaluations across seven semantic textual similarity benchmarks demonstrate consistent improvements over state-of-the-art models.
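The abstract does not spell out the HiNCE formulation, so the following is only a minimal PyTorch sketch of what a differential-temperature InfoNCE over in-batch and SAN negatives might look like; the function name `hince_loss`, the temperature values, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hince_loss(anchor, positive, san_negatives,
               in_batch_temp=0.05, san_temp=0.01):
    """Hypothetical differential-temperature InfoNCE.

    anchor:        (B, D) sentence embeddings
    positive:      (B, D) embeddings of positive views
    san_negatives: (B, K, D) syntactically aligned negatives per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    san_negatives = F.normalize(san_negatives, dim=-1)

    # In-batch logits: each anchor against every positive in the batch;
    # the diagonal entries are the true positive pairs.
    in_batch_logits = anchor @ positive.T / in_batch_temp          # (B, B)

    # SAN logits: each anchor against its own hard negatives, scaled by a
    # sharper (lower) temperature so these negatives are weighted more heavily.
    san_logits = torch.einsum('bd,bkd->bk',
                              anchor, san_negatives) / san_temp    # (B, K)

    logits = torch.cat([in_batch_logits, san_logits], dim=1)       # (B, B+K)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Under this assumed form, lowering `san_temp` relative to `in_batch_temp` sharpens the penalty on the high-overlap SAN samples, which is one plausible reading of "differential temperature weighting"; the paper itself should be consulted for the actual objective.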