Universal Sentence Representation Learning with Conditional Masked Language Model

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve


Abstract
This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g., a 10% improvement over baseline models on cross-lingual semantic search. We explore the same-language bias of the learned representations and propose a simple, post-training, model-agnostic approach to remove the language-identifying information from the representation while still retaining sentence semantics.
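The core CMLM idea described above, conditioning masked-token prediction on an encoded vector of an adjacent sentence, can be illustrated with a toy sketch. This is not the paper's architecture: the encoder, the mean-pooling, the additive conditioning, and all parameter names here are illustrative stand-ins (the actual model uses a Transformer encoder and learned projections of the sentence vector).

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 50, 16
MASK_ID = 0

# Toy embedding tables (stand-ins for learned parameters).
tok_emb = rng.normal(size=(VOCAB, DIM))
out_proj = rng.normal(size=(DIM, VOCAB))

def encode_sentence(token_ids):
    """Stand-in sentence encoder: mean-pool token embeddings."""
    return tok_emb[token_ids].mean(axis=0)

def cmlm_logits(masked_ids, context_vec):
    """Predict masked tokens conditioned on an adjacent sentence's vector.

    Adding the context vector to every token representation is a
    simplification of how the sentence vector modulates the MLM decoder.
    """
    h = tok_emb[masked_ids] + context_vec  # condition on adjacent sentence
    return h @ out_proj                    # (seq_len, VOCAB) logits

# Sentence B provides context for reconstructing masked sentence A.
sent_b = np.array([5, 9, 12])
sent_a_masked = np.array([3, MASK_ID, 7])

logits = cmlm_logits(sent_a_masked, encode_sentence(sent_b))
print(logits.shape)  # (3, 50)
```

Because the reconstruction loss forces the sentence vector to carry information useful for predicting a neighboring sentence, the encoder's pooled output becomes a usable sentence representation, which is what the paper evaluates on SentEval and cross-lingual retrieval.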
Anthology ID:
2021.emnlp-main.502
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6216–6228
URL:
https://aclanthology.org/2021.emnlp-main.502
DOI:
10.18653/v1/2021.emnlp-main.502
Cite (ACL):
Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal Sentence Representation Learning with Conditional Masked Language Model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6216–6228, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Universal Sentence Representation Learning with Conditional Masked Language Model (Yang et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.502.pdf
Video:
https://aclanthology.org/2021.emnlp-main.502.mp4
Data
PAWS-X, SICK, SST, SentEval