Universal Sentence Representation Learning with Conditional Masked Language Model
Ziyi Yang | Yinfei Yang | Daniel Cer | Jax Law | Eric Darve
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g., a 10% improvement over baseline models on cross-lingual semantic search. We explore the same-language bias of the learned representations and propose a simple, post-training, model-agnostic approach to remove the language-identifying information from the representation while still retaining sentence semantics.
Multilingual Universal Sentence Encoder for Semantic Retrieval
Yinfei Yang | Daniel Cer | Amin Ahmad | Mandy Guo | Jax Law | Noah Constant | Gustavo Hernandez Abrego | Steve Yuan | Chris Tar | Yun-hsuan Sung | Brian Strope | Ray Kurzweil
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
We present easy-to-use, retrieval-focused multilingual sentence embedding models, made available on TensorFlow Hub. The models embed text from 16 languages into a shared semantic space using a multi-task trained dual-encoder that learns tied cross-lingual representations via translation bridge tasks (Chidambaram et al., 2018). The models achieve a new state of the art on monolingual and cross-lingual semantic retrieval (SR). Competitive performance is obtained on the related tasks of translation pair bitext retrieval (BR) and retrieval question answering (ReQA). On transfer learning tasks, our multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.