Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model

Muthu Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yunhsuan Sung, Brian Strope, Ray Kurzweil


Abstract
The scarcity of labeled training data across many languages is a significant roadblock for multilingual neural language processing. We approach the lack of in-language training data using sentence embeddings that map text written in different languages, but with similar meanings, to nearby embedding space representations. The representations are produced using a dual-encoder model trained to maximize the representational similarity between sentence pairs drawn from parallel data. The representations are further enhanced with multi-task training and unsupervised monolingual corpora. The effectiveness of our multilingual sentence embeddings is assessed on a comprehensive collection of monolingual, cross-lingual, and zero-shot/few-shot learning tasks.
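The training signal described in the abstract is a translation-ranking objective: for each source sentence, its aligned translation in the batch is the positive example and the other batch sentences serve as negatives, so that true pairs receive the highest similarity scores. The sketch below illustrates this in-batch softmax ranking loss over a dual encoder. It is a minimal illustration in PyTorch, not the paper's implementation: the mean-pooled embedding encoders, the L2 normalization, the similarity scale, and all dimensions are placeholder assumptions standing in for the paper's deeper encoders and multi-task setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Two encoders mapping source and target sentences into a shared space.

    Mean-pooled token embeddings stand in for the paper's deeper encoders.
    """
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.src_enc = nn.EmbeddingBag(vocab_size, dim)  # default mode='mean'
        self.tgt_enc = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, src_ids, tgt_ids):
        # src_ids, tgt_ids: [batch, tokens]; each row is one sentence.
        u = F.normalize(self.src_enc(src_ids), dim=-1)
        v = F.normalize(self.tgt_enc(tgt_ids), dim=-1)
        return u, v

def in_batch_ranking_loss(u, v, scale=10.0):
    """Softmax ranking loss with in-batch negatives.

    The diagonal of the similarity matrix holds the true (aligned) pairs;
    every off-diagonal sentence acts as a negative for its row.
    """
    scores = scale * (u @ v.t())                     # [batch, batch]
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(scores, labels)

A toy usage, with illustrative vocabulary and batch sizes:

model = DualEncoder(vocab_size=32000)
src = torch.randint(0, 32000, (8, 12))  # 8 source sentences, 12 tokens each
tgt = torch.randint(0, 32000, (8, 12))  # their aligned translations
u, v = model(src, tgt)
loss = in_batch_ranking_loss(u, v)
loss.backward()

Using the rest of the batch as negatives avoids mining explicit negative pairs, which is what makes training directly on parallel data practical at scale.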
Anthology ID:
W19-4330
Volume:
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Johannes Welbl, Alexis Conneau, Xiang Ren, Marek Rei
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
250–259
URL:
https://aclanthology.org/W19-4330
DOI:
10.18653/v1/W19-4330
Cite (ACL):
Muthu Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yunhsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 250–259, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model (Chidambaram et al., RepL4NLP 2019)
PDF:
https://aclanthology.org/W19-4330.pdf
Data
MultiNLI, SNLI, SentEval, XNLI