Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining

Chih-chan Tien, Shane Steinert-Threlkeld


Abstract
This work presents methods for learning cross-lingual sentence representations using paired or unpaired bilingual texts. We hypothesize that the cross-lingual alignment strategy is transferable, and therefore a model trained to align only two languages can produce representations that are more aligned across many languages. We thus introduce dual-pivot transfer: training on one language pair and evaluating on other pairs. To test this hypothesis, we design unsupervised models trained on unpaired sentences and single-pair supervised models trained on bitexts, both based on the unsupervised language model XLM-R with its parameters frozen. The experiments evaluate the models as universal sentence encoders on the task of unsupervised bitext mining on two datasets, where the unsupervised model achieves state-of-the-art unsupervised retrieval performance, and the alternative single-pair supervised model approaches the performance of multilingually supervised models. The results suggest that the proposed bilingual training techniques can be applied to obtain sentence representations with multilingual alignment.
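The bitext mining task the abstract evaluates on can be illustrated with a minimal sketch: embed sentences from two languages with a shared encoder, then pair each source sentence with its nearest target by cosine similarity. This is a generic nearest-neighbor retrieval sketch, not the paper's specific model; the toy embeddings and the `mine_bitext` helper are illustrative assumptions.

```python
import numpy as np

def mine_bitext(src_emb, tgt_emb):
    """Pair each source sentence with its nearest target by cosine similarity.

    src_emb, tgt_emb: 2-D arrays of sentence embeddings (rows = sentences).
    Returns the index of the best target per source and the similarity scores.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                         # (n_src, n_tgt) cosine similarities
    best = sim.argmax(axis=1)                 # nearest target index per source
    scores = sim[np.arange(len(src)), best]   # similarity of each chosen pair
    return best, scores

# Toy embeddings: source 0 should match target 1, source 1 should match target 0.
src = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt = np.array([[0.0, 1.0], [1.0, 0.0]])
pairs, scores = mine_bitext(src, tgt)
# pairs → [1, 0]
```

In practice, mining systems often refine the raw cosine score with margin-based scoring to penalize "hub" sentences that are close to everything, but plain nearest-neighbor retrieval conveys the core idea.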
Anthology ID:
2022.acl-long.595
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8696–8706
URL:
https://aclanthology.org/2022.acl-long.595
DOI:
10.18653/v1/2022.acl-long.595
Cite (ACL):
Chih-chan Tien and Shane Steinert-Threlkeld. 2022. Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8696–8706, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining (Tien & Steinert-Threlkeld, ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.595.pdf
Video:
https://aclanthology.org/2022.acl-long.595.mp4
Code:
cctien/bimultialign