xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Mingda Chen; Kevin Heffernan; Onur Çelebi; Alexandre Mourachko; Holger Schwenk

doi:10.18653/v1/2023.acl-short.10

Abstract

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xsim++. In comparison to xsim, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently training NMT systems on the mined data. In comparison to xsim, we show that xsim++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xsim++ also reports performance for different error types, offering more fine-grained feedback for model development.
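
The idea of a similarity-based proxy with added hard negatives can be illustrated with a small sketch. The abstract does not give the scoring details, so everything below is an assumption made for illustration: the score is taken to be a nearest-neighbor retrieval error rate under cosine similarity, the 2-d vectors are hand-made rather than produced by a real encoder, and the names `retrieval_error_rate` and `hard_negatives` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieval_error_rate(src_vecs, tgt_vecs, gold):
    """Fraction of source sentences whose nearest target is not the gold one.

    Assumption: plain cosine nearest-neighbor retrieval stands in for
    whatever similarity scoring the paper actually uses.
    """
    errors = 0
    for i, s in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        if best != gold[i]:
            errors += 1
    return errors / len(src_vecs)


# Toy "embeddings" (hypothetical values, not from a real encoder).
src = [[1.0, 0.1], [0.1, 1.0]]   # low-resource-language sentences
tgt = [[1.0, 0.3], [0.0, 1.0]]   # their English translations
gold = [0, 1]                    # src[i] should retrieve tgt[gold[i]]

# Hypothetical hard negatives: stand-ins for synthetic perturbations of the
# English sentences that stay close to the originals in embedding space.
hard_negatives = [[1.0, 0.12], [0.5, 0.5]]

plain_error = retrieval_error_rate(src, tgt, gold)
augmented_error = retrieval_error_rate(src, tgt + hard_negatives, gold)

print(plain_error, augmented_error)
```

In this toy setup the plain candidate pool is easy and yields no retrieval errors, while the augmented pool contains a near-duplicate that outranks one gold translation. That is the intuition behind using harder candidates to make an evaluation set more discriminative.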

Anthology ID: 2023.acl-short.10
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 101–109
URL: https://aclanthology.org/2023.acl-short.10
DOI: 10.18653/v1/2023.acl-short.10
Cite (ACL): Mingda Chen, Kevin Heffernan, Onur Çelebi, Alexandre Mourachko, and Holger Schwenk. 2023. xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 101–109, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages (Chen et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-short.10.pdf
Video: https://aclanthology.org/2023.acl-short.10.mp4
