xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Mingda Chen, Kevin Heffernan, Onur Çelebi, Alexandre Mourachko, Holger Schwenk


Abstract
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
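The evaluation idea the abstract describes is: encode the non-English sentences and the English candidates into a shared multilingual embedding space, retrieve the nearest English candidate for each source sentence, and report the retrieval error rate; xSIM++ additionally augments the English side with rule-generated hard negatives. The sketch below is a minimal illustration of that idea, not the authors' implementation: the published xSIM uses margin-based scoring, whereas this sketch substitutes plain cosine nearest-neighbor retrieval for brevity, and the function name and toy data are hypothetical.

```python
# Illustrative xSIM-style error rate (cosine nearest neighbor, not the
# margin-based scoring used in the actual xSIM). Assumes sentence
# embeddings from a multilingual encoder (e.g., LASER) are precomputed.
import numpy as np

def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target embedding is
    not the gold-aligned one.

    src_emb: (n, d) embeddings of non-English sentences.
    tgt_emb: (m, d) embeddings of English candidates; rows 0..n-1 are the
             gold translations, rows n..m-1 are extra candidates. xSIM++
             grows this pool with synthetic, hard-to-distinguish examples.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n, m) cosine similarities
    predictions = sims.argmax(axis=1)  # nearest-neighbor retrieval
    gold = np.arange(src.shape[0])     # row i is aligned with target i
    return float((predictions != gold).mean())

# Toy usage: 3 aligned pairs plus 2 distractor targets (random here;
# in xSIM++ they would be rule-based perturbations of the gold English).
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 16))
tgt = np.vstack([src + 0.05 * rng.normal(size=(3, 16)),  # gold targets
                 rng.normal(size=(2, 16))])              # distractors
print(xsim_error_rate(src, tgt))  # 0.0 when every gold target is nearest
```

Because random distractors are far from the sources, this toy setup yields a near-zero error rate; the point of xSIM++ is precisely that realistic, minimally perturbed distractors make the retrieval task, and hence the score, much more discriminative.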
Anthology ID:
2023.acl-short.10
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
101–109
URL:
https://aclanthology.org/2023.acl-short.10
DOI:
10.18653/v1/2023.acl-short.10
Cite (ACL):
Mingda Chen, Kevin Heffernan, Onur Çelebi, Alexandre Mourachko, and Holger Schwenk. 2023. xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 101–109, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages (Chen et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-short.10.pdf
Video:
https://aclanthology.org/2023.acl-short.10.mp4