@inproceedings{marchisio-etal-2022-embedding,
title = "Embedding-Enhanced {GIZA}++: Improving Low-Resource Word Alignment Using Embeddings",
author = "Marchisio, Kelly and
Xiong, Conghao and
Koehn, Philipp",
editor = "Duh, Kevin and
Guzm{\'a}n, Francisco",
booktitle = "Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)",
month = sep,
year = "2022",
address = "Orlando, USA",
publisher = "Association for Machine Translation in the Americas",
url = "https://aclanthology.org/2022.amta-research.20",
pages = "264--273",
abstract = "A popular natural language processing task decades ago, word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods that outperform GIZA++ primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments itself. We introduce Embedding-Enhanced GIZA++, and outperform GIZA++ without any of the aforementioned factors. Taking advantage of monolingual embedding spaces of source and target language only, we exceed GIZA++{'}s performance in every tested scenario for three language pairs. In the lowest-resource setting, we outperform GIZA++ by 8.5, 10.9, and 12 AER for Ro-En, De-En, and En-Fr, respectively. We release our code at www.blind-review.code.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="marchisio-etal-2022-embedding">
<titleInfo>
<title>Embedding-Enhanced GIZA++: Improving Low-Resource Word Alignment Using Embeddings</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kelly</namePart>
<namePart type="family">Marchisio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Conghao</namePart>
<namePart type="family">Xiong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Philipp</namePart>
<namePart type="family">Koehn</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2022-09</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kevin</namePart>
<namePart type="family">Duh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Francisco</namePart>
<namePart type="family">Guzmán</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Machine Translation in the Americas</publisher>
<place>
<placeTerm type="text">Orlando, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>A popular natural language processing task decades ago, word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods that outperform GIZA++ primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments itself. We introduce Embedding-Enhanced GIZA++, and outperform GIZA++ without any of the aforementioned factors. Taking advantage of monolingual embedding spaces of source and target language only, we exceed GIZA++’s performance in every tested scenario for three language pairs. In the lowest-resource setting, we outperform GIZA++ by 8.5, 10.9, and 12 AER for Ro-En, De-En, and En-Fr, respectively. We release our code at www.blind-review.code.</abstract>
<identifier type="citekey">marchisio-etal-2022-embedding</identifier>
<location>
<url>https://aclanthology.org/2022.amta-research.20</url>
</location>
<part>
<date>2022-09</date>
<extent unit="page">
<start>264</start>
<end>273</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Embedding-Enhanced GIZA++: Improving Low-Resource Word Alignment Using Embeddings
%A Marchisio, Kelly
%A Xiong, Conghao
%A Koehn, Philipp
%Y Duh, Kevin
%Y Guzmán, Francisco
%S Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
%D 2022
%8 September
%I Association for Machine Translation in the Americas
%C Orlando, USA
%F marchisio-etal-2022-embedding
%X A popular natural language processing task decades ago, word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods that outperform GIZA++ primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments itself. We introduce Embedding-Enhanced GIZA++, and outperform GIZA++ without any of the aforementioned factors. Taking advantage of monolingual embedding spaces of source and target language only, we exceed GIZA++’s performance in every tested scenario for three language pairs. In the lowest-resource setting, we outperform GIZA++ by 8.5, 10.9, and 12 AER for Ro-En, De-En, and En-Fr, respectively. We release our code at www.blind-review.code.
%U https://aclanthology.org/2022.amta-research.20
%P 264-273
Markdown (Informal)
[Embedding-Enhanced GIZA++: Improving Low-Resource Word Alignment Using Embeddings](https://aclanthology.org/2022.amta-research.20) (Marchisio et al., AMTA 2022)
ACL