Anh Khoa Ngo Ho
Also published as: Anh Khoa Ngo Ho
2021
Optimizing Word Alignments with Better Subword Tokenization
Anh Khoa Ngo Ho
|
François Yvon
Proceedings of Machine Translation Summit XVIII: Research Track
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation systems, to learn bilingual dictionaries, or to perform quality estimation. Subword tokenization has become a standard preprocessing step for a large number of applications, notably for state-of-the-art open vocabulary machine translation systems. In this paper, we thoroughly study how this preprocessing step interacts with the word alignment task and propose several tokenization strategies to obtain well-segmented parallel corpora. Using these new techniques, we were able to improve baseline word-based alignment models for six language pairs.
2020
Generative latent neural models for automatic word alignment
Anh Khoa Ngo Ho
|
François Yvon
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
2019
Neural Baselines for Word Alignment
Anh Khoa Ngo Ho
|
François Yvon
Proceedings of the 16th International Conference on Spoken Language Translation
Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems, or to perform quality estimation. In most areas of natural language processing, neural network models nowadays constitute the preferred approach, a situation that might also apply to word alignment models. In this work, we study and comprehensively evaluate neural models for unsupervised word alignment for four language pairs, contrasting several variants of neural models. We show that in most settings, neural versions of the IBM-1 and hidden Markov models vastly outperform their discrete counterparts. We also analyze typical alignment errors of the baselines that our models overcome to illustrate the benefits, and the limitations, of these new models for morphologically rich languages.