Small Model and In-Domain Data Are All You Need

Hui Zeng


Abstract
I participated in the WMT shared news translation task and focused on one high-resource language pair: English and Chinese (two directions, Chinese to English and English to Chinese). The submitted systems (ZengHuiMT) focus on data cleaning, data selection, back translation and model ensembling. The techniques I used for data filtering and selection include rule-based filtering, language-model filtering and word-alignment filtering. I used a base translation model trained on the initial corpus to obtain target-language versions of the WMT21 test sets; language models were then used to find the monolingual data most similar to those target versions, and this monolingual data was used for back translation. On the test set, my best submitted systems achieve 35.9 and 32.2 BLEU for the English to Chinese and Chinese to English directions respectively, scores that are quite high for a small model.
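The data-selection step described above can be illustrated with a minimal sketch. It assumes an n-gram language model (here via KenLM, which the abstract does not name) has been trained on the machine-translated "target version" of the test set; sentences from the general monolingual pool are ranked by perplexity under that model, and the lowest-perplexity (most test-set-like) sentences are kept for back translation. Paths, the keep ratio and the helper name are illustrative, not the paper's implementation.

    # Sketch of LM-based in-domain data selection (assumptions noted above).
    import kenlm

    def select_in_domain(monolingual_path, lm_path, keep_ratio=0.1):
        """Return the fraction of monolingual sentences closest to the in-domain LM."""
        model = kenlm.Model(lm_path)  # n-gram LM trained on the pseudo test-set translations
        scored = []
        with open(monolingual_path, encoding="utf-8") as f:
            for line in f:
                sent = line.strip()
                if not sent:
                    continue
                # Lower perplexity = more similar to the test-set-like text.
                scored.append((model.perplexity(sent), sent))
        scored.sort(key=lambda x: x[0])
        cutoff = int(len(scored) * keep_ratio)
        return [sent for _, sent in scored[:cutoff]]

    # Hypothetical usage: the selected sentences would then be back-translated
    # with the base model and added to the training data.
    # selected = select_in_domain("news.2020.en.txt", "testset_target.arpa", keep_ratio=0.05)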
Anthology ID:
2021.wmt-1.24
Volume:
Proceedings of the Sixth Conference on Machine Translation
Month:
November
Year:
2021
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzmán, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
255–259
URL:
https://aclanthology.org/2021.wmt-1.24
Cite (ACL):
Hui Zeng. 2021. Small Model and In-Domain Data Are All You Need. In Proceedings of the Sixth Conference on Machine Translation, pages 255–259, Online. Association for Computational Linguistics.
Cite (Informal):
Small Model and In-Domain Data Are All You Need (Zeng, WMT 2021)
PDF:
https://aclanthology.org/2021.wmt-1.24.pdf