2022
Language Tokens: Simply Improving Zero-Shot Multi-Aligned Translation in Encoder-Decoder Models
Muhammad N ElNokrashy | Amr Hendy | Mohamed Maher | Mohamed Afify | Hany Hassan
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
This paper proposes a simple and effective method to improve direct translation, both in the zero-shot case and when direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch or when finetuning a pretrained model with the proposed setup. In in-house experiments, our method shows a difference of nearly 10.0 BLEU points, depending on the stopping criterion. In a WMT-based setting, we see improvements of 1.3 and 0.4 BLEU points for the zero-shot setting and when using direct data for training, respectively, while from-English performance improves by 4.17 and 0.85 BLEU points. In the low-resource setting, we see a 1.5 ∼ 1.7 point improvement when finetuning on directly translated domain data.
Domain Specific Sub-network for Multi-Domain Neural Machine Translation
Amr Hendy | Mohamed Abdelghaffar | Mohamed Afify | Ahmed Y. Tawfik
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
This paper presents the Domain-Specific Sub-network (DoSS). It uses a set of masks obtained through pruning to define a sub-network for each domain and finetunes the sub-network parameters on domain data. This performs very close to finetuning the whole network on each domain while drastically reducing the number of parameters. A method to make masks unique per domain is also proposed and shown to greatly improve generalization to unseen domains. In our experiments on German-to-English machine translation, the proposed method outperforms the strong baseline of continued training on multi-domain (medical, tech and religion) data by 1.47 BLEU points. Continuing to train DoSS on a new domain (legal) also outperforms the multi-domain (medical, tech, religion, legal) baseline by 1.52 BLEU points.
2021
Ensembling of Distilled Models from Multi-task Teachers for Constrained Resource Language Pairs
Amr Hendy | Esraa A. Gad | Mohamed Abdelghaffar | Jailan S. ElMosalami | Mohamed Afify | Ahmed Y. Tawfik | Hany Hassan Awadalla
Proceedings of the Sixth Conference on Machine Translation
This paper describes the Microsoft Egypt Development Center (EgDC) submission to the constrained track of the WMT21 shared news translation task. We focus on three relatively low-resource language pairs: Bengali ↔ Hindi, English ↔ Hausa and Xhosa ↔ Zulu. To overcome the limited amount of parallel data, we train a multilingual model using a multitask objective that employs both parallel and monolingual data. In addition, we augment the data using back translation. We also train a bilingual model incorporating back translation and knowledge distillation, then combine the two models using sequence-to-sequence mapping. We see around a 70% relative gain in BLEU points for En ↔ Ha and around 25% relative improvements for Bn ↔ Hi and Xh ↔ Zu compared to bilingual baselines.
2020
Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
Muhammad ElNokrashy | Amr Hendy | Mohamed Abdelghaffar | Mohamed Afify | Ahmed Tawfik | Hany Hassan Awadalla
Proceedings of the Fifth Conference on Machine Translation
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from a custom LASER built for each source language, a classifier built to distinguish positive and negative pairs, and the original scores provided with the task. For the mBART setup provided by the organizers, our method shows relative improvements of 7% and 5% over the baseline in sacreBLEU score on the test set for Pashto and Khmer, respectively.