Investigating the Impact of Various Partial Diacritization Schemes on Arabic-English Statistical Machine Translation

Sawsan Alqahtani, Mahmoud Ghoneim, Mona Diab


Abstract
Most diacritics in Arabic represent short vowels. In Arabic orthography, such diacritics are considered optional. Their absence naturally leads to significant word ambiguity, on top of the inherent ambiguity already present in fully diacritized words. Word ambiguity is a significant impediment for machine translation. Although the lack of diacritization introduces ambiguity, context helps ameliorate the situation. The goal of this paper is to identify the appropriate amount of diacritic restoration needed to reduce word sense ambiguity in the context of machine translation. Diacritic marks help reduce the number of possible lexical word choices assigned to a source word, which leads to better-quality translated sentences. We investigate a variety of linguistically motivated partial diacritization schemes that preserve some of the semantics and, in essence, complement the implicit contextual information present in the sentences. We also study the effect of training data size and report results on three standard test sets that represent a combination of different genres. The results show statistically significant improvements for some schemes compared to two baselines: text with no diacritics (the typical writing system adopted for Arabic) and text that is fully diacritized.
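To make the two baselines concrete, the following is a minimal sketch of how fully diacritized text might be reduced to undiacritized or partially diacritized text. It is an illustration only, not the authors' preprocessing: the function name `strip_diacritics`, the `keep` parameter, and the "keep only shadda" scheme are assumptions introduced here, and the abstract does not say which diacritics the paper's schemes actually retain.

```python
# Arabic combining diacritics: tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))
SHADDA = "\u0651"  # gemination mark


def strip_diacritics(text, keep=frozenset()):
    """Remove Arabic diacritics from text, except those listed in `keep`.

    keep=frozenset()  -> the no-diacritics baseline
    keep=DIACRITICS   -> the fully diacritized baseline (text unchanged)
    anything between  -> a hypothetical partial scheme
    """
    return "".join(ch for ch in text if ch not in DIACRITICS or ch in keep)


# Fully diacritized word: ain + fatha, lam + shadda + fatha, meem + fatha.
full = "\u0639\u064E\u0644\u0651\u064E\u0645\u064E"

no_diacritics = strip_diacritics(full)                   # bare letters only
shadda_only = strip_diacritics(full, keep={SHADDA})      # hypothetical partial scheme
unchanged = strip_diacritics(full, keep=DIACRITICS)      # full diacritization
```

In this sketch a partial scheme is just a choice of which marks to pass in `keep`; a linguistically motivated scheme would more plausibly select marks per word or per morphological context, which this character-level filter does not attempt.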
Anthology ID:
2016.amta-researchers.15
Volume:
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track
Month:
October 28 - November 1
Year:
2016
Address:
Austin, TX, USA
Editors:
Spence Green, Lane Schwartz
Venue:
AMTA
Publisher:
The Association for Machine Translation in the Americas
Pages:
191–204
URL:
https://aclanthology.org/2016.amta-researchers.15
Cite (ACL):
Sawsan Alqahtani, Mahmoud Ghoneim, and Mona Diab. 2016. Investigating the Impact of Various Partial Diacritization Schemes on Arabic-English Statistical Machine Translation. In Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track, pages 191–204, Austin, TX, USA. The Association for Machine Translation in the Americas.
Cite (Informal):
Investigating the Impact of Various Partial Diacritization Schemes on Arabic-English Statistical Machine Translation (Alqahtani et al., AMTA 2016)
PDF:
https://aclanthology.org/2016.amta-researchers.15.pdf