Robust Dictionary Lookup in Multiple Noisy Orthographies

Lingliang Zhang, Nizar Habash, Godfried Toussaint


Abstract
We present the MultiScript Phonetic Search algorithm, which addresses the problem of language learners looking up unfamiliar words that they have heard. We apply it to Arabic dictionary lookup with noisy queries written in both the Arabic and Roman scripts. The algorithm is based on a computational phonetic distance metric that can optionally be machine-learned. To benchmark its performance, we created the ArabScribe dataset, containing 10,000 noisy transcriptions of random Arabic dictionary words. Our algorithm outperforms Google Translate’s “did you mean” feature, as well as the Yamli smart Arabic keyboard.
Anthology ID:
W17-1315
Volume:
Proceedings of the Third Arabic Natural Language Processing Workshop
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
SIG:
SEMITIC
Publisher:
Association for Computational Linguistics
Pages:
119–129
URL:
https://aclanthology.org/W17-1315
DOI:
10.18653/v1/W17-1315
Cite (ACL):
Lingliang Zhang, Nizar Habash, and Godfried Toussaint. 2017. Robust Dictionary Lookup in Multiple Noisy Orthographies. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 119–129, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Robust Dictionary Lookup in Multiple Noisy Orthographies (Zhang et al., WANLP 2017)
PDF:
https://aclanthology.org/W17-1315.pdf