A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages

Anoop Kunchukuttan, Siddharth Jain, Rahul Kejriwal


Abstract
We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages. We create a corpus of 600K word pairs mined from parallel translation corpora and monolingual corpora, which is the largest transliteration corpus for Indian languages mined from public sources. We perform a detailed analysis of multilingual transliteration and propose an improved multilingual training recipe for Indic languages. We analyze various factors affecting transliteration quality, such as language family, transliteration direction, and word origin.
Anthology ID:
2021.eacl-main.303
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Editors:
Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
3469–3475
URL:
https://aclanthology.org/2021.eacl-main.303
DOI:
10.18653/v1/2021.eacl-main.303
Cite (ACL):
Anoop Kunchukuttan, Siddharth Jain, and Rahul Kejriwal. 2021. A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3469–3475, Online. Association for Computational Linguistics.
Cite (Informal):
A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages (Kunchukuttan et al., EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.303.pdf
Code
anoopkunchukuttan/indic_transiteration_analysis
Data
Dakshina