Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text

Kamal Kumar Gupta, Soumya Chennabasavaraj, Nikesh Garera, Asif Ekbal


Abstract
Reviews written by users for a particular product or service play an influential role in helping customers make an informed decision. Although online e-commerce portals have immensely impacted our lives, the available content is predominantly in English, which often limits its widespread use. There is an exponential growth in the number of e-commerce users who are not proficient in English. Hence, there is a need to make these services available in non-English languages, especially in a multilingual country like India. This can be achieved by a robust in-domain machine translation (MT) system. However, user-written reviews pose unique challenges to MT, such as misspelled words, ungrammatical constructions, colloquial terms, and a lack of resources such as in-domain parallel corpora. We address these challenges by presenting an English–Hindi review-domain parallel corpus. We train an English–to–Hindi neural machine translation (NMT) system to translate the product reviews available on e-commerce websites. By training a Transformer-based NMT model on the generated data, we achieve a score of 33.26 BLEU points for English–to–Hindi translation. To make our NMT model robust enough to handle the noisy tokens in the reviews, we integrate a character-based language model to generate word vectors and map the noisy tokens to their correct forms. Experiments on four language pairs, viz. English–Hindi, English–German, English–French, and English–Czech, show BLEU scores of 35.09, 28.91, 34.68 and 14.52, which are improvements of 1.61, 1.05, 1.63 and 1.94, respectively, over the baseline.
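The idea of mapping noisy tokens to their correct forms can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes word vectors produced by a character-based language model, whereas this sketch substitutes plain character-trigram count vectors compared by cosine similarity. The function names, the toy vocabulary, and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Bag of character n-grams, with boundary markers around the word."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def normalize_token(token, vocab, threshold=0.3):
    """Map a possibly noisy token to its nearest in-vocabulary word.

    Assumption: `vocab` is a set of clean word forms. Tokens already in the
    vocabulary, or with no sufficiently similar neighbour, are left unchanged.
    """
    if not vocab or token.lower() in vocab:
        return token
    token_vec = char_ngrams(token)
    best, score = max(
        ((word, cosine(token_vec, char_ngrams(word))) for word in vocab),
        key=lambda pair: pair[1],
    )
    return best if score >= threshold else token


# Toy vocabulary (hypothetical), standing in for the clean training vocabulary.
vocab = {"product", "quality", "delivery", "excellent"}
review = "prodct qualty is excellent"
cleaned = " ".join(normalize_token(t, vocab) for t in review.split())
```

In this sketch the normalization runs as a preprocessing step before translation, so the misspellings "prodct" and "qualty" are replaced by vocabulary words while out-of-vocabulary tokens with no close match, such as "is", pass through untouched.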
Anthology ID:
2021.ecnlp-1.21
Volume:
Proceedings of The 4th Workshop on e-Commerce and NLP
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | ECNLP | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
174–183
URL:
https://aclanthology.org/2021.ecnlp-1.21
DOI:
10.18653/v1/2021.ecnlp-1.21
Cite (ACL):
Kamal Kumar Gupta, Soumya Chennabasavaraj, Nikesh Garera, and Asif Ekbal. 2021. Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text. In Proceedings of The 4th Workshop on e-Commerce and NLP, pages 174–183, Online. Association for Computational Linguistics.
Cite (Informal):
Product Review Translation: Parallel Corpus Creation and Robustness towards User-generated Noisy Text (Gupta et al., ECNLP 2021)
PDF:
https://aclanthology.org/2021.ecnlp-1.21.pdf
Data
MTNT