Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation

Dama Sravani, Radhika Mamidi


Abstract
Code-mixing, the act of mixing two or more languages, is a common communicative phenomenon in multilingual societies. The lack of high-quality code-mixed data is a bottleneck for NLP systems. Monolingual systems, on the other hand, perform well because ample high-quality data is available. To bridge this gap, creating coherent translations of monolingual sentences into their code-mixed counterparts can improve accuracy on downstream NLP tasks in code-mixed settings. In this paper, we propose a neural machine translation approach that generates high-quality code-mixed sentences by leveraging human judgements. We train filters based on human judgements to identify natural code-mixed sentences in a larger synthetically generated code-mixed corpus, resulting in a three-way silver parallel corpus of monolingual English, a monolingual Indian language, and code-mixed English with that Indian language. Using these corpora, we fine-tune multilingual encoder-decoder models, viz., mT5 and mBART, for the translation task. Our results indicate that training on filtered data outperforms current systems for code-mixed generation in Hindi-English. Beyond Hindi-English, the approach also performs well when applied to Telugu, a low-resource language, to generate Telugu-English code-mixed sentences.
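The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Triple` type, the `filter_corpus` function, the `toy_score` heuristic, the tiny word list, the threshold, and the example sentences are all hypothetical stand-ins. In the paper the filter is trained on human judgements; here a crude language-balance heuristic takes its place only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Triple:
    """One entry of a three-way parallel corpus (hypothetical layout)."""
    english: str
    indic: str
    code_mixed: str


def filter_corpus(
    triples: Iterable[Triple],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Triple]:
    """Keep triples whose code-mixed side scores at or above the threshold."""
    return [t for t in triples if score(t.code_mixed) >= threshold]


# Assumption: a toy English vocabulary, used only by the placeholder scorer.
ENGLISH_WORDS = {"the", "movie", "was", "very", "good"}


def toy_score(sentence: str) -> float:
    """Placeholder for a trained naturalness filter.

    Returns 1.0 when half of the tokens are English and 0.0 when the
    sentence is fully monolingual. A real filter would be a classifier
    trained on human judgements.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    english_fraction = sum(tok in ENGLISH_WORDS for tok in tokens) / len(tokens)
    return 1.0 - abs(english_fraction - 0.5) * 2


# Made-up synthetic candidates for a single source sentence.
candidates = [
    Triple("The movie was very good", "film bahut acchi thi", "movie bahut acchi thi"),
    Triple("The movie was very good", "film bahut acchi thi", "the movie was very good"),
]

kept = filter_corpus(candidates, toy_score, threshold=0.5)
for triple in kept:
    print(triple.code_mixed)
```

The triples that survive filtering would then serve as the silver parallel data for fine-tuning a multilingual encoder-decoder model such as mT5 or mBART.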
Anthology ID:
2023.conll-1.15
Volume:
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Jing Jiang, David Reitter, Shumin Deng
Venue:
CoNLL
Publisher:
Association for Computational Linguistics
Pages:
211–220
URL:
https://aclanthology.org/2023.conll-1.15
DOI:
10.18653/v1/2023.conll-1.15
Cite (ACL):
Dama Sravani and Radhika Mamidi. 2023. Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 211–220, Singapore. Association for Computational Linguistics.
Cite (Informal):
Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation (Sravani & Mamidi, CoNLL 2023)
PDF:
https://aclanthology.org/2023.conll-1.15.pdf