@inproceedings{gupta-etal-2020-semi,
title = "A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning",
author = "Gupta, Deepak and
Ekbal, Asif and
Bhattacharyya, Pushpak",
editor = "Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.206",
doi = "10.18653/v1/2020.findings-emnlp.206",
pages = "2267--2280",
abstract = "Code-mixing, the interleaving of two or more languages within a sentence or discourse is ubiquitous in multilingual societies. The lack of code-mixed training data is one of the major concerns for the development of end-to-end neural network-based models to be deployed for a variety of natural language processing (NLP) applications. A potential solution is to either manually create or crowd-source the code-mixed labelled data for the task at hand, but that requires much human efforts and often not feasible because of the language specific diversity in the code-mixed text. To circumvent the data scarcity issue, we propose an effective deep learning approach for automatically generating the code-mixed text from English to multiple languages without any parallel data. In order to train the neural network, we create synthetic code-mixed texts from the available parallel corpus by modelling various linguistic properties of code-mixing. Our codemixed text generator is built upon the encoder-decoder framework, where the encoder is augmented with the linguistic and task-agnostic features obtained from the transformer based language model. We also transfer the knowledge from a neural machine translation (NMT) to warm-start the training of code-mixed generator. Experimental results and in-depth analysis show the effectiveness of our proposed code-mixed text generation on eight diverse language pairs.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="gupta-etal-2020-semi">
<titleInfo>
<title>A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning</title>
</titleInfo>
<name type="personal">
<namePart type="given">Deepak</namePart>
<namePart type="family">Gupta</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Asif</namePart>
<namePart type="family">Ekbal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pushpak</namePart>
<namePart type="family">Bhattacharyya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2020-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EMNLP 2020</title>
</titleInfo>
<name type="personal">
<namePart type="given">Trevor</namePart>
<namePart type="family">Cohn</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yulan</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yang</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Online</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
    <abstract>Code-mixing, the interleaving of two or more languages within a sentence or discourse, is ubiquitous in multilingual societies. The lack of code-mixed training data is one of the major concerns for the development of end-to-end neural network-based models to be deployed for a variety of natural language processing (NLP) applications. A potential solution is to either manually create or crowd-source code-mixed labelled data for the task at hand, but this requires substantial human effort and is often infeasible because of the language-specific diversity of code-mixed text. To circumvent the data scarcity issue, we propose an effective deep learning approach for automatically generating code-mixed text from English to multiple languages without any parallel data. In order to train the neural network, we create synthetic code-mixed texts from the available parallel corpus by modelling various linguistic properties of code-mixing. Our code-mixed text generator is built upon the encoder-decoder framework, where the encoder is augmented with linguistic and task-agnostic features obtained from a transformer-based language model. We also transfer knowledge from a neural machine translation (NMT) model to warm-start the training of the code-mixed generator. Experimental results and in-depth analysis show the effectiveness of our proposed code-mixed text generation approach on eight diverse language pairs.</abstract>
<identifier type="citekey">gupta-etal-2020-semi</identifier>
<identifier type="doi">10.18653/v1/2020.findings-emnlp.206</identifier>
<location>
<url>https://aclanthology.org/2020.findings-emnlp.206</url>
</location>
<part>
<date>2020-11</date>
<extent unit="page">
<start>2267</start>
<end>2280</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning
%A Gupta, Deepak
%A Ekbal, Asif
%A Bhattacharyya, Pushpak
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Findings of the Association for Computational Linguistics: EMNLP 2020
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F gupta-etal-2020-semi
%X Code-mixing, the interleaving of two or more languages within a sentence or discourse, is ubiquitous in multilingual societies. The lack of code-mixed training data is one of the major concerns for the development of end-to-end neural network-based models to be deployed for a variety of natural language processing (NLP) applications. A potential solution is to either manually create or crowd-source code-mixed labelled data for the task at hand, but this requires substantial human effort and is often infeasible because of the language-specific diversity of code-mixed text. To circumvent the data scarcity issue, we propose an effective deep learning approach for automatically generating code-mixed text from English to multiple languages without any parallel data. In order to train the neural network, we create synthetic code-mixed texts from the available parallel corpus by modelling various linguistic properties of code-mixing. Our code-mixed text generator is built upon the encoder-decoder framework, where the encoder is augmented with linguistic and task-agnostic features obtained from a transformer-based language model. We also transfer knowledge from a neural machine translation (NMT) model to warm-start the training of the code-mixed generator. Experimental results and in-depth analysis show the effectiveness of our proposed code-mixed text generation approach on eight diverse language pairs.
%R 10.18653/v1/2020.findings-emnlp.206
%U https://aclanthology.org/2020.findings-emnlp.206
%U https://doi.org/10.18653/v1/2020.findings-emnlp.206
%P 2267-2280
Markdown (Informal)
[A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning](https://aclanthology.org/2020.findings-emnlp.206) (Gupta et al., Findings 2020)
ACL
Deepak Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2020. A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2267–2280, Online. Association for Computational Linguistics.
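As a usage note, here is a minimal LaTeX sketch showing how the BibTeX record above can be cited by its key; the file name references.bib and the plain bibliography style are illustrative assumptions, not part of the Anthology export.

```latex
% Minimal sketch: cite the record above by its BibTeX key.
% Assumes the entry is saved in references.bib (hypothetical file name).
\documentclass{article}
\begin{document}
Gupta et al.~\cite{gupta-etal-2020-semi} generate code-mixed text
without requiring parallel code-mixed data.
\bibliographystyle{plain} % any standard style works; ACL venues provide their own
\bibliography{references}
\end{document}
```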