Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Genta Indra Winata; Andrea Madotto; Chien-Sheng Wu; Pascale Fung

doi:10.18653/v1/K19-1026

Abstract

Training code-switched language models is difficult due to a lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, these theories require external word alignments or constituency parsers, which produce erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch from one language to the other. Moreover, it captures code-switching constraints by attending to and aligning the words in the inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.
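As a rough illustration of how a copy mechanism can combine words from parallel sentences, the sketch below mixes a generation distribution with a copy distribution over the source tokens, in the style of a pointer-generator network. It is not taken from the paper: the function name `copy_mixture`, the gate value `p_gen`, the attention weights, and the toy sentence pair are all hypothetical, and the abstract does not specify which copy formulation the authors use.

```python
from collections import defaultdict


def copy_mixture(p_gen, vocab_probs, attention, source_tokens):
    """Mix a generation distribution with a copy distribution.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_probs   -- dict mapping word -> generation probability
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- tokens of the concatenated parallel sentences
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")

    final = defaultdict(float)
    # Generation branch: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_probs.items():
        final[word] += p_gen * prob
    # Copy branch: each source position contributes its attention weight
    # to the token found at that position.
    for weight, token in zip(attention, source_tokens):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


# Hypothetical toy input: an English sentence followed by its translation.
source_tokens = ["i", "like", "tea", "我", "喜欢", "茶"]
attention = [0.05, 0.05, 0.10, 0.05, 0.05, 0.70]
vocab_probs = {"tea": 0.6, "茶": 0.2, "coffee": 0.2}

dist = copy_mixture(0.3, vocab_probs, attention, source_tokens)
best = max(dist, key=dist.get)
```

With a low `p_gen`, most of the probability mass comes from the attended source position, so the next word is effectively copied from one of the two parallel sentences. Which sentence the attention lands on at each step is what determines where the output switches language.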

Anthology ID: K19-1026
Volume: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Mohit Bansal, Aline Villavicencio
Venue: CoNLL
SIG: SIGNLL
Publisher: Association for Computational Linguistics
Pages: 271–280
URL: https://aclanthology.org/K19-1026/
DOI: 10.18653/v1/K19-1026
Cite (ACL): Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019. Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences (Winata et al., CoNLL 2019)
PDF: https://aclanthology.org/K19-1026.pdf
Supplementary material: K19-1026.Supplementary_Material.pdf
