Bi-Phone: Modeling Inter Language Phonetic Influences in Text

Abhirut Gupta; Ananya B. Sai; Richard Sproat; Yuri Vasilevski; James Ren; Ambarish Jash; Sukhdeep Sodhi; Aravindan Raghuveer

doi:10.18653/v1/2023.acl-long.145

Abstract

A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.

Anthology ID: 2023.acl-long.145
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2580–2592
URL: https://aclanthology.org/2023.acl-long.145/
DOI: 10.18653/v1/2023.acl-long.145
Cite (ACL): Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James Ren, Ambarish Jash, Sukhdeep Sodhi, and Aravindan Raghuveer. 2023. Bi-Phone: Modeling Inter Language Phonetic Influences in Text. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2580–2592, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Bi-Phone: Modeling Inter Language Phonetic Influences in Text (Gupta et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.145.pdf
Video: https://aclanthology.org/2023.acl-long.145.mp4
