Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)

Robert C. Gale, Alexandra C. Salem, Gerasimos Fergadiotis, Steven Bedrick


Abstract
Speech-language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g., phonology and semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the International Phonetic Alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word or near-word level, and their pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. On an “easy” and a “hard” version of this task, models fine-tuned from ours are more accurate by a relative 24% and 29%, and improve character error rates by a relative 75% and 31%, respectively, compared with models fine-tuned from the original LLM.
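The abstract does not spell out the implementation, but the following is a minimal sketch, assuming a BERT-style masked LM and the HuggingFace transformers library, of what a self-supervised pronunciation task over mixed orthographic/IPA text could look like: IPA symbols (which a standard orthographic vocabulary covers poorly) are added as tokens, and the model is asked to recover a masked phonemic form from its orthographic context. The model name, symbol inventory, and example text are illustrative placeholders, not the authors' setup.

```python
# Minimal sketch (not the authors' code): a self-supervised pronunciation-
# masking task over mixed orthographic/IPA text, using a BERT-style masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical IPA inventory; a real model would add the full symbol set.
# Symbols already in the vocabulary are skipped by add_tokens().
ipa_symbols = ["ˈ", "k", "æ", "t", "ʤ", "ə", "ɹ"]
tokenizer.add_tokens(ipa_symbols)
model.resize_token_embeddings(len(tokenizer))

# Orthographic context with the word's IPA pronunciation masked out:
# the model must predict the phonemic form from the written form.
# During pre-training, the labels would be the true IPA token ids.
text = f"the cat sat / {tokenizer.mask_token} {tokenizer.mask_token} {tokenizer.mask_token}"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Inspect the top prediction at each masked position.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
for pos in mask_positions:
    top_id = logits[0, pos].argmax(-1).item()
    print(tokenizer.convert_ids_to_tokens(top_id))
```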
Anthology ID:
2023.repl4nlp-1.18
Volume:
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Burcu Can, Maximilian Mozes, Samuel Cahyawijaya, Naomi Saphra, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Chen Zhao, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Lena Voita
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
212–225
URL:
https://aclanthology.org/2023.repl4nlp-1.18
DOI:
10.18653/v1/2023.repl4nlp-1.18
Cite (ACL):
Robert C. Gale, Alexandra C. Salem, Gerasimos Fergadiotis, and Steven Bedrick. 2023. Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT). In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 212–225, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT) (Gale et al., RepL4NLP 2023)
PDF:
https://aclanthology.org/2023.repl4nlp-1.18.pdf