Graphemes vs. phonemes: battling it out in character-based language models

Bastian Bunzeck; Daniel Duran; Leonie Schade; Sina Zarrieß

Abstract

We present grapheme-llama and phoneme-llama, character-based language models trained for the 2024 BabyLM challenge. Through these models, we explore an under-researched approach to downsizing: replacing subword-based tokenization with character-level tokenization, drastically reducing the vocabulary size. The grapheme model is trained on a standard BabyLM dataset, while the phoneme model uses a phoneme-converted version of this dataset. Results show that grapheme-based models perform better overall, achieving scores comparable to subword-based models on grammatical benchmarks. Despite lower performance, phoneme models also demonstrate promising grammatical learning. We argue that our results challenge conventional wisdom on language modeling techniques and open up novel research questions with character- and phoneme-based models as objects of inquiry.
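To illustrate why character-level tokenization shrinks the vocabulary, here is a minimal sketch that builds a vocabulary from the distinct characters in a corpus. It is not the authors' implementation: the function names `build_char_vocab` and `encode`, the special tokens, and the toy corpus are all assumptions made for illustration. The same idea would apply to a phoneme-converted corpus, with one symbol per phoneme instead of per grapheme.

```python
# Hypothetical sketch of character-level tokenization; names and data are
# illustrative assumptions, not taken from the paper.

def build_char_vocab(texts, specials=("<pad>", "<unk>")):
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: i for i, tok in enumerate(list(specials) + chars)}


def encode(text, vocab):
    """Turn a string into a list of ids, one per character."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


# Toy corpus (assumption): the vocabulary is bounded by the alphabet size,
# not by the number of distinct words or subwords.
corpus = ["the cat sat", "the dog ran"]
vocab = build_char_vocab(corpus)
ids = encode("the cat", vocab)
```

The vocabulary here grows only with the number of distinct symbols in the data, whereas a subword vocabulary is typically fixed at tens of thousands of entries.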

Anthology ID: 2024.conll-babylm.5
Volume: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month: November
Year: 2024
Address: Miami, FL, USA
Editors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues: CoNLL | BabyLM | WS
Publisher: Association for Computational Linguistics
Pages: 54–64
URL: https://aclanthology.org/2024.conll-babylm.5/
Cite (ACL): Bastian Bunzeck, Daniel Duran, Leonie Schade, and Sina Zarrieß. 2024. Graphemes vs. phonemes: battling it out in character-based language models. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 54–64, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal): Graphemes vs. phonemes: battling it out in character-based language models (Bunzeck et al., CoNLL-BabyLM 2024)
PDF: https://aclanthology.org/2024.conll-babylm.5.pdf