@inproceedings{deiseroth-etal-2024-free,
title = "{T}-{FREE}: Subword Tokenizer-Free Generative {LLM}s via Sparse Representations for Memory-Efficient Embeddings",
author = {Deiseroth, Bj{\"o}rn and
Brack, Manuel and
Schramowski, Patrick and
Kersting, Kristian and
Weinbach, Samuel},
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1217",
pages = "21829--21851",
abstract = "Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages.To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85{\%} on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="deiseroth-etal-2024-free">
<titleInfo>
<title>T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings</title>
</titleInfo>
<name type="personal">
<namePart type="given">Björn</namePart>
<namePart type="family">Deiseroth</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Manuel</namePart>
<namePart type="family">Brack</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Patrick</namePart>
<namePart type="family">Schramowski</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kristian</namePart>
<namePart type="family">Kersting</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Samuel</namePart>
<namePart type="family">Weinbach</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yaser</namePart>
<namePart type="family">Al-Onaizan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohit</namePart>
<namePart type="family">Bansal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.</abstract>
<identifier type="citekey">deiseroth-etal-2024-free</identifier>
<location>
<url>https://aclanthology.org/2024.emnlp-main.1217</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>21829</start>
<end>21851</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
%A Deiseroth, Björn
%A Brack, Manuel
%A Schramowski, Patrick
%A Kersting, Kristian
%A Weinbach, Samuel
%Y Al-Onaizan, Yaser
%Y Bansal, Mohit
%Y Chen, Yun-Nung
%S Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F deiseroth-etal-2024-free
%X Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.
%U https://aclanthology.org/2024.emnlp-main.1217
%P 21829-21851
Markdown (Informal)
[T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings](https://aclanthology.org/2024.emnlp-main.1217) (Deiseroth et al., EMNLP 2024)
ACL
Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, and Samuel Weinbach. 2024. T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21829–21851, Miami, Florida, USA. Association for Computational Linguistics.
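
The abstract describes embedding words via sparse activation patterns over character triplets rather than a subword vocabulary. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the bucket count, embedding width, hashing scheme, and pooling choice are all assumptions made only to show how trigram hashing yields a vocabulary-free, overlap-preserving word embedding.

```python
# Illustrative sketch (NOT the T-FREE reference code) of embedding a word
# through a sparse activation pattern over its character trigrams.
# NUM_BUCKETS, EMBED_DIM, the SHA-256 hashing, and sum pooling are assumptions.
import hashlib
import numpy as np

NUM_BUCKETS = 8192   # assumed size of the trigram hash space (replaces a subword vocab)
EMBED_DIM = 64       # assumed embedding width

rng = np.random.default_rng(0)
trigram_embeddings = rng.normal(size=(NUM_BUCKETS, EMBED_DIM)).astype(np.float32)

def char_trigrams(word: str) -> list[str]:
    """Pad the word with spaces and slide a length-3 window over its characters."""
    padded = f" {word.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_buckets(word: str) -> list[int]:
    """Hash each trigram into a bucket; the set of buckets is the sparse activation pattern."""
    return [
        int(hashlib.sha256(tri.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        for tri in char_trigrams(word)
    ]

def embed_word(word: str) -> np.ndarray:
    """Pool (here: sum) the activated trigram embeddings into a single word embedding."""
    return trigram_embeddings[trigram_buckets(word)].sum(axis=0)

# Morphologically related words share trigrams, so their sparse patterns overlap,
# which is the property the abstract credits for exploiting morphological similarity.
shared = set(trigram_buckets("tokenize")) & set(trigram_buckets("tokenizer"))
print(f"shared trigram buckets: {len(shared)}")
print("embedding shape:", embed_word("tokenizer").shape)
```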