Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam De Lhoneux
Abstract
Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model’s visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.
Anthology ID: 2024.emnlp-main.194
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3307–3320
URL: https://aclanthology.org/2024.emnlp-main.194
Cite (ACL): Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, and Miryam De Lhoneux. 2024. Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3307–3320, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models (Tatariya et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.194.pdf