Enriching Language Models with Visually-grounded Word Vectors and the Lancaster Sensorimotor Norms

Casey Kennington


Abstract
Language models are trained only on text, even though humans learn their first language in a highly interactive and multimodal environment where the first words learned are largely concrete, denoting physical entities and embodied states. To enrich language models with some of this missing experience, we leverage two sources of information: (1) the Lancaster Sensorimotor norms, which provide ratings (means and standard deviations) for over 40,000 English words along several dimensions of embodiment, and which capture the extent to which something is experienced across 11 different sensory modalities, and (2) vectors from coefficients of binary classifiers trained on images for the BERT vocabulary. We pre-train the ELECTRA model and fine-tune the RoBERTa model with these two sources of information, then evaluate using the established GLUE benchmark and the Visual Dialog benchmark. We find that enriching language models with the Lancaster norms and image vectors improves results on both tasks, with some implications for robust language models that capture holistic linguistic meaning in a language learning context.
Anthology ID:
2021.conll-1.11
Volume:
Proceedings of the 25th Conference on Computational Natural Language Learning
Month:
November
Year:
2021
Address:
Online
Venues:
CoNLL | EMNLP
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
148–157
URL:
https://aclanthology.org/2021.conll-1.11
PDF:
https://aclanthology.org/2021.conll-1.11.pdf
Data
Conceptual Captions | GLUE | VisDial