Benjamin Bergen


2024

Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)
Cameron R. Jones | Sean Trott | Benjamin Bergen
Transactions of the Association for Computational Linguistics, Volume 12

We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.

A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages
Catherine Arnett | Tyler A. Chang | Benjamin Bergen
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
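
As a rough illustration of the byte premium computation (a minimal sketch, not the authors' released tool), the quantity reduces to a ratio of UTF-8 byte counts over content-matched text:

```python
# Minimal sketch of the byte premium: the ratio of UTF-8 bytes needed to encode
# content-matched (parallel) text in two languages. The parallel sentences below
# are illustrative placeholders, not data from the paper.

def byte_premium(parallel_text_a: list[str], parallel_text_b: list[str]) -> float:
    """Bytes used by language A relative to language B for the same content."""
    bytes_a = sum(len(s.encode("utf-8")) for s in parallel_text_a)
    bytes_b = sum(len(s.encode("utf-8")) for s in parallel_text_b)
    return bytes_a / bytes_b

# Hypothetical parallel pair (Greek vs. English); Greek letters take 2 bytes each in UTF-8.
greek = ["Η γάτα κοιμάται."]
english = ["The cat is sleeping."]
print(byte_premium(greek, english))  # > 1: Greek needs more bytes for the same content
```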

Correlations between Multilingual Language Model Geometry and Crosslingual Transfer Performance
Cheril Shah | Yashashree Chandak | Atharv Mahesh Mane | Benjamin Bergen | Tyler A. Chang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A common approach to interpreting multilingual language models is to evaluate their internal representations. For example, studies have found that languages occupy distinct subspaces in the models’ representation spaces, and geometric distances between languages often reflect linguistic properties such as language families and typological features. In our work, we investigate whether geometric distances between language representations correlate with zero-shot crosslingual transfer performance for POS-tagging and NER in three multilingual language models. We consider four distance metrics, including new metrics that identify a basis for a multilingual representation space that sorts axes based on their language-separability. We find that each distance metric either only moderately correlates or does not correlate with crosslingual transfer performance, and metrics do not generalize well across models, layers, and tasks. Although pairwise language separability is a reasonable predictor of crosslingual transfer, representational geometry overall is an inconsistent predictor for the crosslingual performance of multilingual language models.
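
For intuition, one simple member of this family of measures (a hedged sketch with random stand-in data, not the paper's exact metrics or results) is the cosine distance between per-language mean representations, correlated against zero-shot transfer scores:

```python
# Illustrative sketch: represent each language by the mean of its hidden states,
# measure its cosine distance from the source language, and check how well that
# distance tracks zero-shot transfer performance. All numbers below are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def language_centroid(hidden_states: np.ndarray) -> np.ndarray:
    """Mean-pool a (num_tokens, hidden_dim) matrix of representations for one language."""
    return hidden_states.mean(axis=0)

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
reps = {lang: rng.normal(size=(100, 768)) for lang in ["de", "fr", "hi", "zh"]}
transfer_f1 = {"de": 0.82, "fr": 0.80, "hi": 0.61, "zh": 0.55}  # hypothetical NER scores

source_centroid = language_centroid(rng.normal(size=(100, 768)))  # e.g., English
distances = [cosine_distance(source_centroid, language_centroid(reps[l])) for l in transfer_f1]
rho, _ = spearmanr(distances, list(transfer_f1.values()))
print(f"Spearman correlation between distance and transfer: {rho:.2f}")
```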

2023

Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers
James Michaelov | Benjamin Bergen
Findings of the Association for Computational Linguistics: ACL 2023

How well do language models deal with quantification? In this study, we focus on ‘few’-type quantifiers, as in ‘few children like toys’, which might pose a particular challenge for language models because the sentence components without the quantifier are likely to co-occur, and ‘few’-type quantifiers are rare. We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do all the models perform poorly on ‘few’-type quantifiers, but overall the larger the model, the worse its performance. This inverse scaling is consistent with previous work suggesting that larger models increasingly reflect online rather than offline human processing, and we argue that the decreasing performance of larger models may pose a challenge to the use of language models as the basis of natural language systems.
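
The scoring step behind this kind of comparison can be sketched as follows (an illustrative example with two stand-in models, not the paper's 22-model setup or stimuli): compute the surprisal of the word following the quantifier under autoregressive LMs of different sizes.

```python
# Hedged sketch: surprisal (negative log probability) of a critical word given its
# context, computed for autoregressive transformer LMs of different sizes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def critical_word_surprisal(model_name: str, context: str, critical_word: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    full_ids = tokenizer(context + " " + critical_word, return_tensors="pt").input_ids
    n_context = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Sum surprisal over the subword tokens of the critical word (in nats).
    surprisal = 0.0
    for pos in range(n_context, full_ids.shape[1]):
        surprisal -= log_probs[0, pos - 1, full_ids[0, pos]].item()
    return surprisal

for name in ["gpt2", "gpt2-large"]:  # smaller vs. larger model, as an illustration
    print(name, round(critical_word_surprisal(name, "Few children like", "toys"), 2))
```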

2022

Collateral facilitation in humans and language models
James Michaelov | Benjamin Bergen
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.

The Geometry of Multilingual Language Model Representations
Tyler Chang | Zhuowen Tu | Benjamin Bergen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language. Using XLM-R as a case study, we show that languages occupy similar linear subspaces after mean-centering, evaluated based on causal effects on language modeling performance and direct comparisons between subspaces for 88 languages. The subspace means differ along language-sensitive axes that are relatively stable throughout middle layers, and these axes encode information such as token vocabularies. Shifting representations by language means is sufficient to induce token predictions in different languages. However, we also identify stable language-neutral axes that encode information such as token positions and part-of-speech. We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information. These results demonstrate that multilingual language models encode information along orthogonal language-sensitive and language-neutral axes, allowing the models to extract a variety of features for downstream tasks and cross-lingual transfer learning.
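
The mean-shifting result can be illustrated with a small sketch (random stand-in vectors with assumed shapes, not actual XLM-R activations):

```python
# Illustrative sketch: subtracting one language's mean representation and adding
# another's moves a hidden state between language-specific subspaces, while the
# mean-centered component carries the language-neutral information.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 768

# Hypothetical hidden states for tokens from two languages.
english_states = rng.normal(loc=0.5, size=(1000, hidden_dim))
spanish_states = rng.normal(loc=-0.5, size=(1000, hidden_dim))

english_mean = english_states.mean(axis=0)
spanish_mean = spanish_states.mean(axis=0)

# Mean-centering projects English tokens into a shared, language-neutral space...
centered_english = english_states - english_mean

# ...and adding the Spanish mean shifts them into the Spanish subspace (in the paper,
# this kind of shift is sufficient to induce token predictions in the other language).
shifted_to_spanish = centered_english + spanish_mean
print(shifted_to_spanish.shape)  # (1000, 768)
```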

2021

RAW-C: Relatedness of Ambiguous Words in Context (A New Lexical Resource for English)
Sean Trott | Benjamin Bergen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Most words are ambiguous (i.e., they convey distinct meanings in different contexts), and even the meanings of unambiguous words are context-dependent. Both phenomena present a challenge for NLP. Recently, the advent of contextualized word embeddings has led to success on tasks involving lexical ambiguity, such as Word Sense Disambiguation. However, there are few tasks that directly evaluate how well these contextualized embeddings accommodate the more continuous, dynamic nature of word meaning, particularly in a way that matches human intuitions. We introduce RAW-C, a dataset of graded, human relatedness judgments for 112 ambiguous words in context (with 672 sentence pairs total), as well as human estimates of sense dominance. The average inter-annotator agreement (assessed using a leave-one-annotator-out method) was 0.79. We then show that a measure of cosine distance, computed using contextualized embeddings from BERT and ELMo, correlates with human judgments, but that cosine distance also systematically underestimates how similar humans find uses of the same sense of a word to be, and systematically overestimates how similar humans find uses of different-sense homonyms. Finally, we propose a synthesis between psycholinguistic theories of the mental lexicon and computational models of lexical semantics.
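
The cosine-distance measure referred to above can be sketched roughly as follows (illustrative sentences and a mean-pooling choice of ours, not the RAW-C items or the paper's exact pipeline):

```python
# Hedged sketch: contextualized embedding of an ambiguous target word in two
# sentences, compared with cosine distance.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def target_embedding(sentence: str, target: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    # Average the hidden states of the target word's subword tokens.
    target_ids = tokenizer(target, add_special_tokens=False).input_ids
    tokens = enc.input_ids[0].tolist()
    for i in range(len(tokens) - len(target_ids) + 1):
        if tokens[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    raise ValueError("target word not found in sentence")

emb_a = target_embedding("she deposited the check at the bank", "bank")
emb_b = target_embedding("they had a picnic on the bank of the river", "bank")
print(1 - torch.cosine_similarity(emb_a, emb_b, dim=0).item())  # cosine distance
```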

2020

How well does surprisal explain N400 amplitude under different experimental conditions?
James Michaelov | Benjamin Bergen
Proceedings of the 24th Conference on Computational Natural Language Learning

We investigate the extent to which word surprisal can be used to predict a neural measure of human language processing difficulty—the N400. To do this, we use recurrent neural networks to calculate the surprisal of stimuli from previously published neurolinguistic studies of the N400. We find that surprisal can predict N400 amplitude in a wide range of cases, and the cases where it cannot do so provide valuable insight into the neurocognitive processes underlying the response.
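
The analysis logic reduces to asking how much variance in N400 amplitude is explained by surprisal, the negative log probability of a word given its context; a minimal sketch with hypothetical per-item numbers (not the paper's data) looks like this:

```python
# Minimal sketch: regress N400 amplitude on language-model surprisal across items.
import numpy as np
from scipy import stats

# Hypothetical per-item values: surprisal of the critical word (in bits) and the
# mean N400 amplitude (in microvolts) measured for that word.
surprisal = np.array([2.1, 4.7, 8.3, 11.0, 13.5, 15.2])
n400_amplitude = np.array([-0.8, -1.6, -2.9, -3.4, -4.1, -4.5])

result = stats.linregress(surprisal, n400_amplitude)
print(f"slope={result.slope:.2f}  r^2={result.rvalue ** 2:.2f}")
```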

2016

Literal and Metaphorical Senses in Compositional Distributional Semantic Models
E. Dario Gutiérrez | Ekaterina Shutova | Tyler Marghetis | Benjamin Bergen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Finding Non-Arbitrary Form-Meaning Systematicity Using String-Metric Learning for Kernel Regression
E. Dario Gutiérrez | Roger Levy | Benjamin Bergen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)