Evelina Fedorenko


2024

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Chengxu Zhuang | Evelina Fedorenko | Jacob Andreas
Findings of the Association for Computational Linguistics: ACL 2024

Today’s most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs’ representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in terms of learning efficiency in small and developmentally plausible data regimes, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks compared to other models trained on the same amount of text data. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
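For intuition, here is a minimal sketch of the kind of combined objective the abstract describes: a standard next-token prediction loss plus an InfoNCE-style contrastive term that aligns text representations with paired image embeddings. The pooling, layer selection, and loss weighting (`alpha`, `temperature`) here are illustrative assumptions, not the exact LCG configuration.

```python
import torch
import torch.nn.functional as F

def lexicontrastive_style_loss(lm_logits, targets, token_reprs, image_embs,
                               temperature=0.07, alpha=1.0):
    """Sketch of a combined LM + contrastive grounding objective.

    lm_logits:   (batch, seq_len, vocab) next-token prediction logits
    targets:     (batch, seq_len) gold token ids
    token_reprs: (batch, dim) pooled early-layer text representations
    image_embs:  (batch, dim) paired image embeddings
    """
    # Standard next-token prediction (language modeling) loss.
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              targets.reshape(-1))

    # Symmetric InfoNCE-style contrastive term: matched (text_i, image_i)
    # pairs are positives, all other pairs in the batch are negatives.
    text = F.normalize(token_reprs, dim=-1)
    image = F.normalize(image_embs, dim=-1)
    logits = text @ image.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive_loss = (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels)) / 2

    return lm_loss + alpha * contrastive_loss
```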

Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Chengxu Zhuang | Evelina Fedorenko | Jacob Andreas
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways: they require orders of magnitude more language data than children receive during development, and they learn without perceptual or social context. Do models trained more naturalistically, with grounded supervision, exhibit more human-like language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models’ learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and are sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant: models driven mainly by visual information yield qualitatively different representations from those driven mainly by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.
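One of the evaluations named above, word similarity, is typically scored by correlating the model’s embedding similarities with human similarity ratings. A minimal sketch of that style of benchmark (the data format and correlation choice are generic assumptions, not the paper’s exact setup):

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(embeddings, human_pairs):
    """Correlate model similarity with human judgments.

    embeddings:  dict mapping word -> vector (np.ndarray)
    human_pairs: list of (word1, word2, human_rating) tuples
    """
    model_sims, human_sims = [], []
    for w1, w2, rating in human_pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            # Cosine similarity between the two word vectors.
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            model_sims.append(cos)
            human_sims.append(rating)
    # Rank correlation between model and human similarity orderings.
    rho, _ = spearmanr(model_sims, human_sims)
    return rho
```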

Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models
Carina Kauf | Emmanuele Chersoni | Alessandro Lenci | Evelina Fedorenko | Anna A Ivanova
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Semantic plausibility (e.g., knowing that “the actor won the award” is more likely than “the actor won the battle”) serves as an effective proxy for general world knowledge. Language models (LMs) capture vast amounts of world knowledge by learning distributional patterns in text, accessible via the log probabilities (LogProbs) they assign to plausible vs. implausible outputs. The new generation of instruction-tuned LMs can now also provide explicit estimates of plausibility via prompting. Here, we evaluate the effectiveness of LogProbs and basic prompting for measuring semantic plausibility, both in single-sentence minimal pairs (Experiment 1) and short context-dependent scenarios (Experiment 2). We find that (i) in both base and instruction-tuned LMs, LogProbs offer a more reliable measure of semantic plausibility than direct zero-shot prompting, which yields inconsistent and often poor results; (ii) instruction-tuning generally does not alter, and sometimes decreases, the sensitivity of LogProbs to semantic plausibility; (iii) across models, context mostly modulates LogProbs in expected ways, as measured by three novel metrics of context-sensitive plausibility and their match to explicit human plausibility judgments. We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility in both base and instruction-tuned LMs.
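As a concrete illustration of the LogProbs measure, a sentence’s summed log probability under a causal LM can be computed as below and compared across a minimal pair. The model choice (`gpt2`) and the lack of length normalization are illustrative simplifications, not the paper’s exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; an illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def sentence_logprob(sentence):
    """Summed log probability of a sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    # The returned loss is the mean negative log-likelihood over the
    # predicted tokens; multiply back out to get the summed logprob.
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

plausible = sentence_logprob("The actor won the award.")
implausible = sentence_logprob("The actor won the battle.")
print(plausible > implausible)  # LogProbs should favor the plausible variant
```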

2023

A fine-grained comparison of pragmatic language understanding in humans and language models
Jennifer Hu | Sammy Floyd | Olessia Jouravlev | Evelina Fedorenko | Edward Gibson
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pragmatics and non-literal language understanding are essential to human communication, and present a long-standing challenge for artificial language models. We perform a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials. We ask whether models (1) select pragmatic interpretations of speaker utterances, (2) exhibit error patterns similar to those of humans, and (3) use similar linguistic cues as humans to solve the tasks. We find that the largest models achieve high accuracy and match human error patterns: within incorrect responses, models favor literal interpretations over heuristic-based distractors. We also find preliminary evidence that models and humans are sensitive to similar linguistic cues. Our results suggest that pragmatic behaviors can emerge in models without explicitly constructed representations of mental states. However, models tend to struggle with phenomena that rely on violations of social expectations.
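Zero-shot evaluation of interpretation choices is often implemented by scoring each candidate answer’s conditional log probability given the context and taking the argmax. The sketch below follows that generic recipe; the paper’s actual prompts and scoring details may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def pick_interpretation(context, options):
    """Score each candidate interpretation by its conditional log
    probability given the context, and return the best-scoring one."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.size(1)
    scores = []
    for option in options:
        # Assumes the context tokens form a prefix of the full sequence,
        # which holds at whitespace boundaries with GPT-2's BPE.
        ids = tokenizer(context + " " + option,
                        return_tensors="pt").input_ids
        # Position i of the logits predicts token i + 1.
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
        option_ids = ids[0, ctx_len:]
        token_lps = logprobs[ctx_len - 1:].gather(1, option_ids.unsqueeze(1))
        scores.append(token_lps.sum().item())
    return options[scores.index(max(scores))]
```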

Quantifying the redundancy between prosody and text
Lukas Wolf | Tiago Pimentel | Evelina Fedorenko | Ryan Cotterell | Alex Warstadt | Ethan Wilcox | Tamar Regev
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Prosody, the suprasegmental component of speech that includes pitch, loudness, and tempo, carries critical aspects of meaning. However, the relationship between the information conveyed by prosody and the information conveyed by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word’s prosodic information is redundant both with the word itself and with the context preceding and following it. Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
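The core measurement, how well a prosodic feature can be predicted from text embeddings, can be approximated with a simple cross-validated regression. The paper frames redundancy in information-theoretic terms estimated with predictive models; the ridge-regression R² below is a simplified proxy for that idea, with `alpha` and the fold count as illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def prosody_predictability(word_embeddings, prosodic_feature):
    """How well a prosodic feature (e.g., per-word mean pitch) can be
    predicted from word embeddings; higher scores indicate more
    redundancy between the text and the prosodic signal.

    word_embeddings:  (n_words, dim) contextual or static embeddings
    prosodic_feature: (n_words,) one prosodic value per word
    """
    model = Ridge(alpha=1.0)
    # Cross-validated R^2 of the embedding -> prosody mapping.
    scores = cross_val_score(model, word_embeddings, prosodic_feature,
                             cv=5, scoring="r2")
    return scores.mean()

# Comparing contextual LLM embeddings against non-contextual embeddings
# on the same words separates context-driven redundancy from redundancy
# carried by word identity alone.
```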

2022

SentSpace: Large-Scale Benchmarking and Evaluation of Text using Cognitively Motivated Lexical, Syntactic, and Semantic Features
Greta Tuckute | Aalok Sathe | Mingye Wang | Harley Yoder | Cory Shain | Evelina Fedorenko
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

SentSpace is a modular framework for streamlined evaluation of text. SentSpace characterizes textual input using diverse lexical, syntactic, and semantic features derived from corpora and psycholinguistic experiments. Core sentence features fall into three primary feature spaces: 1) Lexical, 2) Contextual, and 3) Embeddings. To aid in the analysis of computed features, SentSpace provides a web interface for interactive visualization and comparison with text from large corpora. The modular design of SentSpace allows researchers to easily integrate their own feature computation into the pipeline while benefiting from a common framework for evaluation and visualization. In this manuscript we describe the design of SentSpace and its core feature spaces, and demonstrate an example use case by comparing human-written and machine-generated (GPT2-XL) sentences to each other. We find that while GPT2-XL-generated text appears fluent at the surface level, psycholinguistic norms and measures of syntactic processing reveal key differences between text produced by humans and machines. Thus, SentSpace provides a broad set of cognitively motivated linguistic features for the evaluation of text within natural language processing, cognitive science, and the social sciences.
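To illustrate the flavor of the Lexical feature space, the sketch below summarizes a sentence with per-word lexical statistics. This is not SentSpace’s actual API, and the `FREQ_NORMS` values are made-up stand-ins; real pipelines draw such norms from corpora and psycholinguistic databases.

```python
import statistics

# Hypothetical frequency norms for illustration only; real norms come
# from resources such as large corpora or psycholinguistic databases.
FREQ_NORMS = {"the": 6.2, "actor": 3.1, "won": 3.8, "award": 3.4}

def lexical_features(sentence):
    """Sentence-level summaries of per-word lexical features, in the
    spirit of SentSpace's Lexical feature space (not its actual API)."""
    words = sentence.lower().strip(".").split()
    lengths = [len(w) for w in words]
    freqs = [FREQ_NORMS[w] for w in words if w in FREQ_NORMS]
    return {
        "n_words": len(words),
        "mean_word_length": statistics.mean(lengths),
        "mean_log_frequency": statistics.mean(freqs) if freqs else None,
    }

print(lexical_features("The actor won the award."))
```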

2019

Syntactic dependencies correspond to word pairs with high mutual information
Richard Futrell | Peng Qian | Edward Gibson | Evelina Fedorenko | Idan Blank
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

2018

The Natural Stories Corpus
Richard Futrell | Edward Gibson | Harry J. Tily | Idan Blank | Anastasia Vishnevetsky | Steven Piantadosi | Evelina Fedorenko
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)