Leon Bergen


2024

IR2: Information Regularization for Information Retrieval
Jianyou Wang | Kaicheng Wang | Xiaoyue Wang | Weili Cao | Ramamohan Paturi | Leon Bergen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline—input, prompt, and output—each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.

2020

Speakers enhance contextually confusable words
Eric Meinhardt | Eric Bakovic | Leon Bergen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent work has found evidence that natural languages are shaped by pressures for efficient communication — e.g. the more contextually predictable a word is, the fewer speech sounds or syllables it has (Piantadosi et al. 2011). Research on the degree to which speech and language are shaped by pressures for effective communication — robustness in the face of noise and uncertainty — has been more equivocal. We develop a measure of contextual confusability during word recognition based on psychoacoustic data. Applying this measure to naturalistic speech corpora, we find evidence suggesting that speakers alter their productions to make contextually more confusable words easier to understand.

Predicting Reference: What do Language Models Learn about Discourse Models?
Shiva Upadhye | Leon Bergen | Andrew Kehler
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Whereas there is a growing literature that probes neural language models to assess the degree to which they have latently acquired grammatical knowledge, little if any research has investigated their acquisition of discourse modeling ability. We address this question by drawing on a rich psycholinguistic literature that has established how different contexts affect referential biases concerning who is likely to be referred to next. The results reveal that, for the most part, the prediction behavior of neural language models does not resemble that of human language users.

Word Frequency Does Not Predict Grammatical Knowledge in Language Models
Charles Yu | Ryan Sie | Nicolas Tedeschi | Leon Bergen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Neural language models learn, to varying degrees of accuracy, the grammatical properties of natural languages. In this work, we investigate whether there are systematic sources of variation in the language models’ accuracy. Focusing on subject-verb agreement and reflexive anaphora, we find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models. Surprisingly, we find that across four orders of magnitude, corpus frequency is unrelated to a noun’s performance on grammatical tasks. Finally, we find that a novel noun’s grammatical properties can be few-shot learned from various types of training data. The results present a paradox: there should be less variation in grammatical performance than is actually observed.

2019

Constraint-based Learning of Phonological Processes
Shraddha Barke | Rose Kunkel | Nadia Polikarpova | Eric Meinhardt | Eric Bakovic | Leon Bergen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Phonological processes are context-dependent sound changes in natural languages. We present an unsupervised approach to learning human-readable descriptions of phonological processes from collections of related utterances. Our approach builds upon a technique from the programming languages community called *constraint-based program synthesis*. We contribute a novel encoding of the learning problem into Boolean Satisfiability constraints, which enables both data efficiency and fast inference. We evaluate our system on textbook phonology problems and datasets from the literature, and show that it achieves high accuracy at interactive speeds.

2013

Arguments and Modifiers from the Learner’s Perspective
Leon Bergen | Edward Gibson | Timothy J. O’Donnell
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)