David Bamman


2024

pdf bib
Social Meme-ing: Measuring Linguistic Variation in Memes
Naitian Zhou | David Jurgens | David Bamman
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting SemanticMemes dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language.

pdf bib
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Li Lucy | Suchin Gururangan | Luca Soldaini | Emma Strubell | David Bamman | Lauren Klein | Jesse Dodge
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models’ (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten “quality” and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications.

2023

pdf bib
Grounding Characters and Places in Narrative Text
Sandeep Soni | Amanpreet Sihra | Elizabeth Evans | Matthew Wilkens | David Bamman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Tracking characters and locations throughout a story can help improve the understanding of its plot structure. Prior research has analyzed characters and locations from text independently without grounding characters to their locations in narrative time. Here, we address this gap by proposing a new spatial relationship categorization task. The objective of the task is to assign a spatial relationship category for every character and location co-mention within a window of text, taking into consideration linguistic context, narrative tense, and temporal scope. To this end, we annotate spatial relationships in approximately 2500 book excerpts and train a model using contextual embeddings as features to predict these relationships. When applied to a set of books, this model allows us to test several hypotheses on mobility and domestic space, revealing that protagonists are more mobile than non-central characters and that women as characters tend to occupy more interior space than men. Overall, our work is the first step towards joint modeling and analysis of characters and places in narrative text.

pdf bib
Dramatic Conversation Disentanglement
Kent Chang | Danica Chen | David Bamman
Findings of the Association for Computational Linguistics: ACL 2023

We present a new dataset for studying conversation disentanglement in movies and TV series. While previous work has focused on conversation disentanglement in IRC chatroom dialogues, movies and TV shows provide a space for studying complex pragmatic patterns of floor and topic change in face-to-face multi-party interactions. In this work, we draw on theoretical research in sociolinguistics, sociology, and film studies to operationalize a conversational thread (including the notion of a floor change) in dramatic texts, and use that definition to annotate a dataset of 10,033 dialogue turns (comprising 2,209 threads) from 831 movies. We compare the performance of several disentanglement models on this dramatic dataset, and apply the best-performing model to disentangle 808 movies. We see that, contrary to expectation, average thread lengths do not decrease significantly over the past 40 years, and characters portrayed by actors who are women, while underrepresented, initiate more new conversational threads relative to their speaking time.

pdf bib
Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications
Li Lucy | Jesse Dodge | David Bamman | Katherine Keith
Findings of the Association for Computational Linguistics: ACL 2023

Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for out-groups. In this work, we develop and validate an interpretable approach for measuring scholarly jargon from text. Expanding the scope of prior work which focuses on word types, we use word sense induction to also identify words that are widespread but overloaded with different meanings across fields. We then estimate the prevalence of these discipline-specific words and senses across hundreds of subfields, and show that word senses provide a complementary, yet unique view of jargon alongside word types. We demonstrate the utility of our metrics for science of science and computational sociolinguistics by highlighting two key social implications. First, though most fields reduce their use of jargon when writing for general-purpose venues, and some fields (e.g., biological sciences) do so less than others. Second, the direction of correlation between jargon and citation rates varies among fields, but jargon is nearly always negatively correlated with interdisciplinary impact. Broadly, our findings suggest that though multidisciplinary venues intend to cater to more general audiences, some fields’ writing norms may act as barriers rather than bridges, and thus impede the dispersion of scholarly ideas.

pdf bib
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Kent Chang | Mackenzie Cramer | Sandeep Soni | David Bamman
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.

2022

pdf bib
Discovering Differences in the Representation of People using Contextualized Semantic Axes
Li Lucy | Divya Tadimeti | David Bamman
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against “semantic axes” that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring representations. We validate and demonstrate these axes on two people-centric datasets: occupations from Wikipedia, and multi-platform discussions in extremist, men’s communities over fourteen years. In both studies, contextualized semantic axes can characterize differences among instances of the same word type. In the latter study, we show that references to women and the contexts around them have become more detestable over time.

pdf bib
Predicting Long-Term Citations from Short-Term Linguistic Influence
Sandeep Soni | David Bamman | Jacob Eisenstein
Findings of the Association for Computational Linguistics: EMNLP 2022

A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count is not informative about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps: first, identify lexical and semantic changes using contextual embeddings and word frequencies; second, aggregate information about these changes into per-document influence parameters by estimating a high-dimensional Hawkes process with a low-rank parameter matrix. The resulting measures of linguistic influence are predictive of future citations. Specifically, the estimate of linguistic influence from the two years after a paper’s publication is correlated with and predictive of its citation count in the following three years. This is demonstrated using an online evaluation with incremental temporal training/test splits, in comparison with a strong baseline that includes predictors for initial citation counts, topics, and lexical features.

pdf bib
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
David Bamman | Dirk Hovy | David Jurgens | Katherine Keith | Brendan O'Connor | Svitlana Volkova
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

2021

pdf bib
Characterizing English Variation across Social Media Communities with BERT
Li Lucy | David Bamman
Transactions of the Association for Computational Linguistics, Volume 9

Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.

pdf bib
Gender and Representation Bias in GPT-3 Generated Stories
Li Lucy | David Bamman
Proceedings of the Third Workshop on Narrative Understanding

Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.

pdf bib
Narrative Theory for Computational Narrative Understanding
Andrew Piper | Richard Jean So | David Bamman
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Over the past decade, the field of natural language processing has developed a wide array of computational methods for reasoning about narrative, including summarization, commonsense inference, and event detection. While this work has brought an important empirical lens for examining narrative, it is by and large divorced from the large body of theoretical work on narrative within the humanities, social and cognitive sciences. In this position paper, we introduce the dominant theoretical frameworks to the NLP community, situate current research in NLP within distinct narratological traditions, and argue that linking computational work in NLP to theory opens up a range of new empirical questions that would both help advance our understanding of narrative and open up new practical applications.

2020

pdf bib
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science
David Bamman | Dirk Hovy | David Jurgens | Brendan O'Connor | Svitlana Volkova
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

pdf bib
An Annotated Dataset of Coreference in English Literature
David Bamman | Olivia Lewke | Anya Mansoor
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction published between 1719 and 1922. This dataset differs from previous coreference corpora in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult coreference problems common in literature. This dataset allows for an evaluation of cross-domain performance for the task of coreference resolution, and analysis into the characteristics of long-distance within-document coreference.

pdf bib
Attending to Long-Distance Document Context for Sequence Labeling
Matthew Jörke | Jon Gillick | Matthew Sims | David Bamman
Findings of the Association for Computational Linguistics: EMNLP 2020

We present in this work a method for incorporating global context in long documents when making local decisions in sequence labeling problems like NER. Inspired by work in featurized log-linear models (Chieu and Ng, 2002; Sutton and McCallum, 2004), our model learns to attend to multiple mentions of the same word type in generating a representation for each token in context, extending that work to learning representations that can be incorporated into modern neural models. Attending to broader context at test time provides complementary information to pretraining (Gururangan et al., 2020), yields strong gains over equivalently parameterized models lacking such context, and performs best at recognizing entities with high TF-IDF scores (i.e., those that are important within a document).

pdf bib
Measuring Information Propagation in Literary Social Networks
Matthew Sims | David Bamman
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present the task of modeling information propagation in literature, in which we seek to identify pieces of information passing from character A to character B to character C, only given a description of their activity in text. We describe a new pipeline for measuring information propagation in this domain and publish a new dataset for speaker attribution, enabling the evaluation of an important component of this pipeline on a wider range of literary texts than previously studied. Using this pipeline, we analyze the dynamics of information propagation in over 5,000 works of fiction, finding that information flows through characters that fill structural holes connecting different communities, and that characters who are women are depicted as filling this role much more frequently than characters who are men.

2019

pdf bib
Literary Event Detection
Matthew Sims | Jong Ho Park | David Bamman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this work we present a new dataset of literary events—events that are depicted as taking place within the imagined space of a novel. While previous work has focused on event detection in the domain of contemporary news, literature poses a number of complications for existing systems, including complex narration, the depiction of a broad array of mental states, and a strong emphasis on figurative language. We outline the annotation decisions of this new dataset and compare several models for predicting events; the best performing model, a bidirectional LSTM with BERT token representations, achieves an F1 score of 73.9. We then apply this model to a corpus of novels split across two dimensions—prestige and popularity—and demonstrate that there are statistically significant differences in the distribution of events for prestige.

pdf bib
An annotated dataset of literary entities
David Bamman | Sejal Popat | Sheng Shen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a new dataset comprised of 210,532 tokens evenly drawn from 100 different English-language literary texts annotated for ACE entity categories (person, location, geo-political entity, facility, organization, and vehicle). These categories include non-named entities (such as “the boy”, “the kitchen”) and nested structure (such as [[the cook]’s sister]). In contrast to existing datasets built primarily on news (focused on geo-political entities and organizations), literary texts offer strikingly different distributions of entity categories, with much stronger emphasis on people and description of settings. We present empirical results demonstrating the performance of nested entity recognition models in this domain; training natively on in-domain literary data yields an improvement of over 20 absolute points in F-score (from 45.7 to 68.3), and mitigates a disparate impact in performance for male and female entities present in models trained on news data.

pdf bib
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science
Svitlana Volkova | David Jurgens | Dirk Hovy | David Bamman | Oren Tsur
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science

pdf bib
Proceedings of the First Workshop on Narrative Understanding
David Bamman | Snigdha Chaturvedi | Elizabeth Clark | Madalina Fiterau | Mohit Iyyer
Proceedings of the First Workshop on Narrative Understanding

2018

pdf bib
Please Clap: Modeling Applause in Campaign Speeches
Jon Gillick | David Bamman
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

This work examines the rhetorical techniques that speakers employ during political campaigns. We introduce a new corpus of speeches from campaign events in the months leading up to the 2016 U.S. presidential election and develop new models for predicting moments of audience applause. In contrast to existing datasets, we tackle the challenge of working with transcripts that derive from uncorrected closed captioning, using associated audio recordings to automatically extract and align labels for instances of audience applause. In prediction experiments, we find that lexical features carry the most information, but that a variety of features are predictive, including prosody, long-term contextual dependencies, and theoretically motivated features designed to capture rhetorical techniques.

pdf bib
Telling Stories with Soundtracks: An Empirical Analysis of Music in Film
Jon Gillick | David Bamman
Proceedings of the First Workshop on Storytelling

Soundtracks play an important role in carrying the story of a film. In this work, we collect a corpus of movies and television shows matched with subtitles and soundtracks and analyze the relationship between story, song, and audience reception. We look at the content of a film through the lens of its latent topics and at the content of a song through descriptors of its musical attributes. In two experiments, we find first that individual topics are strongly associated with musical attributes, and second, that musical attributes of soundtracks are predictive of film ratings, even after controlling for topic and genre.

2017

pdf bib
The Labeled Segmentation of Printed Books
Lara McConnaughey | Jennifer Dai | David Bamman
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce the task of book structure labeling: segmenting and assigning a fixed category (such as Table of Contents, Preface, Index) to the document structure of printed books. We manually annotate the page-level structural categories for a large dataset totaling 294,816 pages in 1,055 books evenly sampled from 1750-1922, and present empirical results comparing the performance of several classes of models. The best-performing model, a bidirectional LSTM with rich features, achieves an overall accuracy of 95.8 and a class-balanced macro F-score of 71.4.

pdf bib
Adversarial Training for Relation Extraction
Yi Wu | David Bamman | Stuart Russell
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Adversarial training is a mean of regularizing classification algorithms by generating adversarial noise to the training data. We apply adversarial training in relation extraction within the multi-instance multi-label learning framework. We evaluate various neural network architectures on two different datasets. Experimental results demonstrate that adversarial training is generally effective for both CNN and RNN models and significantly improves the precision of predicted relations.

pdf bib
Proceedings of the Second Workshop on NLP and Computational Social Science
Dirk Hovy | Svitlana Volkova | David Bamman | David Jurgens | Brendan O’Connor | Oren Tsur | A. Seza Doğruöz
Proceedings of the Second Workshop on NLP and Computational Social Science

2016

pdf bib
Beyond Canonical Texts: A Computational Analysis of Fanfiction
Smitha Milli | David Bamman
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the First Workshop on NLP and Computational Social Science
David Bamman | A. Seza Doğruöz | Jacob Eisenstein | Dirk Hovy | David Jurgens | Brendan O’Connor | Alice Oh | Oren Tsur | Svitlana Volkova
Proceedings of the First Workshop on NLP and Computational Social Science

2015

pdf bib
Open Extraction of Fine-Grained Political Statements
David Bamman | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson | Brendan O’Connor | Jeffrey Flanigan | David Bamman | Jesse Dodge | Swabha Swayamdipta | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Unsupervised Discovery of Biographical Structure from Text
David Bamman | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 2

We present a method for discovering abstract event classes in biographies, based on a probabilistic latent-variable model. Taking as input timestamped text, we exploit latent correlations among events to learn a set of event classes (such as Born, Graduates High School, and Becomes Citizen), along with the typical times in a person’s life when those events occur. In a quantitative evaluation at the task of predicting a person’s age for a given event, we find that our generative model outperforms a strong linear regression baseline, along with simpler variants of the model that ablate some features. The abstract event classes that we learn allow us to perform a large-scale analysis of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia—not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011)—we find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.

pdf bib
A Bayesian Mixed Effects Model of Literary Character
David Bamman | Ted Underwood | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Distributed Representations of Geographically Situated Language
David Bamman | Chris Dyer | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
Learning Latent Personas of Film Characters
David Bamman | Brendan O’Connor | Noah A. Smith
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Framework for (Under)specifying Dependency Syntax without Overloading Annotators
Nathan Schneider | Brendan O’Connor | Naomi Saphra | David Bamman | Manaal Faruqui | Noah A. Smith | Chris Dyer | Jason Baldridge
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2008

pdf bib
The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
David Bamman | Marco Passarotti | Roberto Busa | Gregory Crane
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of constituents is broken up with elements of other constituents. The two projects are the first of their kind for Latin, so no prior established guidelines for syntactic annotation are available to rely on. The general model for the adopted style of representation is that used by the Prague Dependency Treebank, with departures arising from the Latin grammar of Pinkster, specifically in the traditional grammatical categories of the ablative absolute, the accusative + infinitive, and gerunds/gerundives. Sharing common annotation guidelines allows us to compare the datasets of the two treebanks for tasks such as mutually checking annotation consistency, diachronically studying specific syntactic constructions, and training statistical dependency parsers.

2007

pdf bib
The Latin Dependency Treebank in a Cultural Heritage Digital Library
David Bamman | Gregory Crane
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).