David A. Smith

Also published as: David Addison Smith, David Smith


2023

Detecting Syntactic Change with Pre-trained Transformer Models
Liwen Hou | David Smith
Findings of the Association for Computational Linguistics: EMNLP 2023

We investigate the ability of Transformer-based language models to find syntactic differences between the English of the early 1800s and that of the late 1900s. First, we show that a fine-tuned BERT model can distinguish between text from these two periods using syntactic information only; to show this, we employ a strategy to hide semantic information from the text. Second, we make further use of fine-tuned BERT models to identify specific instances of syntactic change and specific words for which a new part of speech was introduced. To do this, we employ an automatic part-of-speech (POS) tagger and use it to train corpus-specific taggers based only on BERT representations pretrained on different corpora. Notably, our methods of identifying specific candidates for syntactic change avoid using any automatic POS tagger on old text, where its performance may be unreliable; instead, our methods use only untagged old text together with tagged modern text. We examine samples and distributional properties of the model output to validate automatically identified cases of syntactic change. Finally, we use our techniques to confirm the historical rise of the progressive construction, a known example of syntactic change.
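
As an illustration of the semantic-hiding step, the following is a minimal sketch of one plausible implementation: open-class words are replaced by their POS tags while function words are kept, so that a classifier fine-tuned on the result can rely only on syntactic cues. The tag inventory and the choice of which classes to keep are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal semantic-hiding sketch: keep closed-class (function) words,
# replace everything else with its POS tag. Assumes nltk's 'punkt' tokenizer
# and 'averaged_perceptron_tagger' data have been downloaded.
import nltk

# Penn Treebank closed-class tags left verbatim (an illustrative choice).
KEEP = {"DT", "IN", "CC", "TO", "MD", "PRP", "PRP$", "WDT", "WP", "EX", "RP"}

def hide_semantics(sentence: str) -> str:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return " ".join(tok if tag in KEEP else tag for tok, tag in tagged)

print(hide_semantics("The committee was examining the new proposal."))
# -> "The NN VBD VBG the JJ NN ."
```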

Composition and Deformance: Measuring Imageability with a Text-to-Image Model
Si Wu | David Smith
Proceedings of the 5th Workshop on Narrative Understanding

Although psycholinguists and psychologists have long studied the tendency of linguistic strings to evoke mental images in hearers or readers, most computational studies have applied this concept of imageability only to isolated words. Using recent developments in text-to-image generation models, such as DALL-E mini, we propose computational methods that use generated images to measure the imageability of both single English words and connected text. We sample text prompts for image generation from three corpora: human-generated image captions, news article sentences, and poem lines. We subject these prompts to different deformances to examine the model’s ability to detect changes in imageability caused by compositional change. We find a high correlation between the proposed computational measures of imageability and human judgments of individual words. We also find that the proposed measures respond more consistently to changes in compositionality than baseline approaches. We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
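
A minimal sketch of an image-based imageability measure in the spirit described above: generate several images for a prompt and score how consistent they are with one another, on the intuition that highly imageable text yields similar generations. Here `generate_images` and `embed_image` are hypothetical stand-ins for a text-to-image model and an image encoder (e.g. CLIP), and mean pairwise cosine similarity is an assumed scoring choice rather than the paper's exact measure.

```python
# Score a prompt's imageability as the mean pairwise cosine similarity of
# embeddings of k generated images; more consistent generations score higher.
import itertools
import numpy as np

def imageability(prompt, generate_images, embed_image, k=8):
    embs = [embed_image(img) for img in generate_images(prompt, n=k)]
    embs = [e / np.linalg.norm(e) for e in embs]  # unit-normalize
    sims = [float(a @ b) for a, b in itertools.combinations(embs, 2)]
    return float(np.mean(sims))
```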

2021

Content-based Models of Quotation
Ansel MacLaughlin | David Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e. the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g. poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTa sequential sentence tagger, achieving an average rho of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.
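
For reference, a self-contained sketch of the two reported metrics, Spearman's rho and NDCG@k, as they would apply to a model's quotability scores for one document's passages. The toy relevance values (e.g. how often each passage was later quoted) are illustrative, and the rank computation assumes no tied values.

```python
import math

def ndcg_at_k(true_rel, pred_scores, k):
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    dcg = sum(true_rel[i] / math.log2(rank + 2) for rank, i in enumerate(order[:k]))
    ideal = sorted(true_rel, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def spearman_rho(xs, ys):  # assumes no ties in either list
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

true_rel = [3, 0, 1, 4, 2]          # e.g. quotation counts per passage
scores = [0.9, 0.1, 0.4, 0.2, 0.7]  # model's predicted quotability
print(spearman_rho(scores, true_rel), ndcg_at_k(true_rel, scores, 5))
```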

Structural Encoding and Pre-training Matter: Adapting BERT for Table-Based Fact Verification
Rui Dong | David Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Growing concern with online misinformation has encouraged NLP research on fact verification. Since writers often base their assertions on structured data, we focus here on verifying textual statements given evidence in tables. Starting from the Table Parsing (TAPAS) model developed for question answering (Herzig et al., 2020), we find that modeling table structure improves a language model pre-trained on unstructured text. Pre-training language models on English Wikipedia table data further improves performance. Pre-training on a question answering task with column-level cell rank information achieves the best performance. With improved pre-training and cell embeddings, this approach outperforms the state-of-the-art Numerically-aware Graph Neural Network table fact verification model (GNN-TabFact), increasing statement classification accuracy from 72.2% to 73.9% even without modeling numerical information. Incorporating numerical information with cell rankings and pre-training on a question answering task increases accuracy to 76%. We further analyze accuracy on statements implicating single rows or multiple rows and columns of tables, on different numerical reasoning subtasks, and on generalizing to detecting errors in statements derived from the ToTTo table-to-text generation dataset.
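
A sketch of the column-level cell rank information mentioned above, under the simplifying assumption that a table is a plain list of rows with numeric (or missing) cells: each cell is annotated with its rank within its column, giving the model ordinal signal without requiring it to parse the numbers. This illustrates the feature itself, not the TAPAS input pipeline.

```python
def column_ranks(table):
    """table: list of rows, each a list of numeric cell values or None."""
    n_cols = len(table[0])
    ranks = [[0] * n_cols for _ in table]  # 0 marks a missing cell
    for c in range(n_cols):
        cells = [(r, row[c]) for r, row in enumerate(table) if row[c] is not None]
        for rank, (r, _) in enumerate(sorted(cells, key=lambda x: x[1]), start=1):
            ranks[r][c] = rank
    return ranks

table = [[3.2, 100.0], [1.5, None], [2.7, 40.0]]
print(column_ranks(table))  # [[3, 2], [1, 0], [2, 1]]
```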

Drivers of English Syntactic Change in the Canadian Parliament
Liwen Hou | David A. Smith
Proceedings of the Society for Computation in Linguistics 2021

Emerging English Transitives over the Last Two Centuries
Liwen Hou | David A. Smith
Proceedings of the Society for Computation in Linguistics 2021

Recovering Lexically and Semantically Reused Texts
Ansel MacLaughlin | Shaobin Xu | David A. Smith
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Writers often repurpose material from existing texts when composing new documents. Because most documents have more than one source, we cannot trace these connections using only models of document-level similarity. Instead, this paper considers methods for local text reuse detection (LTRD), detecting localized regions of lexically or semantically similar text embedded in otherwise unrelated material. In extensive experiments, we study the relative performance of four classes of neural and bag-of-words models on three LTRD tasks – detecting plagiarism, modeling journalists’ use of press releases, and identifying scientists’ citation of earlier papers. We conduct evaluations on three existing datasets and a new, publicly-available citation localization dataset. Our findings shed light on a number of previously-unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on “general” semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.
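
As a concrete point of reference for the bag-of-words end of the model spectrum studied here, a minimal sliding-window sketch of local text reuse detection: each window of the target document is scored by the fraction of its n-grams that also occur in the source passage. Window size, step, n-gram order, and threshold are illustrative assumptions.

```python
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def local_reuse(source_tokens, target_tokens, window=10, step=5, n=3, thresh=0.1):
    src = ngrams(source_tokens, n)
    hits = []
    for start in range(0, max(1, len(target_tokens) - window + 1), step):
        grams = ngrams(target_tokens[start:start + window], n)
        # Containment: share of the window's n-grams also found in the source.
        score = len(grams & src) / len(grams) if grams else 0.0
        if score >= thresh:
            hits.append((start, round(score, 3)))
    return hits

source = "the quick brown fox jumps over the lazy dog".split()
target = ("totally unrelated opening text about something else entirely "
          "the quick brown fox jumps over the lazy dog "
          "and then the document continues with other material").split()
print(local_reuse(source, target))  # windows overlapping the reused sentence score highest
```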

2020

Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer
Maha Alkhairy | Afshan Jafri | David Smith
Proceedings of the Twelfth Language Resources and Evaluation Conference

We describe and evaluate the Finite-State Arabic Morphologizer (FSAM) – a concatenative (prefix-stem-suffix) and templatic (root-pattern) morphologizer that generates and analyzes undiacritized Modern Standard Arabic (MSA) words, and diacritizes them. Our bidirectional unified-architecture finite state machine (FSM) is based on morphotactic MSA grammatical rules. The FSM models the root-pattern structure related to semantics and syntax, making it readily scalable, unlike the stem tabulations in prevailing systems. We evaluate the coverage and accuracy of our model, with coverage being the percentage of words in Tashkeela (a large corpus) that can be analyzed. Accuracy is computed against a gold standard, comprising words and properties, created from the intersection of the UD PADT treebank and Tashkeela. Coverage of analysis (extraction of root and properties from a word) is 82%. Accuracy results are: root computed from a word (92%), word generation from a root (100%), non-root properties of a word (97%), and diacritization (84%). FSAM’s non-root results match or surpass MADAMIRA’s, and root result comparisons are not made because of the concatenative nature of publicly available morphologizers.
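
A toy illustration of the templatic (root-pattern) mechanism that FSAM models with finite-state machinery rather than ad hoc string code: the consonants of a triliteral root are interdigitated into a pattern's numbered slots. The Latin-letter transliteration, patterns, and glosses are simplified for illustration.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """root: three consonants, e.g. 'ktb'; digits 1-3 in the pattern are slots."""
    out = pattern
    for i, consonant in enumerate(root, start=1):
        out = out.replace(str(i), consonant)
    return out

for pattern in ("1a2a3a", "1aa2i3", "ma12uu3"):
    print(apply_pattern("ktb", pattern))
# -> kataba 'he wrote', kaatib 'writer', maktuub 'written'
```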

Detecting de minimis Code-Switching in Historical German Books
Shijia Liu | David Smith
Proceedings of the 28th International Conference on Computational Linguistics

Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching with its appearance in more formal registers, by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.
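
A minimal sketch of the practical task in the final sentences, under assumed features: classify whether a sentence code-switches using only cheap matrix-language cues such as length, punctuation density, and the share of capitalized tokens (German capitalizes all nouns, while embedded Latin or French spans often do not). The features, toy data, and classifier choice are illustrative, not the paper's feature set.

```python
from sklearn.linear_model import LogisticRegression

PUNCT = set('.,;:!?()"')

def features(tokens):
    n = len(tokens)
    return [n,
            sum(t in PUNCT for t in tokens) / n,       # punctuation density
            sum(t[:1].isupper() for t in tokens) / n]  # capitalized-token ratio

train = [  # (tokenized sentence, 1 if it contains an embedded-language span)
    ("Der Gelehrte schrieb : cogito ergo sum .".split(), 1),
    ("Das Wetter war an jenem Tage schön .".split(), 0),
    ("Er zitierte die Worte : veni vidi vici .".split(), 1),
    ("Die Reise dauerte drei lange Wochen .".split(), 0),
]
clf = LogisticRegression().fit([features(t) for t, _ in train],
                               [y for _, y in train])
```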

Tracing Traditions: Automatic Extraction of Isnads from Classical Arabic Texts
Ryan Muther | David Smith
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present our work on automatically detecting isnads, the chains of authorities for a report that serve as citations in hadith and other classical Arabic texts. We experiment with both sequence labeling methods for identifying isnads in a single pass and a hybrid “retrieve-and-tag” approach, in which a retrieval model first identifies portions of the text that are likely to contain start points for isnads, then a sequence labeling model identifies the exact starting locations within these much smaller retrieved text chunks. We find that the usefulness of full-document sequence-to-sequence models is limited by memory constraints and the ineffectiveness of such models at modeling very long documents. We conclude by sketching future improvements on the tagging task and a more in-depth analysis of the people and relationships involved in the social network that influenced the evolution of the written tradition over time.
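
A structural sketch of the “retrieve-and-tag” pipeline, with `score_chunk` and `tag_tokens` as hypothetical stand-ins for the retrieval and sequence labeling models; the chunk size, top-k cutoff, and offset bookkeeping are assumptions about the plumbing.

```python
def retrieve_and_tag(tokens, score_chunk, tag_tokens, chunk=200, top_k=20):
    chunks = [(i, tokens[i:i + chunk]) for i in range(0, len(tokens), chunk)]
    # Stage 1: keep only the chunks the retriever scores as likely isnad starts.
    ranked = sorted(chunks, key=lambda c: score_chunk(c[1]), reverse=True)[:top_k]
    # Stage 2: run the sequence labeler only on those chunks, mapping each
    # locally predicted span back to document-level offsets.
    spans = []
    for offset, chunk_tokens in ranked:
        for start, end in tag_tokens(chunk_tokens):
            spans.append((offset + start, offset + end))
    return sorted(spans)

# Trivial stubs just to show the plumbing; real models replace these.
print(retrieve_and_tag(["w%d" % i for i in range(1000)],
                       score_chunk=lambda ch: len(ch),
                       tag_tokens=lambda ch: [(0, 5)]))
```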

2019

Noisy Neural Language Modeling for Typing Prediction in BCI Communication
Rui Dong | David Smith | Shiran Dudy | Steven Bedrick
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

Language models have broad adoption in predictive typing tasks. When the typing history contains numerous errors, as in open-vocabulary predictive typing with brain-computer interface (BCI) systems, we observe significant performance degradation in both n-gram and recurrent neural network language models trained on clean text. In evaluations of ranking character predictions, training recurrent LMs on noisy text makes them much more robust to noisy histories, even when the error model is misspecified. We also propose an effective strategy for combining evidence from multiple ambiguous histories of BCI electroencephalogram measurements.
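
A minimal sketch of the evidence-combination idea in the last sentence: when earlier characters are uncertain, the next-character distribution becomes a mixture of the language model's predictions under each candidate history, weighted by that history's probability. The toy LM and weights are illustrative.

```python
from collections import defaultdict

def mix_predictions(histories, lm):
    """histories: list of (history_string, prob); lm(h) -> {char: prob}."""
    combined = defaultdict(float)
    for history, weight in histories:
        for ch, p in lm(history).items():
            combined[ch] += weight * p
    return dict(combined)

def toy_lm(history):  # stand-in for the recurrent LM
    return {"e": 0.7, "a": 0.3} if history.endswith("th") else {"e": 0.4, "a": 0.6}

print(mix_predictions([("th", 0.8), ("ta", 0.2)], toy_lm))
# -> {'e': 0.64, 'a': 0.36}
```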

2018

Modeling the Decline in English Passivization
Liwen Hou | David Smith
Proceedings of the Society for Computation in Linguistics (SCiL) 2018

A Multi-Context Character Prediction Model for a Brain-Computer Interface
Shiran Dudy | Shaobin Xu | Steven Bedrick | David Smith
Proceedings of the Second Workshop on Subword/Character LEvel Models

Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeling challenges distinct from other character-entry methods. In particular, the acquired EEG (electroencephalogram) signal is noisier, which, in turn, makes the user’s intent harder to decipher. To adapt to this condition, we propose to maintain an ambiguous history for every time step and to employ, apart from the character language model, word information to produce a more robust prediction system. We present preliminary results that compare this proposed Online-Context Language Model (OCLM) to current algorithms used in this type of setting. Evaluation of both perplexity and predictive accuracy demonstrates promising results in handling ambiguous histories in order to provide the front end with a distribution over the next character the user might type.
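
A sketch of one way to maintain an ambiguous history, as the OCLM description suggests: keep a beam of candidate histories, extend each with every character the noisy signal supports, score with character-level and word-level terms, and prune. Here `lm_score` and `word_bonus` are hypothetical stand-ins, and the beam width and additive log-score are assumptions.

```python
import math

def step(beam, char_probs, lm_score, word_bonus, width=5):
    """beam: list of (history, logprob); char_probs: {char: p} from the EEG decoder."""
    extended = []
    for history, logp in beam:
        for ch, p in char_probs.items():
            h = history + ch
            extended.append((h, logp + math.log(p) + lm_score(h) + word_bonus(h)))
    return sorted(extended, key=lambda x: -x[1])[:width]  # prune to beam width

# Stub scorers just to show the update; real models replace these.
print(step([("th", 0.0)], {"e": 0.7, "a": 0.3},
           lm_score=lambda h: 0.0, word_bonus=lambda h: 0.0, width=2))
```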

Multi-Input Attention for Unsupervised OCR Correction
Rui Dong | David Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.
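
A sketch of the multi-input attention averaging idea: attend from the decoder state into the encoder states of each noisy witness separately, then average the resulting context vectors so the decoder can seek a consensus among the inputs. Dot-product attention and plain averaging are simplifying assumptions about the exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_input_context(decoder_state, encoder_states_per_input):
    """decoder_state: (d,); each element of the list: (len_i, d) for one witness."""
    contexts = []
    for enc in encoder_states_per_input:
        weights = softmax(enc @ decoder_state)  # attend within one witness
        contexts.append(weights @ enc)          # (d,) context vector
    return np.mean(contexts, axis=0)            # consensus across witnesses

rng = np.random.default_rng(0)
witnesses = [rng.normal(size=(n, 8)) for n in (5, 7, 6)]
print(multi_input_context(rng.normal(size=8), witnesses).shape)  # (8,)
```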

2017

Can You See the (Linguistic) Difference? Exploring Mass/Count Distinction in Vision
David Addison Smith | Sandro Pezzelle | Francesca Franzon | Chiara Zanini | Raffaella Bernardi
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

Online Multilingual Topic Models with Multi-Level Hyperpriors
Kriste Krstovski | David Smith | Michael J. Kurtz
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora
Kriste Krstovski | David Smith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Detecting and Evaluating Local Text Reuse in Social Networks
Shaobin Xu | David Smith | Abigail Mullen | Ryan Cordell
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media

2013

Online Polylingual Topic Models for Fast Document Translation Detection
Kriste Krstovski | David A. Smith
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

Grammarless Parsing for Joint Inference
Jason Naradowsky | Tim Vieira | David Smith
Proceedings of COLING 2012

A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases
Michael Bendersky | David Smith
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

Discovering Factions in the Computational Linguistics Community
Yanchuan Sim | Noah A. Smith | David A. Smith
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Parse, Price and Cut—Delayed Column and Row Generation for Graph Based Parsers
Sebastian Riedel | David Smith | Andrew McCallum
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Improving NLP through Marginalization of Hidden Syntactic Structure
Jason Naradowsky | Sebastian Riedel | David Smith
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Joint Annotation of Search Queries
Michael Bendersky | W. Bruce Croft | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
John Lee | Jason Naradowsky | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs
Kriste Krstovski | David A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Relaxed Marginal Inference and its Application to Dependency Parsing
Sebastian Riedel | David A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

Parser Adaptation and Projection with Quasi-Synchronous Grammar Features
David A. Smith | Jason Eisner
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Polylingual Topic Models
David Mimno | Hanna M. Wallach | Jason Naradowsky | David A. Smith | Andrew McCallum
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

Dependency Parsing by Belief Propagation
David Smith | Jason Eisner
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

HotSpots: Visualizing Edits to a Text
Srinivas Bangalore | David Smith
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Probabilistic Models of Nonprojective Dependency Trees
David A. Smith | Noah A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors
David A. Smith | Jason Eisner
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Log-Linear Models of Non-Projective Trees, k-best MST Parsing and Tree-Ranking
Keith Hall | Jiří Havelka | David A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Minimum Risk Annealing for Training Log-Linear Models
David A. Smith | Jason Eisner
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Vine Parsing and Minimum Risk Reranking for Speed and Precision
Markus Dreyer | David A. Smith | Noah A. Smith
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies
David Smith | Jason Eisner
Proceedings on the Workshop on Statistical Machine Translation

An Overview of Statistical Machine Translation
David Smith | Charles Schafer
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Tutorials

2005

Context-Based Morphological Disambiguation with Random Fields
Noah A. Smith | David A. Smith | Roy W. Tromble
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

A Smorgasbord of Features for Statistical Machine Translation
Franz Josef Och | Daniel Gildea | Sanjeev Khudanpur | Anoop Sarkar | Kenji Yamada | Alex Fraser | Shankar Kumar | Libin Shen | David Smith | Katherine Eng | Viren Jain | Zhen Jin | Dragomir Radev
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

Bilingual Parsing with Factored Estimation: Using English to Parse Korean
David A. Smith | Noah A. Smith
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

Bootstrapping toponym classifiers
David A. Smith | Gideon S. Mann
Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References

1986

Translation practice in Europe
David Smith
Proceedings of Translating and the Computer 8: A profession on the move