David A. Smith

Also published as: David Addison Smith, David Smith


2023

Detecting Syntactic Change with Pre-trained Transformer Models
Liwen Hou | David Smith
Findings of the Association for Computational Linguistics: EMNLP 2023

We investigate the ability of Transformer-based language models to find syntactic differences between the English of the early 1800s and that of the late 1900s. First, we show that a fine-tuned BERT model can distinguish between text from these two periods using syntactic information only; to show this, we employ a strategy to hide semantic information from the text. Second, we make further use of fine-tuned BERT models to identify specific instances of syntactic change and specific words for which a new part of speech was introduced. To do this, we employ an automatic part-of-speech (POS) tagger and use it to train corpus-specific taggers based only on BERT representations pretrained on different corpora. Notably, our methods of identifying specific candidates for syntactic change avoid using any automatic POS tagger on old text, where its performance may be unreliable; instead, our methods use only untagged old text together with tagged modern text. We examine samples and distributional properties of the model output to validate automatically identified cases of syntactic change. Finally, we use our techniques to confirm the historical rise of the progressive construction, a known example of syntactic change.
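
As an illustration of the semantic-hiding step, the following is a minimal sketch of one plausible implementation: open-class words are replaced by their POS tags while function words are kept, so that a classifier fine-tuned on the result can rely only on syntactic cues. The tag inventory and the choice of which classes to keep are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal semantic-hiding sketch: keep closed-class (function) words,
# replace everything else with its POS tag. Assumes nltk's 'punkt' tokenizer
# and 'averaged_perceptron_tagger' data have been downloaded.
import nltk

# Penn Treebank closed-class tags left verbatim (an illustrative choice).
KEEP = {"DT", "IN", "CC", "TO", "MD", "PRP", "PRP$", "WDT", "WP", "EX", "RP"}

def hide_semantics(sentence: str) -> str:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return " ".join(tok if tag in KEEP else tag for tok, tag in tagged)

print(hide_semantics("The committee was examining the new proposal."))
# -> "The NN VBD VBG the JJ NN ."
```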

Composition and Deformance: Measuring Imageability with a Text-to-Image Model
Si Wu | David Smith
Proceedings of the 5th Workshop on Narrative Understanding

Although psycholinguists and psychologists have long studied the tendency of linguistic strings to evoke mental images in hearers or readers, most computational studies have applied this concept of imageability only to isolated words. Using recent developments in text-to-image generation models, such as DALL-E mini, we propose computational methods that use generated images to measure the imageability of both single English words and connected text. We sample text prompts for image generation from three corpora: human-generated image captions, news article sentences, and poem lines. We subject these prompts to different deformances to examine the model’s ability to detect changes in imageability caused by compositional change. We find a high correlation between the proposed computational measures of imageability and human judgments of individual words. We also find that the proposed measures respond more consistently to changes in compositionality than baseline approaches. We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
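
A minimal sketch of an image-based imageability measure in the spirit described above: generate several images for a prompt and score how consistent they are with one another, on the intuition that highly imageable text yields similar generations. Here `generate_images` and `embed_image` are hypothetical stand-ins for a text-to-image model and an image encoder (e.g. CLIP), and mean pairwise cosine similarity is an assumed scoring choice rather than the paper's exact measure.

```python
# Score a prompt's imageability as the mean pairwise cosine similarity of
# embeddings of k generated images; more consistent generations score higher.
import itertools
import numpy as np

def imageability(prompt, generate_images, embed_image, k=8):
    embs = [embed_image(img) for img in generate_images(prompt, n=k)]
    embs = [e / np.linalg.norm(e) for e in embs]  # unit-normalize
    sims = [float(a @ b) for a, b in itertools.combinations(embs, 2)]
    return float(np.mean(sims))
```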

2021

Content-based Models of Quotation
Ansel MacLaughlin | David Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e. the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g. poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTa sequential sentence tagger, achieving an average rho of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.
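
For reference, a self-contained sketch of the two reported metrics, Spearman's rho and NDCG@k, as they would apply to a model's quotability scores for one document's passages. The toy relevance values (e.g. how often each passage was later quoted) are illustrative, and the rank computation assumes no tied values.

```python
import math

def ndcg_at_k(true_rel, pred_scores, k):
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    dcg = sum(true_rel[i] / math.log2(rank + 2) for rank, i in enumerate(order[:k]))
    ideal = sorted(true_rel, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def spearman_rho(xs, ys):  # assumes no ties in either list
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

true_rel = [3, 0, 1, 4, 2]          # e.g. quotation counts per passage
scores = [0.9, 0.1, 0.4, 0.2, 0.7]  # model's predicted quotability
print(spearman_rho(scores, true_rel), ndcg_at_k(true_rel, scores, 5))
```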

Structural Encoding and Pre-training Matter: Adapting BERT for Table-Based Fact Verification
Rui Dong | David Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Growing concern with online misinformation has encouraged NLP research on fact verification. Since writers often base their assertions on structured data, we focus here on verifying textual statements given evidence in tables. Starting from the Table Parsing (TAPAS) model developed for question answering (Herzig et al., 2020), we find that modeling table structure improves a language model pre-trained on unstructured text. Pre-training language models on English Wikipedia table data further improves performance. Pre-training on a question answering task with column-level cell rank information achieves the best performance. With improved pre-training and cell embeddings, this approach outperforms the state-of-the-art Numerically-aware Graph Neural Network table fact verification model (GNN-TabFact), increasing statement classification accuracy from 72.2% to 73.9% even without modeling numerical information. Incorporating numerical information with cell rankings and pre-training on a question answering task increases accuracy to 76%. We further analyze accuracy on statements implicating single rows or multiple rows and columns of tables, on different numerical reasoning subtasks, and on generalizing to detecting errors in statements derived from the ToTTo table-to-text generation dataset.
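
A sketch of the column-level cell rank information mentioned above, under the simplifying assumption that a table is a plain list of rows with numeric (or missing) cells: each cell is annotated with its rank within its column, giving the model ordinal signal without requiring it to parse the numbers. This illustrates the feature itself, not the TAPAS input pipeline.

```python
def column_ranks(table):
    """table: list of rows, each a list of numeric cell values or None."""
    n_cols = len(table[0])
    ranks = [[0] * n_cols for _ in table]  # 0 marks a missing cell
    for c in range(n_cols):
        cells = [(r, row[c]) for r, row in enumerate(table) if row[c] is not None]
        for rank, (r, _) in enumerate(sorted(cells, key=lambda x: x[1]), start=1):
            ranks[r][c] = rank
    return ranks

table = [[3.2, 100.0], [1.5, None], [2.7, 40.0]]
print(column_ranks(table))  # [[3, 2], [1, 0], [2, 1]]
```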

Drivers of English Syntactic Change in the Canadian Parliament
Liwen Hou | David A. Smith
Proceedings of the Society for Computation in Linguistics 2021

Emerging English Transitives over the Last Two Centuries
Liwen Hou | David A. Smith
Proceedings of the Society for Computation in Linguistics 2021

Recovering Lexically and Semantically Reused Texts
Ansel MacLaughlin | Shaobin Xu | David A. Smith
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Writers often repurpose material from existing texts when composing new documents. Because most documents have more than one source, we cannot trace these connections using only models of document-level similarity. Instead, this paper considers methods for local text reuse detection (LTRD), detecting localized regions of lexically or semantically similar text embedded in otherwise unrelated material. In extensive experiments, we study the relative performance of four classes of neural and bag-of-words models on three LTRD tasks – detecting plagiarism, modeling journalists’ use of press releases, and identifying scientists’ citation of earlier papers. We conduct evaluations on three existing datasets and a new, publicly-available citation localization dataset. Our findings shed light on a number of previously-unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on “general” semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.
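
As a concrete point of reference for the bag-of-words end of the model spectrum studied here, a minimal sliding-window sketch of local text reuse detection: each window of the target document is scored by the fraction of its n-grams that also occur in the source passage. Window size, step, n-gram order, and threshold are illustrative assumptions.

```python
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def local_reuse(source_tokens, target_tokens, window=10, step=5, n=3, thresh=0.1):
    src = ngrams(source_tokens, n)
    hits = []
    for start in range(0, max(1, len(target_tokens) - window + 1), step):
        grams = ngrams(target_tokens[start:start + window], n)
        # Containment: share of the window's n-grams also found in the source.
        score = len(grams & src) / len(grams) if grams else 0.0
        if score >= thresh:
            hits.append((start, round(score, 3)))
    return hits

source = "the quick brown fox jumps over the lazy dog".split()
target = ("totally unrelated opening text about something else entirely "
          "the quick brown fox jumps over the lazy dog "
          "and then the document continues with other material").split()
print(local_reuse(source, target))  # windows overlapping the reused sentence score highest
```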

2020

Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer
Maha Alkhairy | Afshan Jafri | David Smith
Proceedings of the Twelfth Language Resources and Evaluation Conference

We describe and evaluate the Finite-State Arabic Morphologizer (FSAM) – a concatenative (prefix-stem-suffix) and templatic (root-pattern) morphologizer that generates and analyzes undiacritized Modern Standard Arabic (MSA) words, and diacritizes them. Our bidirectional unified-architecture finite state machine (FSM) is based on morphotactic MSA grammatical rules. The FSM models the root-pattern structure related to semantics and syntax, making it readily scalable, unlike the stem tabulations in prevailing systems. We evaluate the coverage and accuracy of our model, with coverage being the percentage of words in Tashkeela (a large corpus) that can be analyzed. Accuracy is computed against a gold standard, comprising words and properties, created from the intersection of the UD PADT treebank and Tashkeela. Coverage of analysis (extraction of root and properties from a word) is 82%. Accuracy results are: root computed from a word (92%), word generation from a root (100%), non-root properties of a word (97%), and diacritization (84%). FSAM’s non-root results match or surpass MADAMIRA’s, and root result comparisons are not made because of the concatenative nature of publicly available morphologizers.
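
A toy illustration of the templatic (root-pattern) mechanism that FSAM models with finite-state machinery rather than ad hoc string code: the consonants of a triliteral root are interdigitated into a pattern's numbered slots. The Latin-letter transliteration, patterns, and glosses are simplified for illustration.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """root: three consonants, e.g. 'ktb'; digits 1-3 in the pattern are slots."""
    out = pattern
    for i, consonant in enumerate(root, start=1):
        out = out.replace(str(i), consonant)
    return out

for pattern in ("1a2a3a", "1aa2i3", "ma12uu3"):
    print(apply_pattern("ktb", pattern))
# -> kataba 'he wrote', kaatib 'writer', maktuub 'written'
```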

Detecting de minimis Code-Switching in Historical German Books
Shijia Liu | David Smith
Proceedings of the 28th International Conference on Computational Linguistics

Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching with its appearance in more formal registers, by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.
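
A minimal sketch of the practical task in the final sentences, under assumed features: classify whether a sentence code-switches using only cheap matrix-language cues such as length, punctuation density, and the share of capitalized tokens (German capitalizes all nouns, while embedded Latin or French spans often do not). The features, toy data, and classifier choice are illustrative, not the paper's feature set.

```python
from sklearn.linear_model import LogisticRegression

PUNCT = set('.,;:!?()"')

def features(tokens):
    n = len(tokens)
    return [n,
            sum(t in PUNCT for t in tokens) / n,       # punctuation density
            sum(t[:1].isupper() for t in tokens) / n]  # capitalized-token ratio

train = [  # (tokenized sentence, 1 if it contains an embedded-language span)
    ("Der Gelehrte schrieb : cogito ergo sum .".split(), 1),
    ("Das Wetter war an jenem Tage schön .".split(), 0),
    ("Er zitierte die Worte : veni vidi vici .".split(), 1),
    ("Die Reise dauerte drei lange Wochen .".split(), 0),
]
clf = LogisticRegression().fit([features(t) for t, _ in train],
                               [y for _, y in train])
```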

Tracing Traditions: Automatic Extraction of Isnads from Classical Arabic Texts
Ryan Muther | David Smith
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present our work on automatically detecting isnads, the chains of authorities for a report that serve as citations in hadith and other classical Arabic texts. We experiment with both sequence labeling methods for identifying isnads in a single pass and a hybrid “retrieve-and-tag” approach, in which a retrieval model first identifies portions of the text that are likely to contain start points for isnads, then a sequence labeling model identifies the exact starting locations within these much smaller retrieved text chunks. We find that the usefulness of full-document sequence-to-sequence models is limited by memory constraints and the ineffectiveness of such models at modeling very long documents. We conclude by sketching future improvements on the tagging task and a more in-depth analysis of the people and relationships involved in the social network that influenced the evolution of the written tradition over time.
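
A structural sketch of the “retrieve-and-tag” pipeline, with `score_chunk` and `tag_tokens` as hypothetical stand-ins for the retrieval and sequence labeling models; the chunk size, top-k cutoff, and offset bookkeeping are assumptions about the plumbing.

```python
def retrieve_and_tag(tokens, score_chunk, tag_tokens, chunk=200, top_k=20):
    chunks = [(i, tokens[i:i + chunk]) for i in range(0, len(tokens), chunk)]
    # Stage 1: keep only the chunks the retriever scores as likely isnad starts.
    ranked = sorted(chunks, key=lambda c: score_chunk(c[1]), reverse=True)[:top_k]
    # Stage 2: run the sequence labeler only on those chunks, mapping each
    # locally predicted span back to document-level offsets.
    spans = []
    for offset, chunk_tokens in ranked:
        for start, end in tag_tokens(chunk_tokens):
            spans.append((offset + start, offset + end))
    return sorted(spans)

# Trivial stubs just to show the plumbing; real models replace these.
print(retrieve_and_tag(["w%d" % i for i in range(1000)],
                       score_chunk=lambda ch: len(ch),
                       tag_tokens=lambda ch: [(0, 5)]))
```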

2019

Noisy Neural Language Modeling for Typing Prediction in BCI Communication
Rui Dong | David Smith | Shiran Dudy | Steven Bedrick
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

Language models have broad adoption in predictive typing tasks. When the typing history contains numerous errors, as in open-vocabulary predictive typing with brain-computer interface (BCI) systems, we observe significant performance degradation in both n-gram and recurrent neural network language models trained on clean text. In evaluations of ranking character predictions, training recurrent LMs on noisy text makes them much more robust to noisy histories, even when the error model is misspecified. We also propose an effective strategy for combining evidence from multiple ambiguous histories of BCI electroencephalogram measurements.
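
A minimal sketch of the evidence-combination idea in the last sentence: when earlier characters are uncertain, the next-character distribution becomes a mixture of the language model's predictions under each candidate history, weighted by that history's probability. The toy LM and weights are illustrative.

```python
from collections import defaultdict

def mix_predictions(histories, lm):
    """histories: list of (history_string, prob); lm(h) -> {char: prob}."""
    combined = defaultdict(float)
    for history, weight in histories:
        for ch, p in lm(history).items():
            combined[ch] += weight * p
    return dict(combined)

def toy_lm(history):  # stand-in for the recurrent LM
    return {"e": 0.7, "a": 0.3} if history.endswith("th") else {"e": 0.4, "a": 0.6}

print(mix_predictions([("th", 0.8), ("ta", 0.2)], toy_lm))
# -> {'e': 0.64, 'a': 0.36}
```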

2018

Modeling the Decline in English Passivization
Liwen Hou | David Smith
Proceedings of the Society for Computation in Linguistics (SCiL) 2018

A Multi-Context Character Prediction Model for a Brain-Computer Interface
Shiran Dudy | Shaobin Xu | Steven Bedrick | David Smith
Proceedings of the Second Workshop on Subword/Character LEvel Models

Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeling challenges distinct from other character-entry methods. In particular, the acquired EEG (electroencephalogram) signal is noisier, which, in turn, makes the user’s intent harder to decipher. To adapt to this condition, we propose to maintain an ambiguous history for every time step and to employ, apart from the character language model, word information to produce a more robust prediction system. We present preliminary results that compare this proposed Online-Context Language Model (OCLM) to current algorithms used in this type of setting. Evaluation of both perplexity and predictive accuracy demonstrates promising results in handling ambiguous histories in order to provide the front end with a distribution over the next character the user might type.
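
A sketch of one way to maintain an ambiguous history, as the OCLM description suggests: keep a beam of candidate histories, extend each with every character the noisy signal supports, score with character-level and word-level terms, and prune. Here `lm_score` and `word_bonus` are hypothetical stand-ins, and the beam width and additive log-score are assumptions.

```python
import math

def step(beam, char_probs, lm_score, word_bonus, width=5):
    """beam: list of (history, logprob); char_probs: {char: p} from the EEG decoder."""
    extended = []
    for history, logp in beam:
        for ch, p in char_probs.items():
            h = history + ch
            extended.append((h, logp + math.log(p) + lm_score(h) + word_bonus(h)))
    return sorted(extended, key=lambda x: -x[1])[:width]  # prune to beam width

# Stub scorers just to show the update; real models replace these.
print(step([("th", 0.0)], {"e": 0.7, "a": 0.3},
           lm_score=lambda h: 0.0, word_bonus=lambda h: 0.0, width=2))
```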

Multi-Input Attention for Unsupervised OCR Correction
Rui Dong | David Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.
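
A sketch of the multi-input attention averaging idea: attend from the decoder state into the encoder states of each noisy witness separately, then average the resulting context vectors so the decoder can seek a consensus among the inputs. Dot-product attention and plain averaging are simplifying assumptions about the exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_input_context(decoder_state, encoder_states_per_input):
    """decoder_state: (d,); each element of the list: (len_i, d) for one witness."""
    contexts = []
    for enc in encoder_states_per_input:
        weights = softmax(enc @ decoder_state)  # attend within one witness
        contexts.append(weights @ enc)          # (d,) context vector
    return np.mean(contexts, axis=0)            # consensus across witnesses

rng = np.random.default_rng(0)
witnesses = [rng.normal(size=(n, 8)) for n in (5, 7, 6)]
print(multi_input_context(rng.normal(size=8), witnesses).shape)  # (8,)
```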

2017

Can You See the (Linguistic) Difference? Exploring Mass/Count Distinction in Vision
David Addison Smith | Sandro Pezzelle | Francesca Franzon | Chiara Zanini | Raffaella Bernardi
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

Online Multilingual Topic Models with Multi-Level Hyperpriors
Kriste Krstovski | David Smith | Michael J. Kurtz
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora
Kriste Krstovski | David Smith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Detecting and Evaluating Local Text Reuse in Social Networks
Shaobin Xu | David Smith | Abigail Mullen | Ryan Cordell
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media

2013

Online Polylingual Topic Models for Fast Document Translation Detection
Kriste Krstovski | David A. Smith
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

Grammarless Parsing for Joint Inference
Jason Naradowsky | Tim Vieira | David Smith
Proceedings of COLING 2012

A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases
Michael Bendersky | David Smith
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

Discovering Factions in the Computational Linguistics Community
Yanchuan Sim | Noah A. Smith | David A. Smith
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Parse, Price and Cut—Delayed Column and Row Generation for Graph Based Parsers
Sebastian Riedel | David Smith | Andrew McCallum
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Improving NLP through Marginalization of Hidden Syntactic Structure
Jason Naradowsky | Sebastian Riedel | David Smith
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Joint Annotation of Search Queries
Michael Bendersky | W. Bruce Croft | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
John Lee | Jason Naradowsky | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs
Kriste Krstovski | David A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Relaxed Marginal Inference and its Application to Dependency Parsing
Sebastian Riedel | David A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

Parser Adaptation and Projection with Quasi-Synchronous Grammar Features
David A. Smith | Jason Eisner
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Polylingual Topic Models
David Mimno | Hanna M. Wallach | Jason Naradowsky | David A. Smith | Andrew McCallum
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

Dependency Parsing by Belief Propagation
David Smith | Jason Eisner
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

HotSpots: Visualizing Edits to a Text
Srinivas Bangalore | David Smith
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Probabilistic Models of Nonprojective Dependency Trees
David A. Smith | Noah A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors
David A. Smith | Jason Eisner
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Log-Linear Models of Non-Projective Trees, k-best MST Parsing and Tree-Ranking
Keith Hall | Jiří Havelka | David A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Minimum Risk Annealing for Training Log-Linear Models
David A. Smith | Jason Eisner
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Vine Parsing and Minimum Risk Reranking for Speed and Precision
Markus Dreyer | David A. Smith | Noah A. Smith
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies
David Smith | Jason Eisner
Proceedings on the Workshop on Statistical Machine Translation

An Overview of Statistical Machine Translation
David Smith | Charles Schafer
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Tutorials

2005

Context-Based Morphological Disambiguation with Random Fields
Noah A. Smith | David A. Smith | Roy W. Tromble
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

A Smorgasbord of Features for Statistical Machine Translation
Franz Josef Och | Daniel Gildea | Sanjeev Khudanpur | Anoop Sarkar | Kenji Yamada | Alex Fraser | Shankar Kumar | Libin Shen | David Smith | Katherine Eng | Viren Jain | Zhen Jin | Dragomir Radev
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

Bilingual Parsing with Factored Estimation: Using English to Parse Korean
David A. Smith | Noah A. Smith
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

Bootstrapping toponym classifiers
David A. Smith | Gideon S. Mann
Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References

1986

Translation practice in Europe
David Smith
Proceedings of Translating and the Computer 8: A profession on the move