2024
Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis
Hale Sirin | Sabrina Li | Thomas Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
In this study, we present a generalizable workflow to identify documents in a historic language with a nonstandard language and script combination, Armeno-Turkish. We introduce the task of detecting distinct patterns of multilinguality based on the frequency of structured language alternations within a document.
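A minimal sketch of the general idea, not the authors' implementation: given per-window language labels from any off-the-shelf language identifier (the labels and the target code "tur" below are hypothetical), a discrete Fourier transform of the alternation signal reveals whether one language recurs at a regular period within the document.

```python
import numpy as np

def alternation_peak(labels, target="tur"):
    """Return (cycles per window, relative strength) of the strongest
    periodic alternation of `target` vs. all other labels."""
    signal = np.array([1.0 if lab == target else 0.0 for lab in labels])
    signal -= signal.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0)  # cycles per window
    k = int(np.argmax(spectrum[1:]) + 1)         # skip the zero-frequency bin
    return freqs[k], spectrum[k] / (spectrum.sum() + 1e-12)

# Toy usage: a document alternating languages every other window
# shows a strong peak at 0.5 cycles per window.
labels = ["tur", "hye"] * 32
print(alternation_peak(labels))
```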
Dynamic embedded topic models and change-point detection for exploring literary-historical hypotheses
Hale Sirin | Thomas Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
We present a novel combination of dynamic embedded topic models and change-point detection to explore diachronic change of lexical semantic modality in classical and early Christian Latin. We demonstrate several methods for finding and characterizing patterns in the output, and relating them to traditional scholarship in Comparative Literature and Classics. This simple approach to unsupervised models of semantic change can be applied to any suitable corpus, and we conclude with future directions and refinements aiming to allow noisier, less-curated materials to meet that threshold.
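A minimal sketch of the change-point step only, under assumed inputs: suppose a dynamic embedded topic model has already produced a per-time-slice weight for one topic (the synthetic series below stands in for that output). Offline change-point detection over the series flags candidate shifts for closer reading; the `ruptures` library is used here purely for illustration.

```python
import numpy as np
import ruptures as rpt  # pip install ruptures

rng = np.random.default_rng(0)
# Hypothetical topic weight per time slice, with a shift mid-series.
theta = np.concatenate([rng.normal(0.2, 0.02, 40),
                        rng.normal(0.5, 0.02, 40)])

algo = rpt.Pelt(model="rbf").fit(theta.reshape(-1, 1))
breakpoints = algo.predict(pen=5)  # indices of detected change points
print(breakpoints)                 # e.g. [40, 80]
```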
Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
Craig Messner | Thomas Lippincott
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding “standard” word pair. We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in the light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge literary orthographic variation poses to string pairing methodologies.
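A minimal sketch of one design choice discussed above, negative training sample generation, under invented data: the variant/standard pairs are hypothetical, and plain string similarity stands in for the neural edit distance scorer.

```python
import random
from difflib import SequenceMatcher

pairs = [("gwine", "going"), ("sho", "sure"), ("dat", "that")]  # hypothetical pairs
standards = [std for _, std in pairs]

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def random_negative(variant, gold):
    # Strategy 1: any standard word other than the gold pairing.
    return random.choice([s for s in standards if s != gold])

def hard_negative(variant, gold):
    # Strategy 2: the most string-similar non-gold standard word;
    # harder negatives tend to sharpen the pairing decision boundary.
    return max((s for s in standards if s != gold),
               key=lambda s: similarity(variant, s))

for variant, gold in pairs:
    print(variant, gold, random_negative(variant, gold), hard_negative(variant, gold))
```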
Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner | Thomas Lippincott
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees under particular language modeling assumptions. Specifically, we find evidence that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
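A minimal sketch of the modeling assumption in question, not the paper's experiments: a subword tokenizer can segment an orthographic variant quite differently from its standard form, whereas a character-level model such as CANINE consumes the raw codepoints either way. The word pair below is a hypothetical example.

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["going", "gwine"]:       # hypothetical standard/variant pair
    print(word, bert.tokenize(word))  # subword splits differ across the pair

# CANINE-style character-level view: raw Unicode codepoints, no subword merges.
print([ord(c) for c in "gwine"])
```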
2021
Active learning and negative evidence for language identification
Thomas Lippincott | Ben Van Durme
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Language identification (LID), the task of determining the natural language of a given text, is an essential first step in most NLP pipelines. While generally a solved problem for documents of sufficient length and languages with ample training data, the proliferation of microblogs and other social media has made it increasingly common to encounter use-cases that *don’t* satisfy these conditions. In these situations, the fundamental difficulty is the lack of, and cost of gathering, labeled data: unlike some annotation tasks, no single “expert” can quickly and reliably identify more than a handful of languages. This leads to a natural question: can we gain useful information when annotators are only able to *rule out* languages for a given document, rather than supply a positive label? What are the optimal choices for gathering and representing such *negative evidence* as a model is trained? In this paper, we demonstrate that using negative evidence can improve the performance of a simple neural LID model. This improvement is sensitive to policies of how the evidence is represented in the loss function, and for deciding which annotators to employ given the instance and model state. We consider simple policies and report experimental results that indicate the optimal choices for this task. We conclude with a discussion of future work to determine if and how the results generalize to other classification tasks.
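A minimal sketch of one way negative evidence could enter the loss; the paper compares several representation policies, and this is only an illustrative choice: when an annotator rules out a set of languages, penalize the probability mass assigned to that set, i.e. maximize the log-mass left on the remaining languages.

```python
import torch
import torch.nn.functional as F

def negative_evidence_loss(logits, ruled_out_mask):
    """logits: (batch, n_langs); ruled_out_mask: (batch, n_langs),
    1 where an annotator has ruled the language out."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Log of the total probability on languages that were NOT ruled out.
    allowed = log_probs.masked_fill(ruled_out_mask.bool(), float("-inf"))
    return -torch.logsumexp(allowed, dim=-1).mean()

# Toy usage: 3 languages, annotator rules out language 0 for one document.
logits = torch.tensor([[2.0, 0.5, 0.1]])
mask = torch.tensor([[1, 0, 0]])
print(negative_evidence_loss(logits, mask))
```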
2014
Unsupervised Morphology-Based Vocabulary Expansion
Mohammad Sadegh Rasooli | Thomas Lippincott | Nizar Habash | Owen Rambow
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2012
Learning Syntactic Verb Frames using Graphical Models
Thomas Lippincott | Anna Korhonen | Diarmuid Ó Séaghdha
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)