We present a novel tool for teaching about, and interfacing with, the information-theoretic modeling abilities of large language models. The Surprisal Toolkit allows students from diverse linguistic and programming backgrounds to learn about information-theoretic measures and natural language processing (NLP) through an online interactive tool. In addition, the interface provides a valuable research mechanism for obtaining measures of surprisal. We implement the toolkit as part of a classroom tutorial in three different learning scenarios and discuss the overall positive student feedback. We suggest this toolkit and similar applications as useful supplements to instruction in NLP topics, especially for balancing conceptual understanding with technical instruction, grounding abstract topics, and engaging students with varying coding abilities.
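For readers unfamiliar with the measure, surprisal is the negative log probability of an event, so a word with probability p has a surprisal of -log2(p) bits. The following is a minimal sketch of that definition only; it is not the Surprisal Toolkit's implementation. The function name `surprisal_bits` and the next-word distribution are hypothetical and chosen for illustration.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Return the surprisal, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical next-word distribution for the context "The cat sat on the".
# A real language model would supply these probabilities.
next_word_probs = {"mat": 0.5, "sofa": 0.25, "roof": 0.125, "moon": 0.125}

# Less probable words carry more surprisal.
surprisals = {word: surprisal_bits(p) for word, p in next_word_probs.items()}

# Entropy is the expected surprisal under the same distribution.
entropy = sum(p * surprisal_bits(p) for p in next_word_probs.values())

print(surprisals)
print(entropy)
```

In this toy distribution the most expected word ("mat") has the lowest surprisal, and the entropy summarizes how uncertain the context leaves the next word overall.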
This paper introduces a corpus for the novel task of presupposition generation, a natural language generation problem in which a model produces a list of the presuppositions carried by a given input sentence; in the context of this research, the input is a cross-examination question. Two datasets, PECaN (Presupposition, Entailment, Contradiction and Neutral) and PGen (Presupposition Generation), are designed to fine-tune existing BERT (CITATION) and T5 (CITATION) models for classification and generation tasks. Several corpus construction methods are proposed, ranging from manual annotation and prompting the GPT 3.0 model to augmenting data from existing corpora. The fine-tuned models achieved high accuracy on the novel Presupposition as Natural Language Inference (PNLI) task, which extends traditional Natural Language Inference (NLI) by incorporating instances of presupposition into classification. T5 outperforms BERT by a broad margin, achieving an overall accuracy of 84.35% compared to BERT's 71.85%, and in particular when classifying presuppositions (93% vs. 73%, respectively). Regarding presupposition generation, we observed that despite the limited amount of data used for fine-tuning, the model displays an emerging proficiency in generating presuppositions, reaching ROUGE scores of 43.47 and adhering to systematic patterns that mirror valid strategies for presupposition generation, although it failed to generate complete lists.
We present a cross-linguistic study of vowel harmony that aims to quantify this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word forms in the analysis of vowel harmony. In contrast, we train our models on cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consist of word lists with a maximum of 1000 entries per language. Although the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research and offers new possibilities for future studies on low-resource, under-studied languages.
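To make the idea of vowel predictability concrete, the sketch below computes how many bits of surprisal are saved, on average, when the previous vowel in a word is known. It is a toy illustration under strong assumptions, not the measure or the neural PLMs used in the study: it swaps in a count-based bigram estimate over orthographic vowels, and the vowel inventory, the function names, and both word lists are invented for the example.

```python
import math
from collections import Counter

# Assumption: a toy orthographic vowel inventory, not a phonemic one.
VOWELS = set("aeiouyäöü")


def vowel_pairs(words):
    """Yield (previous_vowel, next_vowel) pairs from each word's vowel sequence."""
    for word in words:
        tier = [ch for ch in word if ch in VOWELS]
        yield from zip(tier, tier[1:])


def harmony_gain_bits(words):
    """Average drop in vowel surprisal (bits) from knowing the previous vowel.

    This equals the empirical mutual information between adjacent vowels,
    estimated from raw counts without smoothing.
    """
    pairs = list(vowel_pairs(words))
    if not pairs:
        raise ValueError("no vowel pairs found")
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    next_counts = Counter(nxt for _, nxt in pairs)
    total = len(pairs)
    gain = 0.0
    for (prev, nxt), count in pair_counts.items():
        p_uncond = next_counts[nxt] / total    # P(next vowel)
        p_cond = count / prev_counts[prev]     # P(next vowel | previous vowel)
        gain += (count / total) * (math.log2(p_cond) - math.log2(p_uncond))
    return gain


# Invented pseudo-words: vowels always agree in the first list, never
# systematically in the second.
harmonic_words = ["kata", "tete", "lala", "nene"]
mixed_words = ["kate", "teta", "lala", "nene"]

print(harmony_gain_bits(harmonic_words))
print(harmony_gain_bits(mixed_words))
```

A higher value means the previous vowel is more informative about the next one, which is the intuition behind a predictability-based harmonicity score. Count-based estimates like this one are biased on small samples, which is one reason a smoothed or neural model is preferable on real word lists.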