2023
pdf
bib
abs
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information
Jaehyung Kim
|
Yekyung Kim
|
Karin de Langis
|
Jinwoo Shin
|
Dongyeop Kang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The success of NLP systems often relies on the availability of large, high-quality datasets. However, not all samples in these datasets are equally valuable for learning, as some may be redundant or noisy. Several methods for characterizing datasets based on model-driven meta-information (e.g., model’s confidence) have been developed, but the relationship and complementary effects of these methods have received less attention. In this paper, we introduce infoVerse, a universal framework for dataset characterization, which provides a new feature space that effectively captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. infoVerse reveals distinctive regions of the dataset that are not apparent in the original semantic space, hence guiding users (or models) in identifying which samples to focus on for exploration, assessment, or annotation. Additionally, we propose a novel sampling method on infoVerse to select a set of data points that maximizes informativeness. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines in all applications. Our code and demo are publicly available.
pdf
bib
abs
A Comparative Study on Textual Saliency of Styles from Eye Tracking, Annotations, and Language Models
Karin de Langis
|
Dongyeop Kang
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
There is growing interest in incorporating eye-tracking data and other implicit measures of human language processing into natural language processing (NLP) pipelines. The data from human language processing contain unique insight into human linguistic understanding that could be exploited by language models. However, many unanswered questions remain about the nature of this data and how it can best be utilized in downstream NLP tasks. In this paper, we present EyeStyliency, an eye-tracking dataset for human processing of stylistic text (e.g., politeness). We develop an experimental protocol to collect these style-specific eye movements. We further investigate how this saliency data compares to both human annotation methods and model-based interpretability metrics. We find that while eye-tracking data is unique, it also intersects with both human annotations and model-based importance scores, providing a possible bridge between human- and machine-based perspectives. We propose utilizing this type of data to evaluate the cognitive plausibility of models that interpret style. Our eye-tracking data and processing code are publicly available.
pdf
bib
abs
An Analysis of Reader Engagement in Literary Fiction through Eye Tracking and Linguistic Features
Rose Neis
|
Karin De Langis
|
Zae Myung Kim
|
Dongyeop Kang
Proceedings of the 5th Workshop on Narrative Understanding
Capturing readers’ engagement in fiction is a challenging but important aspect of narrative understanding. In this study, we collected 23 readers’ reactions to 2 short stories through eye tracking, sentence-level annotations, and an overall engagement scale survey. We analyzed the significance of various qualities of the text in predicting how engaging a reader is likely to find it. As enjoyment of fiction is highly contextual, we also investigated individual differences in our data. Furthering our understanding of what captivates readers in fiction will help better inform models used in creative narrative generation and collaborative writing tools.