2024
pdf
bib
Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)
Eduard Dragut
|
Yunyao Li
|
Lucian Popa
|
Slobodan Vucetic
|
Shashank Srivastava
Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)
pdf
bib
abs
X-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification
Hanzi Xu
|
Muhao Chen
|
Lifu Huang
|
Slobodan Vucetic
|
Wenpeng Yin
Findings of the Association for Computational Linguistics: ACL 2024
In recent years, few-shot and zero-shot learning, which learn to predict labels with limited annotated instances, have garnered significant attention. Traditional approaches often treat frequent-shot (freq-shot; labels with abundant instances), few-shot, and zero-shot learning as distinct challenges, optimizing systems for just one of these scenarios. Yet, in real-world settings, label occurrences vary greatly. Some of them might appear thousands of times, while others might only appear sporadically or not at all. For practical deployment, it is crucial that a system can adapt to any label occurrence. We introduce a novel classification challenge: **X-shot**, reflecting a real-world context where freq-shot, few-shot, and zero-shot labels co-occur without predefined limits. Here, **X** can span from 0 to positive infinity. The crux of **X-shot** centers on open-domain generalization and devising a system versatile enough to manage various label scenarios. To solve **X-shot**, we propose **BinBin** (**B**inary **IN**ference **B**ased on **IN**struction following) that leverages the Indirect Supervision from a large collection of NLP tasks via instruction following, bolstered by Weak Supervision provided by large language models. **BinBin** surpasses previous state-of-the-art techniques on three benchmark datasets across multiple domains. To our knowledge, this is the first work addressing **X-shot** learning, where **X** remains variable.
pdf
bib
abs
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Ziyu Yang
|
Santhosh Cherian
|
Slobodan Vucetic
Findings of the Association for Computational Linguistics: ACL 2024
Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been an increasing interest in sharing those reports with patients, necessitating providing them patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.
2023
pdf
bib
abs
Data Augmentation for Radiology Report Simplification
Ziyu Yang
|
Santhosh Cherian
|
Slobodan Vucetic
Findings of the Association for Computational Linguistics: EACL 2023
This work considers the development of a text simplification model to help patients better understand their radiology reports. This paper proposes a data augmentation approach to address the data scarcity issue caused by the high cost of manual simplification. It prompts a large foundational pre-trained language model to generate simplifications of unlabeled radiology sentences. In addition, it uses paraphrasing of labeled radiology sentences. Experimental results show that the proposed data augmentation approach enables the training of a significantly more accurate simplification model than the baselines.
2022
pdf
bib
abs
OpenStance: Real-world Zero-shot Stance Detection
Hanzi Xu
|
Slobodan Vucetic
|
Wenpeng Yin
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
Prior studies of zero-shot stance detection identify the attitude of texts towards unseen topics occurring in the same document corpus. Such task formulation has three limitations: (i) Single domain/dataset. A system is optimized on a particular dataset from a single domain; therefore, the resulting system cannot work well on other datasets; (ii) the model is evaluated on a limited number of unseen topics; (iii) it is assumed that part of the topics has rich annotations, which might be impossible in real-world applications. These drawbacks will lead to an impractical stance detection system that fails to generalize to open domains and open-form topics. This work defines OpenStance: open-domain zero-shot stance detection, aiming to handle stance detection in an open world with neither domain constraints nor topic-specific annotations. The key challenge of OpenStance lies in open-domain generalization: learning a system with fully unspecific supervision but capable of generalizing to any dataset. To solve OpenStance, we propose to combine indirect supervision, from textual entailment datasets, and weak supervision, from data generated automatically by pre-trained Language Models. Our single system, without any topic-specific supervision, outperforms the supervised method on three popular datasets. To our knowledge, this is the first work that studies stance detection under the open-domain zero-shot setting. All data and code will be publicly released.
pdf
bib
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
Eduard Dragut
|
Yunyao Li
|
Lucian Popa
|
Slobodan Vucetic
|
Shashank Srivastava
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
2021
pdf
bib
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Eduard Dragut
|
Yunyao Li
|
Lucian Popa
|
Slobodan Vucetic
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
pdf
bib
abs
A Visualization Approach for Rapid Labeling of Clinical Notes for Smoking Status Extraction
Saman Enayati
|
Ziyu Yang
|
Benjamin Lu
|
Slobodan Vucetic
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Labeling is typically the most human-intensive step during the development of supervised learning models. In this paper, we propose a simple and easy-to-implement visualization approach that reduces cognitive load and increases the speed of text labeling. The approach is fine-tuned for task of extraction of patient smoking status from clinical notes. The proposed approach consists of the ordering of sentences that mention smoking, centering them at smoking tokens, and annotating to enhance informative parts of the text. Our experiments on clinical notes from the MIMIC-III clinical database demonstrate that our visualization approach enables human annotators to label sentences up to 3 times faster than with a baseline approach.
2020
pdf
bib
abs
Improving Word Embeddings through Iterative Refinement of Word- and Character-level Models
Phong Ha
|
Shanshan Zhang
|
Nemanja Djuric
|
Slobodan Vucetic
Proceedings of the 28th International Conference on Computational Linguistics
Embedding of rare and out-of-vocabulary (OOV) words is an important open NLP problem. A popular solution is to train a character-level neural network to reproduce the embeddings from a standard word embedding model. The trained network is then used to assign vectors to any input string, including OOV and rare words. We enhance this approach and introduce an algorithm that iteratively refines and improves both word- and character-level models. We demonstrate that our method outperforms the existing algorithms on 5 word similarity data sets, and that it can be successfully applied to job title normalization, an important problem in the e-recruitment domain that suffers from the OOV problem.
2019
pdf
bib
abs
Spatial Aggregation Facilitates Discovery of Spatial Topics
Aniruddha Maiti
|
Slobodan Vucetic
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Spatial aggregation refers to merging of documents created at the same spatial location. We show that by spatial aggregation of a large collection of documents and applying a traditional topic discovery algorithm on the aggregated data we can efficiently discover spatially distinct topics. By looking at topic discovery through matrix factorization lenses we show that spatial aggregation allows low rank approximation of the original document-word matrix, in which spatially distinct topics are preserved and non-spatial topics are aggregated into a single topic. Our experiments on synthetic data confirm this observation. Our experiments on 4.7 million tweets collected during the Sandy Hurricane in 2012 show that spatial and temporal aggregation allows rapid discovery of relevant spatial and temporal topics during that period. Our work indicates that different forms of document aggregation might be effective in rapid discovery of various types of distinct topics from large collections of documents.
2018
pdf
bib
abs
Regular Expression Guided Entity Mention Mining from Noisy Web Data
Shanshan Zhang
|
Lihong He
|
Slobodan Vucetic
|
Eduard Dragut
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to a vast diversity of web documents and ways in which they are being generated, even seemingly straightforward tasks such as identifying mentions of date in a document become very challenging. It is reasonable to claim that it is impossible to create a RE that is capable of identifying such entities from web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, ability of deep learning to learn from large data, and human-in-the loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting the existing REs for a particular type of an entity. Those REs are then used over a large document corpus to collect weak labels for the entity mentions and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents and the neural network is fine tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy, while requiring very modest human effort.