Estevam Hruschka


2024

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour | Estevam Hruschka
Findings of the Association for Computational Linguistics: NAACL 2024

Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive to prompt wording, and to few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs’ robustness on the task of multiple-choice questions, a commonly adopted task to study the reasoning and fact-retrieving capabilities of LLMs. Investigating the sensitivity of LLMs to the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 85% in LLMs on different benchmarks when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and that, because of positional bias, specific option placements may favor certain predictions among those top choices depending on the question. We also identify patterns in top-2 choices that amplify or mitigate the model’s bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs’ predictions, leading to up to 8 percentage points improvement across different models and benchmarks.

Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks
Aditi Mishra | Sajjadur Rahman | Kushan Mitra | Hannah Kim | Estevam Hruschka
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. However, their ability to generate rationales for knowledge-intensive tasks (KITs) remains under-explored. Generating rationales for KIT solutions, such as commonsense multiple-choice QA, requires external knowledge to support predictions and refute alternate options. In this work, we consider the task of generating retrieval-augmented rationalization of KIT model predictions via external knowledge guidance within a few-shot setting. Surprisingly, crowd-workers preferred LLM-generated rationales over existing crowd-sourced rationales, generated in a similar knowledge-guided setting, on aspects such as factuality, sufficiency, and convincingness. However, fine-grained evaluation of such rationales highlights the need for further improvements in conciseness, novelty, and domain invariance. Additionally, through an expert-sourced study evaluating the reliability of the rationales, we demonstrate that humans’ trust in LLM-generated rationales erodes when communicated faithfully, i.e., without taking model prediction accuracy into account. We find that even instrumenting simple guardrails can be effective for reliable rationalization.

Less is More for Long Document Summary Evaluation by LLMs
Yunshu Wu | Hayate Iso | Pouya Pezeshkpour | Nikita Bhutani | Estevam Hruschka
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.

Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)
Estevam Hruschka | Thom Lake | Naoki Otani | Tom Mitchell
Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)

2023

Zero-shot Triplet Extraction by Template Infilling
Bosung Kim | Hayate Iso | Nikita Bhutani | Estevam Hruschka | Ndapa Nakashole | Tom Mitchell
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)
Estevam Hruschka | Tom Mitchell | Sajjadur Rahman | Dunja Mladenić | Marko Grobelnik
Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)

2022

Distilling Salient Reviews with Zero Labels
Chieh-Yang Huang | Jinfeng Li | Nikita Bhutani | Alexander Whedon | Estevam Hruschka | Yoshi Suhara
Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)

Many people read online reviews to learn about real-world entities of interest. However, the majority of reviews only describe the general experiences and opinions of customers, and may not reveal facts that are specific to the entity being reviewed. In this work, we focus on a novel task: mining, from a review corpus, sentences that are unique to each entity. We refer to this task as Salient Fact Extraction. Salient facts are extremely scarce due to their very nature. Consequently, collecting labeled examples for training supervised models is tedious and cost-prohibitive. To alleviate this scarcity problem, we develop an unsupervised method, ZL-Distiller, which leverages contextual language representations of the reviews and their distributional patterns to identify salient sentences about entities. Our experiments on multiple domains (hotels, products, and restaurants) show that ZL-Distiller achieves state-of-the-art performance and further boosts the performance of other supervised/unsupervised algorithms for the task. Furthermore, we show that salient sentences mined by ZL-Distiller provide unique and detailed information about entities, which benefits downstream NLP applications including question answering and summarization.

MEGAnno: Exploratory Labeling for NLP in Computational Notebooks
Dan Zhang | Hannah Kim | Rafael Li Chen | Eser Kandogan | Estevam Hruschka
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

We present MEGAnno, a novel exploratory annotation framework designed for NLP researchers and practitioners. Unlike existing labeling tools that focus on data labeling only, our framework aims to support a broader, iterative ML workflow including data exploration and model development. With MEGAnno’s API, users can programmatically explore the data through sophisticated search and automated suggestion functions and incrementally update the task schema as their project evolves. Combined with our widget, users can interactively sort, filter, and assign labels to multiple items simultaneously in the same notebook where the rest of the NLP project resides. We demonstrate MEGAnno’s flexible, exploratory, efficient, and seamless labeling experience through a sentiment analysis use case.

Low-resource Entity Set Expansion: A Comprehensive Study on User-generated Text
Yutong Shao | Nikita Bhutani | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: NAACL 2022

Entity set expansion (ESE) aims at obtaining a more complete set of entities given a textual corpus and a seed set of entities of a concept. Although it is a critical task in many NLP applications, existing benchmarks are limited to well-formed text (e.g., Wikipedia) and well-defined concepts (e.g., countries and diseases). Furthermore, only a small number of predictions are evaluated compared to the actual size of an entity set. A rigorous assessment of ESE methods warrants more comprehensive benchmarks and evaluation. In this paper, we consider user-generated text to understand the generalizability of ESE methods. We develop new benchmarks and propose more rigorous evaluation metrics for assessing the performance of ESE methods. Additionally, we identify phenomena such as non-named entities, multifaceted entities, and vague concepts that are more prevalent in user-generated text than in well-formed text, and use them to profile ESE methods. We observe that the strong performance of state-of-the-art ESE methods does not generalize well to user-generated text. We conduct a comprehensive empirical analysis and draw insights from the findings.

Low-resource Interactive Active Labeling for Fine-tuning Language Models
Seiji Maekawa | Dan Zhang | Hannah Kim | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, active learning (AL) methods have been used to effectively fine-tune pre-trained language models for various NLP tasks such as sentiment analysis and document classification. However, for the task of fine-tuning language models, the impact on AL methods of factors such as labeling cost, sample acquisition latency, and dataset diversity necessitates a deeper investigation. This paper examines the performance of existing AL methods within a low-resource, interactive labeling setting. We observe that existing methods often underperform in such a setting while exhibiting higher latency and a lack of generalizability. To overcome these challenges, we propose a novel active learning method, TYROUGE, that employs a hybrid sampling strategy to minimize labeling cost and acquisition latency while providing a framework for adapting to dataset diversity via user guidance. Through our experiments, we observe that compared to SOTA methods, TYROUGE reduces the labeling cost by up to 43% and the acquisition latency by as much as 11X, while achieving comparable accuracy. Finally, we discuss the strengths and weaknesses of TYROUGE by exploring the impact of dataset characteristics.

Proceedings of the 2nd Workshop on Deriving Insights from User-Generated Text
Estevam Hruschka | Tom Mitchell | Dunja Mladenic | Marko Grobelnik | Nikita Bhutani
Proceedings of the 2nd Workshop on Deriving Insights from User-Generated Text

2014

Biocom Usp: Tweet Sentiment Analysis with Adaptive Boosting Ensemble
Nádia Silva | Estevam Hruschka | Eduardo Hruschka
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2011

Discovering Relations between Noun Categories
Thahir Mohamed | Estevam Hruschka | Tom Mitchell
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing