Lucian Popa

2025

Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present mtRAG, an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. mtRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on mtRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. mtRAG is available at https://github.com/ibm/mt-rag-benchmark.

Identifying Noise in Human-Created Datasets using Training Dynamics from Generative Models
Maeda Hanafi | Ishan Jindal | Yannis Katsis | Lucian Popa | Huaiyu Zhu
Findings of the Association for Computational Linguistics: EMNLP 2025

Instruction fine-tuning enhances the alignment of autoregressive language models (ArLMs) with human intent but relies on large-scale annotated datasets prone to label and text noise. In this paper, we show that existing noise detection techniques designed for autoencoder models (AeLMs) do not directly generalize to ArLMs due to differences in learning dynamics. We propose TDRanker, a novel approach leveraging training dynamics to rank datapoints from easy-to-learn to hard-to-learn, effectively identifying noisy instances. Our method demonstrates robustness across multiple model architectures covering both autoencoder and autoregressive language models (GPT-2, BERT, LaMini-Cerebras-256M) and across various dataset noise levels, achieving at least 2x faster denoising than previous techniques. Applied to real-world classification and generative tasks, TDRanker significantly improves data quality and model performance. These findings suggest that TDRanker provides a scalable solution for refining instruction-tuning datasets, enhancing the reliability of fine-tuned ArLMs in practical applications.

2024

Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic | Shashank Srivastava

ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation
Kashob Kumar Roy | Pritom Saha Akash | Kevin Chen-Chuan Chang | Lucian Popa
Findings of the Association for Computational Linguistics: EMNLP 2024

Open-domain long-form text generation requires generating coherent, comprehensive responses that address complex queries with both breadth and depth. This task is challenging due to the need to accurately capture diverse facets of input queries. Existing iterative retrieval-augmented generation (RAG) approaches often struggle to delve deeply into each facet of complex queries and integrate knowledge from various sources effectively. This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval approach to enhance the depth and relevance of retrieved content. ConTReGen integrates a hierarchical, top-down in-depth exploration of query facets with a systematic bottom-up synthesis, ensuring comprehensive coverage and coherent integration of multifaceted information. Extensive experiments on multiple datasets, including LFQA and ODSUM, alongside a newly introduced dataset, ODSUM-WikiHow, demonstrate that ConTReGen outperforms existing state-of-the-art RAG models.

2023

Real-world domain experts (e.g., doctors) rarely annotate only a decision label in their day-to-day workflow without providing explanations. Yet, existing low-resource learning techniques, such as Active Learning (AL), that aim to support human annotators mostly focus on the label while neglecting the natural language explanation of a data point. This work proposes a novel AL architecture to support experts’ real-world need for label and explanation annotations in low-resource scenarios. Our AL architecture leverages an explanation-generation model to produce explanations guided by human explanations, a prediction model that utilizes generated explanations toward prediction faithfully, and a novel data diversity-based AL sampling strategy that benefits from the explanation annotations. Automated and human evaluations demonstrate the effectiveness of incorporating explanations into AL sampling and the improved human annotation efficiency and trustworthiness with our AL architecture. Additional ablation studies illustrate the potential of our AL architecture for transfer learning, generalizability, and integration with large language models (LLMs). While LLMs exhibit exceptional explanation-generation capabilities for relatively simple tasks, their effectiveness in complex real-world tasks warrants further in-depth study.

Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations
Bingsheng Yao | Prithviraj Sen | Lucian Popa | James Hendler | Dakuo Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human-annotated labels and explanations are critical for training explainable NLP models. However, unlike human-annotated labels whose quality is easier to calibrate (e.g., with a majority vote), human-crafted free-form explanations can be quite subjective. Before blindly using them as ground truth to train ML models, a vital question needs to be asked: How do we evaluate a human-annotated explanation’s quality? In this paper, we build on the view that the quality of a human-annotated explanation can be measured based on its helpfulness (or impairment) to the ML models’ performance for the desired NLP tasks for which the annotations were collected. In comparison to the commonly used Simulatability score, we define a new metric that can take into consideration the helpfulness of an explanation for model performance at both fine-tuning and inference. With the help of a unified dataset format, we evaluated the proposed metric on five datasets (e.g., e-SNLI) against two model architectures (T5 and BART), and the results show that our proposed metric can objectively evaluate the quality of human-annotated explanations, while Simulatability falls short.

2022

Stock Price Volatility Prediction: A Case Study with AutoML
Hilal Pataci | Yunyao Li | Yannis Katsis | Yada Zhu | Lucian Popa
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Accurate prediction of stock price volatility, the rate at which the price of a stock increases or decreases over a particular period, is an important problem in finance. Inaccurate prediction of stock price volatility might lead to investment risk and financial loss, while accurate prediction might generate significant returns for investors. Several studies have investigated stock price volatility prediction as a regression task by using the transcripts of earnings calls (quarterly conference calls held by public companies) with Natural Language Processing (NLP) techniques. Existing studies use the entire transcript, which degrades performance because of noise from irrelevant information that might not have a significant impact on stock price volatility. To overcome this limitation, we treat stock price volatility prediction as a classification task, explore several denoising approaches, ranging from general-purpose approaches to finance-specific techniques, and leverage AutoML systems that enable auto-exploration of a wide variety of models. Our preliminary findings indicate that domain-specific denoising approaches provide better results than general-purpose approaches and that AutoML systems provide promising results.

Cross-lingual Short-text Entity Linking: Generating Features for Neuro-Symbolic Methods
Qiuhao Lu | Sairam Gurajada | Prithviraj Sen | Lucian Popa | Dejing Dou | Thien Nguyen
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

Entity linking (EL) on short text is crucial for a variety of industrial applications. Compared with general long-text EL, short-text EL poses particular challenges as the limited context restricts the clues one can leverage to disambiguate textual mentions. On the other hand, existing studies mostly focus on black-box neural methods and thus lack interpretability, which is critical to industrial applications in certain areas. In this study, we extend LNN-EL, a monolingual short-text EL method based on interpretable first-order logic, by incorporating three sets of multilingual features to enable disambiguating mentions written in languages other than English. More specifically, we use multilingual autoencoding language models (i.e., mBERT) to capture the similarities between the mention with its context and the candidate entity; we use multilingual sequence-to-sequence language models (i.e., mBART and mT5) to represent the likelihood of the text given the candidate entity. We also propose a word-level context feature to capture the semantic evidence of the co-occurring mentions. We evaluate the proposed xLNN-EL approach on the QALD-9-multilingual dataset and demonstrate the cross-linguality of the model and the effectiveness of the features.

A Comparative Analysis between Human-in-the-loop Systems and Large Language Models for Pattern Extraction Tasks
Maeda Hanafi | Yannis Katsis | Ishan Jindal | Lucian Popa
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

Building a natural language processing (NLP) model can be challenging for end-users such as analysts, journalists, investigators, etc., especially given that they will likely apply existing tools out of the box. In this article, we take a closer look at how two complementary approaches, a state-of-the-art human-in-the-loop (HITL) tool and a generative language model (GPT-3), perform out of the box, that is, without fine-tuning. Concretely, we compare these approaches when end-users with little technical background are given pattern extraction tasks from text. We discover that the HITL tool performs with higher precision, while GPT-3 requires some level of engineering in its input prompts as well as post-processing on its output before it can achieve comparable results. Future work in this space should look further into the advantages and disadvantages of the two approaches, HITL and generative language model, as well as into ways to optimally combine them.

Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming
Ayush Maheshwari | Krishnateja Killamsetty | Ganesh Ramakrishnan | Rishabh Iyer | Marina Danilevsky | Lucian Popa
Findings of the Association for Computational Linguistics: ACL 2022

A critical bottleneck in supervised machine learning is the need for large amounts of labeled data, which is expensive and time-consuming to obtain. Although a small amount of labeled data cannot be used to train a model, it can be used effectively for the generation of human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data in a paradigm that is now commonly referred to as data programming. Previous methods of generating LFs do not attempt to use the given labeled data further to train a model, thus missing opportunities for improving performance. Additionally, since the LFs are generated automatically, they are likely to be noisy, and naively aggregating these LFs can lead to suboptimal results. In this work, we propose an LF-based bi-level optimization framework WISDOM to solve these two critical limitations. WISDOM learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that WISDOM significantly outperforms prior approaches on several text classification datasets.

Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic | Shashank Srivastava

We propose a probabilistic approach to select, from a candidate set, a subset of keywords that is representative of a target domain, in contrast with a context domain. Such a task is crucial for many downstream tasks in natural language processing. To contrast the target domain and the context domain, we adapt the two-component mixture model concept to generate a distribution of candidate keywords. It gives more importance to the distinctive keywords of the target domain than to common keywords shared with the context domain. To support the representativeness of the selected keywords towards the target domain, we introduce an optimization algorithm for selecting the subset from the generated candidate distribution. We show that the optimization algorithm can be efficiently implemented with a near-optimal approximation guarantee. Finally, extensive experiments on multiple domains demonstrate the superiority of our approach over other baselines for the tasks of keyword summary generation and trending keywords selection.

2021

Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic

LNN-EL: A Neuro-Symbolic Approach to Short-text Entity Linking
Hang Jiang | Sairam Gurajada | Qiuhao Lu | Sumit Neelam | Lucian Popa | Prithviraj Sen | Yunyao Li | Alexander Gray
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Entity linking (EL) is the task of disambiguating mentions appearing in text by linking them to entities in a knowledge graph, a crucial task for text understanding, question answering or conversational systems. In the special case of short-text EL, which poses additional challenges due to limited context, prior approaches have reached good performance by employing heuristics-based methods or purely neural approaches. Here, we take a different, neuro-symbolic approach that combines the advantages of using interpretable rules based on first-order logic with the performance of neural learning. Even though constrained to use rules, we show that we reach performance competitive with or better than SoTA black-box neural approaches. Furthermore, our framework has the benefits of extensibility and transferability. We show that we can easily blend existing rule templates given by a human expert with multiple types of features (priors, BERT encodings, box embeddings, etc.), and even with scores resulting from previous EL methods, thus improving on such methods. As an example of improvement, on the LC-QuAD-1.0 dataset, we show more than 3% increase in F1 score relative to previous SoTA. Finally, we show that the inductive bias offered by using logic results in a set of learned rules that transfers from one dataset to another, sometimes without finetuning, while still having high accuracy.

2020

Learning Structured Representations of Entity Names using Active Learning and Weak Supervision
Kun Qian | Poornima Chozhiyath Raman | Yunyao Li | Lucian Popa
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Structured representations of entity names are useful for many entity-related tasks such as entity normalization and variant generation. Learning the implicit structured representations of entity names without context and external knowledge is particularly challenging. In this paper, we present a novel learning framework that combines active learning and weak supervision to solve this problem. Our experimental evaluation shows that this framework enables the learning of high-quality models from merely a dozen or so labeled examples.

2019

Low-resource Deep Entity Resolution with Transfer and Active Learning
Jungo Kasai | Kun Qian | Sairam Gurajada | Yunyao Li | Lucian Popa
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Entity resolution (ER) is the task of identifying different representations of the same real-world entities across databases. It is a key step for knowledge base creation and text mining. Recent adaptation of deep learning methods for ER mitigates the need for dataset-specific feature engineering by constructing distributed representations of entity records. While these methods achieve state-of-the-art performance over benchmark data, they require large amounts of labeled data, which are typically unavailable in realistic ER applications. In this paper, we develop a deep learning-based method that targets low-resource settings for ER through a novel combination of transfer learning and active learning. We design an architecture that allows us to learn a transferable model from a high-resource setting to a low-resource one. To further adapt to the target dataset, we incorporate active learning that carefully selects a few informative examples to fine-tune the transferred model. Empirical evaluation demonstrates that our method achieves comparable, if not better, performance compared to state-of-the-art learning-based methods while using an order of magnitude fewer labels.