2025
pdf
bib
abs
GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models
Tuo Wang
|
Adithya Kulkarni
|
Tyler Cody
|
Peter A. Beling
|
Yujun Yan
|
Dawei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.
pdf
bib
abs
SciCompanion: Graph-Grounded Reasoning for Structured Evaluation of Scientific Arguments
Joshua Alan Flashner
|
Adithya Kulkarni
|
Dawei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
The exponential growth of scientific publications has overwhelmed reviewers and researchers, with top conferences receiving thousands of submissions annually. Reviewers must assess feasibility, novelty, and impact under tight deadlines, often lacking tools to identify relevant prior work. Early-career researchers face similar challenges, with limited support to navigate fast-evolving fields. Existing LLM-based systems struggle with static retrieval, surface-level features, and lack multi-hop reasoning, leading to shallow or hallucinated assessments. Scientific evaluation requires a deep, relational understanding, which current retrieval-augmented generation (RAG) methods fail to achieve. We introduce SciCompanion, a graph-grounded reasoning framework for structured scientific evaluation. Given a paper or abstract-like input, SciCompanion builds a dynamic knowledge graph from recent publications, domain-specific databases, and curated metadata. It employs multi-hop reasoning to iteratively construct contextual graphs and generate structured critiques, enabling deeper exploration of scientific literature. Unlike sentiment-biased LLM evaluations, SciCompanion directly optimizes retrieval and graph refinement using Group Relative Policy Optimization (GRPO), producing reviews aligned with expert judgments. Experiments on ICLR and ACL datasets show that SciCompanion reduces evaluation error by over 30% compared to prompting-only baselines and allows smaller models to outperform larger ones. Evaluations across three datasets, using metrics for retrieval accuracy, semantic overlap, and multi-hop sensitivity, along with a case study, demonstrate SciCompanion’s robustness and versatility.
pdf
bib
abs
MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design
Jingyuan Qi
|
Zian Jia
|
Minqian Liu
|
Wangzhi Zhan
|
Junkai Zhang
|
Xiaofei Wen
|
Jingru Gan
|
Jianpeng Chen
|
Qin Liu
|
Mingyu Derek Ma
|
Bangzheng Li
|
Haohui Wang
|
Adithya Kulkarni
|
Muhao Chen
|
Dawei Zhou
|
Ling Li
|
Wei Wang
|
Lifu Huang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
The discovery of novel mechanical metamaterials, whose properties are dominated by their engineered structures rather than chemical composition, is a knowledge-intensive and resource-demanding process. To accelerate the design of novel metamaterials, we present MetaScientist, a human-in-the-loop system that integrates advanced AI capabilities with expert oversight with two primary phases: (1) hypothesis generation, where the system performs complex reasoning to generate novel and scientifically sound hypotheses, supported with domain-specific foundation models and inductive biases retrieved from existing literature; (2) 3D structure synthesis, where a 3D structure is synthesized with a novel 3D diffusion model based on the textual hypothesis and refined it with a LLM-based refinement model to achieve better structure properties. At each phase, domain experts iteratively validate the system outputs, and provide feedback and supplementary materials to ensure the alignment of the outputs with scientific principles and human preferences. Through extensive evaluation from human scientists, MetaScientist is able to deliver novel and valid mechanical metamaterial designs that have the potential to be highly impactful in the metamaterial field.
2024
pdf
bib
abs
Towards Effective Long Conversation Generation with Dynamic Topic Tracking and Recommendation
Trevor Ashby
|
Adithya Kulkarni
|
Jingyuan Qi
|
Minqian Liu
|
Eunah Cho
|
Vaibhav Kumar
|
Lifu Huang
Proceedings of the 17th International Natural Language Generation Conference
During conversations, the human flow of thoughts may result in topic shifts and evolution. In open-domain dialogue systems, it is crucial to track the topics discussed and recommend relevant topics to be included in responses to have effective conversations. Furthermore, topic evolution is needed to prevent stagnation as conversation length increases. Existing open-domain dialogue systems do not pay sufficient attention to topic evolution and shifting, resulting in performance degradation due to ineffective responses as conversation length increases. To address the shortcomings of existing approaches, we propose EvolvConv. EvolvConv conducts real-time conversation topic and user preference tracking and utilizes the tracking information to evolve and shift topics depending on conversation status. We conduct extensive experiments to validate the topic evolving and shifting capabilities of EvolvConv as conversation length increases. Un-referenced evaluation metric UniEval compare EvolvConv with the baselines. Experimental results show that EvolvConv maintains a smooth conversation flow without abruptly shifting topics; the probability of topic shifting ranges between 5%-8% throughout the conversation. EvolvConv recommends 4.77% more novel topics than the baselines, and the topic evolution follows balanced topic groupings. Furthermore, we conduct user surveys to test the practical viability of EvolvConv. User survey results reveal that responses generated by EvolvConv are preferred 47.8% of the time compared to the baselines and comes second to real human responses.
2023
pdf
bib
abs
Zero-shot Approach to Overcome Perturbation Sensitivity of Prompts
Mohna Chakraborty
|
Adithya Kulkarni
|
Qi Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent studies have demonstrated that natural-language prompts can help to leverage the knowledge learned by pre-trained language models for the binary sentence-level sentiment classification task. Specifically, these methods utilize few-shot learning settings to fine-tune the sentiment classification model using manual or automatically generated prompts. However, the performance of these methods is sensitive to the perturbations of the utilized prompts. Furthermore, these methods depend on a few labeled instances for automatic prompt generation and prompt ranking. This study aims to find high-quality prompts for the given task in a zero-shot setting. Given a base prompt, our proposed approach automatically generates multiple prompts similar to the base prompt employing positional, reasoning, and paraphrasing techniques and then ranks the prompts using a novel metric. We empirically demonstrate that the top-ranked prompts are high-quality and significantly outperform the base prompt and the prompts generated using few-shot learning for the binary sentence-level sentiment classification task.
2020
pdf
bib
abs
OptSLA: an Optimization-Based Approach for Sequential Label Aggregation
Nasim Sabetpour
|
Adithya Kulkarni
|
Qi Li
Findings of the Association for Computational Linguistics: EMNLP 2020
The need for the annotated training dataset on which data-hungry machine learning algorithms feed has increased dramatically with advanced acclaim of machine learning applications. To annotate the data, people with domain expertise are needed, but they are seldom available and expensive to hire. This has lead to the thriving of crowdsourcing platforms such as Amazon Mechanical Turk (AMT). However, the annotations provided by one worker cannot be used directly to train the model due to the lack of expertise. Existing literature in annotation aggregation focuses on binary and multi-choice problems. In contrast, little work has been done on complex tasks such as sequence labeling with imbalanced classes, a ubiquitous task in Natural Language Processing (NLP), and Bio-Informatics. We propose OptSLA, an Optimization-based Sequential Label Aggregation method, that jointly considers the characteristics of sequential labeling tasks, workers reliabilities, and advanced deep learning techniques to conquer the challenge. We evaluate our model on crowdsourced data for named entity recognition task. Our results show that the proposed OptSLA outperforms the state-of-the-art aggregation methods, and the results are easier to interpret.