Aidong Zhang

2025

Existing LLM-based medical question answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of LLM citations for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations.Our extensive evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that our evaluation results correlate well with annotation results from professional experts.

pdf bib abs

InfAL: Inference Time Adversarial Learning for Improving Research Ideation
Sikun Guo | Amir Hassan Shariatmadari | Peng Wang | Albert Huang | Aidong Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Advancements in Large Language Models (LLMs) have opened new opportunities for scientific discovery by assisting researchers in generating novel hypotheses and ideas. In this process, a major challenge is how to optimally and efficiently utilize LLMs’ parametric knowledge obtained from their pretraining process. Inspired by Generative Adversarial Networks (GANs), we propose inference time adversarial learning (termed InfAL), implemented through multi-LLM-agent interactions, to enhance research ideation. This approach optimizes the utilization of LLMs’ parametric knowledge without requiring additional model training, making adversarial learning efficient and context-driven. To evaluate the quality of generated ideas, we propose a relative quality ranking metric as a scalable alternative to human evaluation. Our results show that InfAL significantly improves idea generation, with GPT-4o achieving a 21% increase in novelty and a 322% increase in feasibility, demonstrating its transformative potential for driving innovation in scientific research.

pdf bib abs

COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision-Language Models
Sanchit Sinha | Guangzhi Xiong | Aidong Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple research works have attempted to improve compositionality performance by creative tricks such as improving prompt structure, chain of thought reasoning, etc. A more recent line of work attempts to impart additional reasoning in VLMs using well-trained Large Language Models (LLMs), which are far superior in linguistic understanding than VLMs to compensate for the limited linguistic prowess of VLMs. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present “COCO-Tree” - a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLM’s linguistic reasoning. COCO-Tree’s beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks, Winoground, EqBench, ColorSwap, and SugarCrepe, in seven different open-source VLMs with varying sizes, demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.

2024

pdf bib abs

Benchmarking Retrieval-Augmented Generation for Medicine
Guangzhi Xiong | Qiao Jin | Zhiyong Lu | Aidong Zhang
Findings of the Association for Computational Linguistics: ACL 2024

While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the “lost-in-the-middle” effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.

2023

pdf bib abs

On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval
Jiayi Chen | Hanjun Dai | Bo Dai | Aidong Zhang | Wei Wei
Findings of the Association for Computational Linguistics: EMNLP 2023

Visually-rich document entity retrieval (VDER), which extracts key information (e.g. date, address) from document images like invoices and receipts, has become an important topic in industrial NLP applications. The emergence of new document types at a constant pace, each with its unique entity types, presents a unique challenge: many documents contain unseen entity types that occur only a couple of times. Addressing this challenge requires models to have the ability of learning entities in a few-shot manner. However, prior works for Few-shot VDER mainly address the problem at the document level with a predefined global entity space, which doesn’t account for the entity-level few-shot scenario: target entity types are locally personalized by each task and entity occurrences vary significantly among documents. To address this unexplored scenario, this paper studies a novel entity-level few-shot VDER task. The challenges lie in the uniqueness of the label space for each task and the increased complexity of out-of-distribution (OOD) contents. To tackle this novel task, we present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization that distinguishes between in-task and out-of-task distribution. Specifically, we adopt a hierarchical decoder (HC) and employ contrastive learning (ContrastProtoNet) to achieve this goal. Furthermore, we introduce a new dataset, FewVEX, to boost future research in the field of entity-level few-shot VDER. Experimental results demonstrate our approaches significantly improve the robustness of popular meta-learning baselines.

Co-authors

Bo Dai 1

Sikun Guo 1

Yu Hu 1

Albert Huang 1

Amir Hassan Shariatmadari 1

Wei Wei 1

Venues

Findings4
EMNLP1

Fix author