Dhananjay Ashok

2025

A Little Human Data Goes A Long Way
Dhananjay Ashok | Jonathan May
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Evidence-based Question Answering (QA) by incrementally replacing human-generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be improved by including as few as 125 human-generated data points. We show that matching the performance gain of a little human data requires an order of magnitude more synthetic data, and then estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human-generated.

pdf bib abs

Can VLMs Recall Factual Associations From Visual References?
Dhananjay Ashok | Ashutosh Chaubey | Hirona Jacqueline Arai | Jonathan May | Jesse Thomason
Findings of the Association for Computational Linguistics: EMNLP 2025

Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.

2023

pdf bib abs

The student becomes the master: Outperforming GPT3 on Scientific Factual Error Correction
Dhananjay Ashok | Atharva Kulkarni | Hai Pham | Barnabas Poczos
Findings of the Association for Computational Linguistics: EMNLP 2023

Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work we introduce SciFix, a claim correction system that does not require a verifier but is able to outperform existing methods by a considerable margin — achieving correction accuracy of 84% on the SciFact dataset, 77% on SciFact-Open and 72.75% on the CovidFact dataset, compared to next best accuracies of 7.6%, 5% and 15% on the same datasets respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method outperforms the very LLM that was used to generate the annotated dataset — with FewShot Prompting on GPT3.5 achieving 58%, 61% and 64% on the respective datasets, a consistently lower correction accuracy, despite using nearly 800 times as many parameters as our model.

Co-authors

Barnabás Póczos 1

Jesse Thomason 1

Venues

findings2
acl1

Fix author