IndiFoodVQA: Advancing Visual Question Answering and Reasoning with a Knowledge-Infused Synthetic Data Generation Pipeline

Pulkit Agarwal, Settaluri Sravanthi, Pushpak Bhattacharyya


Abstract
Large Vision Language Models (VLMs) like GPT-4, LLaVA, and InstructBLIP exhibit extraordinary capabilities for both knowledge understanding and reasoning. However, the reasoning capabilities of such models on sophisticated problems that require external knowledge of a specific domain have not been assessed well, due to the unavailability of necessary datasets. In this work, we release a first-of-its-kind dataset called IndiFoodVQA with around 16.7k data samples, consisting of explicit knowledge-infused questions, answers, and reasons. We also release IndiFoodKG, a related Knowledge Graph (KG) with 79k triples. The data has been created with minimal human intervention via an automated pipeline based on InstructBlip and GPT-3.5. We also present a methodology to extract knowledge from the KG and use it to both answer and reason upon the questions. We employ different models to report baseline zero-shot and fine-tuned results. Fine-tuned VLMs on our data showed an improvement of ~25% over the corresponding base model, highlighting the fact that current VLMs need domain-specific fine-tuning to excel in specialized settings. Our findings reveal that (1) explicit knowledge infusion during question generation helps in making questions that have more grounded knowledge, and (2) proper knowledge retrieval can often lead to better-answering potential in such cases. The data and code is available at https://github.com/SLSravanthi/IndifoodVQA.
Anthology ID:
2024.findings-eacl.78
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1158–1176
Language:
URL:
https://aclanthology.org/2024.findings-eacl.78
DOI:
Bibkey:
Cite (ACL):
Pulkit Agarwal, Settaluri Sravanthi, and Pushpak Bhattacharyya. 2024. IndiFoodVQA: Advancing Visual Question Answering and Reasoning with a Knowledge-Infused Synthetic Data Generation Pipeline. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1158–1176, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
IndiFoodVQA: Advancing Visual Question Answering and Reasoning with a Knowledge-Infused Synthetic Data Generation Pipeline (Agarwal et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-eacl.78.pdf
Note:
 2024.findings-eacl.78.note.zip