Joydeb Mondal

2025

ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM)
Nagaraj N Bhat | Joydeb Mondal | Srijon Sarkar
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

We introduce RAVQA-VLM, a novel Retrieval-Augmented Generation (RAG) architecture with Vision Language Model for the SciVQA challenge, which targets closed-ended visual and nonvisual questions over scientific figures drawn from ACL Anthology and arXiv papers (Borisova and Rehm, 2025). Our system first encodes each input figure and its accompanying metadata (caption, figure ID, type) into dense embed- dings, then retrieves context passages from the full PDF of the source paper via a Dense Passage Retriever (Karpukhin et al., 2020). The extracted contexts are concatenated with the question and passed to a vision-capable generative backbone (e.g., Phi-3.5, Pixtral-12B, Mixtral-24B-small, InterVL-3-14B) fine-tuned on the 15.1K SciVQA training examples (Yang et al., 2023; Pramanick et al., 2024). We jointly optimize retrieval and generation end-to-end to minimize answer loss and mitigate hallucinations (Lewis et al., 2020; Rujun Han and Castelli, 2024). On the SciVQA test set, RAVQA-VLM achieves significant improvements over parametric only baselines, with relative gains of +5% ROUGE1 and +5% ROUGE-L, demonstrating the efficacy of RAG for multimodal scientific QA.

pdf bib abs

Modgenix at SemEval-2025 Task 1: Context Aware Vision Language Ranking (CAViLR) for Multimodal Idiomaticity Understanding
Joydeb Mondal | Pramir Sarkar
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents CAViLR, a hybrid multimodal approach for SemEval-2025 Task 1. Our methodintegrates CLIP as a baseline with a Mixture of Experts (MoE) framework that dynamically selectsexpert models such as Pixtral-12B and Phi-3.5 based on input context. The approach addresseschallenges in both image ranking and image sequence prediction, improving the alignment of visualand textual semantics. Experimental results demonstrate that our hybrid model outperforms individualmodels. Future work will focus on refining expert selection and enhancing disambiguation strategiesfor complex idiomatic expressions.

2022

pdf bib abs

ExpertNeurons at FinCausal 2022 Task 2: Causality Extraction for Financial Documents
Joydeb Mondal | Nagaraj Bhat | Pramir Sarkar | Shahid Reza
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper describes the approach which we have built for causality extraction from the financial documents that we have submitted for FinCausal 2022 task 2. We proving a solution with intelligent pre-processing and post-processing to detect the number of cause and effect in a financial document and extract them. Our given approach achieved 90% as F1 score(weighted-average) for the official blind evaluation dataset.

Co-authors

Venues

Fix author