IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Shubham Nigam, Suparnojit Sarkar, Piyush Patel


Abstract
We present IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu). The dataset extends the MDDial corpus with LLM-generated synthetic consultations; the dialogues are translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline that corrects phonetic, lexical, and character-spacing errors introduced during automatic translation. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation (LoRA) of a quantized small language model, incorporating an optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines across ten languages and conduct a systematic error analysis, identifying five failure modes: Instruction Drift, Label Collapse, Cross-Domain Confusion, Tokenization Failure, and Paraphrase-over-Label Generation. Results show strong diagnostic accuracy after post-processing in Hindi, Marathi, and Bengali, while Assamese, Tamil, and Telugu remain in an extreme failure tier attributable to base-model tokenizer gaps, a finding with direct patient-safety implications. Medical expert evaluation confirms the clinical plausibility and safety of the generated consultations.
Anthology ID:
2026.bionlp-1.84
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
1041–1055
URL:
https://aclanthology.org/2026.bionlp-1.84/
DOI:
Bibkey:
Cite (ACL):
Shubham Nigam, Suparnojit Sarkar, and Piyush Patel. 2026. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages. In BioNLP 2026, pages 1041–1055, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages (Nigam et al., BioNLP 2026)
PDF:
https://aclanthology.org/2026.bionlp-1.84.pdf