Sunil Kumar Sahu
2026
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
2024
Class Name Guided Out-of-Scope Intent Classification
Chandan Gautam | Sethupathy Parameswaran | Aditya Kane | Yuan Fang | Savitha Ramasamy | Suresh Sundaram | Sunil Kumar Sahu | Xiaoli Li
Findings of the Association for Computational Linguistics: EMNLP 2024
2022
DENTRA: Denoising and Translation Pre-training for Multilingual Machine Translation
Samta Kamboj | Sunil Kumar Sahu | Neha Sengupta
Proceedings of the Seventh Conference on Machine Translation (WMT)
In this paper, we describe our submission to the WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages, under the Constrained Translation track. We introduce DENTRA, a novel pre-training strategy for a multilingual sequence-to-sequence transformer model. DENTRA pre-training combines denoising and translation objectives to incorporate both monolingual and bitext corpora covering 24 African languages, English, and French. To evaluate the quality of DENTRA, we fine-tuned it in two multilingual machine translation configurations, one-to-many and many-to-one. In both pre-training and fine-tuning, we use only the datasets provided by the organizers. We compare DENTRA against a strong baseline, M2M-100, in different African multilingual machine translation scenarios and show gains in 3 out of 4 subtasks.
2020
Autoencoding Keyword Correlation Graph for Document Clustering
Billy Chiu | Sunil Kumar Sahu | Derek Thomas | Neha Sengupta | Mohammady Mahdy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Document clustering requires a deep understanding of the complex structure of long text, in particular its intra-sentential (local) and inter-sentential (global) features. Existing representation learning models do not fully capture these features. To address this, we present a novel graph-based representation for document clustering that builds a graph autoencoder (GAE) on a Keyword Correlation Graph. The graph is constructed with topical keywords as nodes and multiple local and global features as edges. A GAE is employed to aggregate the two sets of features by learning a latent representation that can jointly reconstruct them. Clustering is then performed on the learned representations, using vector dimensions as features for inducing document classes. Extensive experiments on two datasets show that the features learned by our approach achieve better clustering performance than other existing features, including term frequency-inverse document frequency and average embedding.
2019
Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network
Sunil Kumar Sahu | Fenia Christopoulou | Makoto Miwa | Sophia Ananiadou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Inter-sentence relation extraction deals with a number of complex semantic relationships in documents, which require modelling local, non-local, syntactic and semantic dependencies. Existing methods do not fully exploit such dependencies. We present a novel inter-sentence relation extraction model that builds a labelled edge graph convolutional neural network model on a document-level graph. The graph is constructed using various inter- and intra-sentence dependencies to capture local and non-local dependency information. To predict the relation of an entity pair, we utilise multi-instance learning with bi-affine pairwise scoring. Experimental results show that our model achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. Our analysis shows that all the types in the graph are effective for inter-sentence relation extraction.
2017
Co-authors
- Neha Sengupta 3
- Samta Kamboj 2
- Utkarsh Agarwal 1
- Ashish Anand 1
- Sophia Ananiadou 1
- Debopriyo Banerjee 1
- Junaid Hamid Bhat 1
- Shivam Chauhan 1
- Kushal Chawla 1
- Billy Chiu 1
- Mukund Choudhary 1
- Monojit Choudhury 1
- Fenia Christopoulou 1
- Rocktim Jyoti Das 1
- Ali El Filali 1
- Yuan Fang 1
- Chandan Gautam 1
- Samujjwal Ghosh 1
- Gurpreet Gosal 1
- Xudong Han 1
- Alok Anil Jadhav 1
- Rituraj Joshi 1
- Aditya Kane 1
- Fajri Koto 1
- Xiaoli Li 1
- Haonan Li 1
- Mohammady Mahdy 1
- Makoto Miwa 1
- Parvez Mullah 1
- Preslav Nakov 1
- Rahul Pal 1
- Onkar Arun Pandit 1
- Sethupathy Parameswaran 1
- Lalit Pradhan 1
- Zainul Abedien Ahmed Quraishi 1
- Gokulakrishnan Ramakrishnan 1
- Savitha Ramasamy 1
- Dhruv Sahnan 1
- Avraham Sheinin 1
- Awantika Shukla 1
- Aaryamonvikram Singh 1
- Suresh Sundaram 1
- Derek Thomas 1
- Natalia Vassilieva 1