Neha Sengupta
2026
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
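The fertility reduction claimed in (i) is directly measurable: fertility is simply the average number of subword tokens a tokenizer emits per word. A minimal Python sketch follows, assuming Hugging Face tokenizers; the checkpoint names and the Hindi sample are illustrative placeholders rather than the paper's released artifacts.

from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

hindi_sample = ["भारत एक विशाल देश है", "मुझे हिंदी में पढ़ना अच्छा लगता है"]

# Base Llama vocabulary vs. a vocabulary extended with Hindi-specific tokens.
# Both names are placeholders; substitute the released Nanda tokenizer.
base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
extended = AutoTokenizer.from_pretrained("MBZUAI-IFM/Llama-3-Nanda-10B-Chat")

print(f"base:     {fertility(base, hindi_sample):.2f} tokens/word")
print(f"extended: {fertility(extended, hindi_sample):.2f} tokens/word")

On a base Llama vocabulary, each Devanagari word typically splits into many byte-level pieces; the abstract reports that the 20% Hindi-specific additions roughly halve this ratio while leaving English tokenization unchanged.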
2024
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Fajri Koto | Haonan Li | Sara Shatnawi | Jad Doughman | Abdelrahman Sadallah | Aisha Alraeesi | Khalid Almubarak | Zaid Alyafeai | Neha Sengupta | Shady Shehata | Nizar Habash | Preslav Nakov | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL 2024
The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.
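A benchmark of this kind implies the standard multiple-choice protocol: score each answer option by its log-likelihood under the model and pick the highest-scoring one. The Python sketch below illustrates that protocol with Hugging Face APIs; the stand-in model and the function names are assumptions, not the ArabicMMLU evaluation code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_logprob(model, tokenizer, question, option):
    """Sum of log-probabilities the model assigns to the option tokens, given the question."""
    prompt = tokenizer(question, return_tensors="pt").input_ids
    full = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    logp = torch.log_softmax(logits, dim=-1)
    # The logit at position i predicts token i+1, so score only the answer span.
    # (Tokenizing prompt and prompt+option separately can shift BPE boundaries
    # slightly; that is acceptable for a sketch.)
    return sum(logp[0, i, full[0, i + 1]].item()
               for i in range(prompt.shape[1] - 1, full.shape[1] - 1))

def predict(model, tokenizer, question, options):
    return max(options, key=lambda o: option_logprob(model, tokenizer, question, o))

# Tiny stand-in model; the paper evaluates 35 models, most of them far larger.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(predict(lm, tok, "Q: What is 2 + 2? A:", ["3", "4", "5"]))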
2022
DENTRA: Denoising and Translation Pre-training for Multilingual Machine Translation
Samta Kamboj | Sunil Kumar Sahu | Neha Sengupta
Proceedings of the Seventh Conference on Machine Translation (WMT)
In this paper, we describe our submission to WMT-2022: Large-Scale Machine Translation Evaluation for African Languages under the Constrained Translation track. We introduce DENTRA, a novel pre-training strategy for a multilingual sequence-to-sequence transformer model. DENTRA pre-training combines denoising and translation objectives to incorporate both monolingual and bitext corpora covering 24 African languages as well as English and French. To evaluate the quality of DENTRA, we fine-tuned it in two multilingual machine translation configurations, one-to-many and many-to-one. In both pre-training and fine-tuning, we employed only the datasets provided by the organizers. We compare DENTRA against a strong baseline, M2M-100, in different African multilingual machine translation scenarios and show gains in 3 out of 4 subtasks.
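The combined objective can be pictured as a single data-mixing loop feeding one sequence-to-sequence model: monolingual sentences are noised and reconstructed, while bitext pairs are translated as usual. The Python sketch below shows such a loop; the masking scheme and the 50/50 mixing ratio are illustrative assumptions, not DENTRA's published recipe.

import random

MASK = "<mask>"

def noise(tokens, mask_prob=0.35):
    """BART-style token masking: replace a random subset of tokens with <mask>."""
    return [MASK if random.random() < mask_prob else tok for tok in tokens]

def training_pairs(monolingual, bitext, denoise_ratio=0.5):
    """Yield (source, target) pairs mixing the denoising and translation objectives.

    monolingual: token lists in any covered language
    bitext:      (source_tokens, target_tokens) translation pairs
    """
    while True:
        if random.random() < denoise_ratio:
            sent = random.choice(monolingual)
            yield noise(sent), sent      # target: reconstruct the clean sentence
        else:
            yield random.choice(bitext)  # target: the ordinary translation

mono = [["हर", "भाषा", "मायने", "रखती", "है"]]
para = [(["hello", "world"], ["bonjour", "le", "monde"])]
print(next(training_pairs(mono, para)))

Since both objectives share the same encoder-decoder, the idea is that the monolingual corpora supply training signal for languages with little bitext while the translation pairs supervise the task directly.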
2020
Autoencoding Keyword Correlation Graph for Document Clustering
Billy Chiu | Sunil Kumar Sahu | Derek Thomas | Neha Sengupta | Mohammady Mahdy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Document clustering requires a deep understanding of the complex structure of long texts, in particular their intra-sentential (local) and inter-sentential (global) features. Existing representation learning models do not fully capture these features. To address this, we present a novel graph-based representation for document clustering that builds a graph autoencoder (GAE) on a Keyword Correlation Graph. The graph is constructed with topical keywords as nodes and multiple local and global features as edges. A GAE is employed to aggregate the two sets of features by learning a latent representation that can jointly reconstruct them. Clustering is then performed on the learned representations, using vector dimensions as features for inducing document classes. Extensive experiments on two datasets show that the features learned by our approach achieve better clustering performance than other existing features, including term frequency-inverse document frequency and averaged embeddings.
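The pipeline can be sketched compactly: build a keyword graph from co-occurrence, encode it with one graph-convolution layer, and decode edges with an inner product, as in the standard graph autoencoder of Kipf and Welling. The Python sketch below uses a single edge feature (sentence-level co-occurrence) and untrained weights, a deliberate simplification of the paper's richer local/global edge sets and of its training loop.

import numpy as np

def keyword_graph(docs, keywords):
    """Adjacency over keywords: an edge whenever two keywords co-occur in a sentence."""
    idx = {k: i for i, k in enumerate(keywords)}
    A = np.zeros((len(keywords), len(keywords)))
    for doc in docs:
        for sent in doc.split("."):          # crude sentence split, fine for a sketch
            present = [idx[k] for k in keywords if k in sent]
            for i in present:
                for j in present:
                    if i != j:
                        A[i, j] = 1.0
    return A

def gae_embed(A, dim=16, seed=0):
    """One GCN layer + inner-product decoder; real GAE training of W is omitted."""
    rng = np.random.default_rng(seed)
    A_hat = A + np.eye(len(A))                          # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))   # symmetric normalization
    W = rng.normal(size=(len(A), dim))                  # random weights stand in for trained ones
    Z = np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ W, 0)  # ReLU(A_norm X W) with X = I
    A_rec = 1.0 / (1.0 + np.exp(-Z @ Z.T))              # sigmoid(Z Z^T) reconstructs edges
    return Z, A_rec

docs = ["the model learns embeddings. embeddings drive clustering.",
        "graphs encode keyword relations."]
kws = ["model", "embeddings", "clustering", "graphs", "keyword"]
Z, A_rec = gae_embed(keyword_graph(docs, kws))
print(Z.shape, A_rec.shape)   # (5, 16) (5, 5)

In the paper, a reconstruction loss trains the encoder jointly over the multiple edge sets, and clustering (e.g. k-means) is then run on the learned representations.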
Co-authors
- Sunil Kumar Sahu 3
- Samta Kamboj 2
- Fajri Koto 2
- Haonan Li 2
- Preslav Nakov 2
- Utkarsh Agarwal 1
- Khalid Almubarak 1
- Aisha Alraeesi 1
- Zaid Alyafeai 1
- Timothy Baldwin 1
- Debopriyo Banerjee 1
- Junaid Hamid Bhat 1
- Shivam Chauhan 1
- Billy Chiu 1
- Mukund Choudhary 1
- Monojit Choudhury 1
- Rocktim Jyoti Das 1
- Jad Doughman 1
- Ali El Filali 1
- Samujjwal Ghosh 1
- Gurpreet Gosal 1
- Nizar Habash 1
- Xudong Han 1
- Alok Anil Jadhav 1
- Rituraj Joshi 1
- Mohammady Mahdy 1
- Parvez Mullah 1
- Rahul Pal 1
- Onkar Arun Pandit 1
- Lalit Pradhan 1
- Zainul Abedien Ahmed Quraishi 1
- Gokulakrishnan Ramakrishnan 1
- Abdelrahman Sadallah 1
- Dhruv Sahnan 1
- Sara Shatnawi 1
- Shady Shehata 1
- Avraham Sheinin 1
- Awantika Shukla 1
- Aaryamonvikram Singh 1
- Derek Thomas 1
- Natalia Vassilieva 1