Malhar Kulkarni

Also published as: Malhar A. Kulkarni

2025

An introduction to computational identification and classification of Upamā alaṇkāra
Bhakti Jadhav | Himanshu Dutta | Shruti Kanitkar | Malhar Kulkarni | Pushpak Bhattacharyya
Computational Sanskrit and Digital Humanities - World Sanskrit Conference 2025

2024

pdf bib abs

SansGPT: Advancing Generative Pre-Training in Sanskrit
Rhugved Pankaj Chaudhari | Bhakti Jadhav | Pushpak Bhattacharyya | Malhar Kulkarni
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

In the past decade, significant progress has been made in digitizing Sanskrit texts and advancing computational analysis of the language. However, efforts to advance NLP for complex semantic downstream tasks like Semantic Analogy Prediction, Named Entity Recognition, and others remain limited. This gap is mainly due to the absence of a robust, pre-trained Sanskrit model built on large-scale Sanskrit text data since this demands considerable computational resources and data preparation. In this paper, we introduce SansGPT, a generative pre-trained model that has been trained on a large corpus of Sanskrit texts and is designed to facilitate fine-tuning and development for downstream NLP tasks. We aim for this model to serve as a catalyst for advancing NLP research in Sanskrit. Additionally, we developed a custom tokenizer specifically optimized for Sanskrit text, enabling effective tokenization of compound words and making it better suited for generative tasks. Our data collection and cleaning process encompassed a wide array of available Sanskrit literature, ensuring comprehensive representation for training. We further demonstrate the model’s efficacy by fine-tuning it on Semantic Analogy Prediction and Simile Element Extraction, achieving an impressive accuracy of approximately 95.8% and 92.8%, respectively.

2023

pdf bib abs

Issues in the computational processing of Upamāalaṅkāra.
Bhakti Jadhav | Amruta Barbadikar | Amba Kulkarni | Malhar Kulkarni
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Processing and understanding of figurative speech is a challenging task for computers as well as humans. In this paper, we present a case of Upamā alaṅkāra (simile). The verbal cognition of the Upamā alaṅkāra by a human is presented as a dependency tree, which involves the identification of various components such as upamāna (vehicle), upameya (topic), sādhāran.a-dharma (common property) and upamādyotaka (word indicating similitude). This involves the repetition of elliptical elements. Further, we show, how the same dependency tree may be represented without any loss of information, even without repetition of elliptical elements. Such a representation would be useful for the computational processing of the alaṅkāras.

2021

pdf bib abs

Introduction to ProverbNet: An Online Multilingual Database of Proverbs and Comprehensive Metadata
Shreyas Pimpalgaonkar | Dhanashree Lele | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Proverbs are unique linguistic expressions used by humans in the process of communication. They are frozen expressions and have the capacity to convey deep semantic aspects of a given language. This paper describes ProverbNet, a novel online multilingual database of proverbs and comprehensive metadata equipped with a multipurpose search engine to store, explore, understand, classify and analyze proverbs and their metadata. ProverbNet has immense applications including machine translation, cognitive studies and learning tools. We have 2320 Sanskrit Proverbs and 1136 Marathi proverbs and their metadata in ProverbNet and are adding more proverbs in different languages to the network.

pdf bib abs

Cognition-aware Cognate Detection
Diptesh Kanojia | Prashant Sharma | Sayali Ghodekar | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches for the task of cognate detection use orthographic, phonetic and semantic similarity based features sets. In this paper, we propose a novel method for enriching the feature sets, with cognitive features extracted from human readers’ gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We use the collected gaze behaviour data to predict cognitive features for a larger sample and show that predicted cognitive features, also, significantly improve the task performance. We report improvements of 10% with the collected gaze features, and 12% using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.

2020

pdf bib abs

Part of Speech (POS) annotation is a significant challenge in natural language processing. The paper discusses issues and challenges faced in the process of POS annotation of the Marathi data from four domains viz., tourism, health, entertainment and agriculture. During POS annotation, a lot of issues were encountered. Some of the major ones are discussed in detail in this paper. Also, the two approaches viz., the lexical (L approach) and the functional (F approach) of POS tagging have been discussed and presented with examples. Further, some ambiguous cases in POS annotation are presented in the paper.

pdf bib abs

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages
Diptesh Kanojia | Raj Dabre | Shubham Dewangan | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 28th International Conference on Computational Linguistics

Cognates are variants of the same lexical form across different languages; for example “fonema” in Spanish and “phoneme” in English are cognates, both of which mean “a unit of sound”. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian Languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We, then, evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18% points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

pdf bib abs

Challenge Dataset of Cognates and False Friend Pairs from Indian Languages
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the Twelfth Language Resources and Evaluation Conference

Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in the English language mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends’ dataset for eleven language pairs. We also evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.

pdf bib abs

Treatment of optional forms in Mathematical modelling of Pāṇini
Anupriya Aggarwal | Malhar Kulkarni
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Pāṇini in his Aṣṭādhyāyī has written the grammar of Sanskrit in an extremely concise manner in the form of about 4000 sūtras. We have attempted to mathematically remodel the data produced by these sūtras. The mathematical modelling is a way to show that the Pāṇinian approach is a minimal method of capturing the grammatical data for Sanskrit which is a natural language. The sūtras written by Pāṇini can be written as functions, that is for a single input the function produces a single output of the form y=f(x), where x and y is the input and output respectively. However, we observe that for some input dhātus, we get multiple outputs. For such cases, we have written multivalued functions that is the functions which give two or more outputs for a single input. In other words, multivalued function is a way to represent optional output forms which are expressed in Pāṇinian grammar with the help of 3 terms i.e. vā, vibhaṣā, and anyatarasyam. Comparison between the techniques employed by Pāṇini and our notation of functions helps us understand how Pāṇinian techniques ensure brevity and terseness, hence illustrating that Pāṇinian grammar is minimal.

2019

pdf bib

An Introduction to the Textual History Tool
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Eivind Kahrs
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib

Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts
Diptesh Kanojia | Abhijeet Dubey | Malhar Kulkarni | Pushpak Bhattacharyya | Gholemreza Haffari
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib

Introduction to Sanskrit Shabdamitra: An Educational Application of Sanskrit Wordnet
Malhar Kulkarni | Nilesh Joshi | Sayali Khare | Hanumant Redkar | Pushpak Bhattacharyya
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

pdf bib abs

Utilizing Wordnets for Cognate Detection among Indian Languages
Diptesh Kanojia | Kevin Patel | Malhar Kulkarni | Pushpak Bhattacharyya | Gholemreza Haffari
Proceedings of the 10th Global Wordnet Conference

Automatic Cognate Detection (ACD) is a challenging task which has been utilized to help NLP applications like Machine Translation, Information Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose a challenge to these applications and result in a degradation of performance. In this paper, we detect cognate word pairs among ten Indian languages with Hindi and use deep learning methodologies to predict whether a word pair is cognate or not. We identify IndoWordnet as a potential resource to detect cognate word pairs based on orthographic similarity-based methods and train neural network models using the data obtained from it. We identify parallel corpora as another potential resource and perform the same experiments for them. We also validate the contribution of Wordnets through further experimentation and report improved performance of up to 26%. We discuss the nuances of cognate detection among closely related Indian languages and release the lists of detected cognates as a dataset. We also observe the behaviour of, to an extent, unrelated Indian language pairs and release the lists of detected cognates among them as well.

2018

pdf bib abs

This paper reports the work related to making Hindi Wordnet1 available as a digital resource for language learning and teaching, and the experiences and lessons that were learnt during the process. The language data of the Hindi Wordnet has been suitably modified and enhanced to make it into a language learning aid. This aid is based on modern pedagogical axioms and is aligned to the learning objectives of the syllabi of the school education in India. To make it into a comprehensive language tool, grammatical information has also been encoded, as far as these can be marked on the lexical items. The delivery of information is multi-layered, multi-sensory and is available across multiple digital platforms. The front end has been designed to offer an eye-catching user-friendly interface which is suitable for learners starting from age six onward. Preliminary testing of the tool has been done and it has been modified as per the feedbacks that were received. Above all, the entire exercise has offered gainful insights into learning based on associative networks and how knowledge based on such networks can be made available to modern learners.

2017

pdf bib

Hindi Shabdamitra: A Wordnet based E-Learning Tool for Language Learning and Teaching
Hanumant Redkar | Sandhya Singh | Dhara Gorasia | Meenakshi Somasundaram | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib abs

Hindi Shabdamitra: A Wordnet based E-Learning Tool for Language Learning and Teaching
Hanumant Redkar | Sandhya Singh | Meenakshi Somasundaram | Dhara Gorasia | Malhar Kulkarni | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

In today’s technology driven digital era, education domain is undergoing a transformation from traditional approaches to more learner controlled and flexible methods of learning. This transformation has opened the new avenues for interdisciplinary research in the field of educational technology and natural language processing in developing quality digital aids for learning and teaching. The tool presented here - Hindi Shabhadamitra, developed using Hindi Wordnet for Hindi language learning, is one such e-learning tool. It has been developed as a teaching and learning aid suitable for formal school based curriculum and informal setup for self learning users. Besides vocabulary, it also provides word based grammar along with images and pronunciation for better learning and retention. This aid demonstrates that how a rich lexical resource like wordnet can be systematically remodeled for practical usage in the educational domain.

2016

pdf bib

Use of Features for Accentuation of ghañanta Words
Samir Janardan Sohoni | Malhar A. Kulkarni
Proceedings of the 13th International Conference on Natural Language Processing

pdf bib

pdf bib abs

Adverbs in Sanskrit Wordnet
Tanuja Ajotikar | Malhar Kulkarni
Proceedings of the 8th Global WordNet Conference (GWC)

The wordnet contains part-of-speech categories such as noun, verb, adjective and adverb. In Sanskrit, there is no formal distinction among nouns, adjectives and adverbs. This poses the question, is an adverb a separate category in Sanskrit? If not, then how do we accommodate it in a lexical resource? To investigate the issue, we attempt to study the complex nature of adverbs in Sanskrit and the policies adopted by Sanskrit lexicographers that would guide us in storing them in the Sanskrit wordnet.

pdf bib abs

Samāsa or compounds are a regular feature of Indian Languages. They are also found in other languages like German, Italian, French, Russian, Spanish, etc. Compound word is constructed from two or more words to form a single word. The meaning of this word is derived from each of the individual words of the compound. To develop a system to generate, identify and interpret compounds, is an important task in Natural Language Processing. This paper introduces a web based tool - Samāsa-Kartā for producing compound words. Here, the focus is on Sanskrit language due to its richness in usage of compounds; however, this approach can be applied to any Indian language as well as other languages. IndoWordNet is used as a resource for words to be compounded. The motivation behind creating compound words is to create, to improve the vocabulary, to reduce sense ambiguity, etc. in order to enrich the WordNet. The Samāsa-Kartā can be used for various applications viz., compound categorization, sandhi creation, morphological analysis, paraphrasing, synset creation, etc.