Ashwini Vaidya

2026

The Lock, Stock, and Barrel of Marathi Multiwords
Aakanksha Padhye | Ashwini Vaidya
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

Multiword expressions are an important area of study in linguistics and natural language processing as they represent combinations of words that function as a single unit, and display properties that cannot be predicted fully from their individual components. This paper describes annotated corpora of about 3000 multiword expressions across syntactic categories in Marathi. This is the first exhaustive resource for Marathi which includes both verbal and non-verbal multiwords. In order to develop the guidelines for annotation, we have used the existing literature on the identification and classification of these expressions. Following the PARSEME 2.0 guidelines, we discuss the categories of multiwords and their behaviour in the corpus. Throughout the annotation process, we encounter variability in compositionality and syntactic realization and discuss our design decisions during annotation. Such a dataset will further our understanding of how grammatical structure can be integrated with lexically stored multiword units in Marathi.

2025


Investigating the Probability of External Causation in Hindi Light Verb Constructions
Kanishka Jain | Ashwini Vaidya
Proceedings of the Society for Computation in Linguistics 2025


A Benchmark for Hindi Verb-Argument Structure Alternations
Kanishka Jain | Ashwini Vaidya
Findings of the Association for Computational Linguistics: EMNLP 2025

In this paper we introduce a Hindi verb alternations benchmark to investigate whether pretrained large language models (LLMs) can infer the frame-selectional properties of Hindi verbs. Our benchmark consists of minimal pairs such as ‘Tina cut the wood’/*‘Tina disappeared the wood’. We create four variants of these alternations for Hindi to test knowledge of verbal morphology and argument case-marking. Our results show that a masked monolingual model performs the best, while causal models fare poorly. We further test the quality of the predictions using a cloze-style sentence completion task. While the models appear to infer the right mapping between verbal morphology and valency in the acceptability task, they do not generate the right verbal morphology in the cloze task. The model completions also lack pragmatic and world knowledge, crucial for making generalizations about verbal alternations. Our work points towards the need for more cross-linguistic research of verbal alternations.


A Brief Overview of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL)
Kengatharaiyer Sarveswaran | Surendrabikram Thapa | Sana Shams | Ashwini Vaidya | Bal Krishna Bal
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

In this paper, we provide a brief summary of the inaugural workshop on Challenges in Processing South Asian Languages (CHiPSAL) held as part of COLING 2025. The workshop included regular papers, invited keynotes, and shared task papers, fostering a collaborative platform for exploring challenges in processing South Asian languages. The shared task focused on Devanagari-script language understanding, encompassing subtasks on language identification, hate speech detection, and target classification. This workshop series aims to address linguistic and cultural nuances, resource constraints, and orthographic complexities in low-resource South Asian languages while advancing NLP research and promoting multilingual inclusivity.


Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Kengatharaiyer Sarveswaran | Ashwini Vaidya | Bal Krishna Bal | Sana Shams | Surendrabikram Thapa
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

2024


Revisiting VMWEs in Hindi: Annotating Layers of Predication
Kanishka Jain | Ashwini Vaidya
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

Multiword expressions in languages like Hindi are both productive and challenging. Hindi not only uses a variety of verbal multiword expressions (VMWEs) but also employs different combinatorial strategies to create new types of multiword expressions. In this paper we investigate two such strategies that are quite common in the language. First, we show that VMWEs in Hindi are not just lexical but also morphological. Causatives are formed morphologically in Hindi. Second, we examine stacked VMWEs, i.e. cases where at least two VMWEs occur together. We suggest that the existing PARSEME annotation framework can be extended to these two phenomena without changing the existing guidelines. We also propose rule-based heuristics using existing Universal Dependency annotations to automatically identify and annotate some of the VMWEs in the language. The goal of this paper is to refine the existing PARSEME corpus of Hindi for VMWEs while expanding its scope, giving a more comprehensive picture of VMWEs in Hindi.

2021


Fine-tuning Distributional Semantic Models for Closely-Related Languages
Kushagra Bhatia | Divyanshu Aggarwal | Ashwini Vaidya
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper we compare the performance of three models: SGNS (skip-gram negative sampling) and augmented versions of SVD (singular value decomposition) and PPMI (Positive Pointwise Mutual Information) on a word similarity task. We particularly focus on the role of hyperparameter tuning for Hindi based on recommendations made in previous work (on English). Our results show that there are language specific preferences for these hyperparameters. We extend the best settings for Hindi to a set of related languages: Punjabi, Gujarati and Marathi with favourable results. We also find that a suitably tuned SVD model outperforms SGNS for most of our languages and is also more robust in a low-resource setting.


2020

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.


2019


Syntactic composition and selectional preferences in Hindi Light Verb Constructions
Ashwini Vaidya | Martha Palmer
Linguistic Issues in Language Technology, Volume 17, 2019

Previous work on light verb constructions (e.g. chorii kar ‘theft do; steal’) in Hindi describes their syntactic formation via co-predication (Ahmed et al., 2012, Butt, 2014). This implies that both noun and light verb contribute their arguments, and these overlapping argument structures must be composed in the syntax. In this paper, we present a co-predication analysis using Tree-Adjoining Grammar, which models syntactic composition and semantic selectional preferences without transformations (deletion or argument identification). The analysis has two key components (i) an underspecified category for the nominal and (ii) combinatorial constraints on the noun and light verb to specify selectional preferences. The former has the advantage of syntactic composition without argument identification and the latter prevents over-generalization, while recognizing the semantic contribution of both predicates. This work additionally accounts for the agreement facts for the Hindi LVC.


Towards measuring lexical complexity in Malayalam
Richard Shallam | Ashwini Vaidya
Proceedings of the 16th International Conference on Natural Language Processing

This paper proposes a metric to quantify lexical complexity in Malayalam. The metric utilizes word frequency, orthography and morphology as the three factors affecting visual word recognition in Malayalam. Malayalam differs from other Indian languages due to its agglutinative morphology and orthography, which are incorporated into our model. The predictions made by our model are then evaluated against reaction times in a lexical decision task. We find that reaction times are predicted by frequency, morphological complexity and script complexity. We also explore the interactions between morphological complexity with frequency and script in our results. To the best of our knowledge, this is the first study on lexical complexity in Malayalam.

2018

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

2017


Understanding Constraints on Non-Projectivity Using Novel Measures
Himanshu Yadav | Ashwini Vaidya | Samar Husain
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016


This paper describes our efforts for the development of a Proposition Bank for Urdu, an Indo-Aryan language. Our primary goal is the labeling of syntactic nodes in the existing Urdu dependency Treebank with specific argument labels. In essence, it involves annotation of predicate argument structures of both simple and complex predicates in the Treebank corpus. We describe the overall process of building the PropBank of Urdu. We discuss various statistics pertaining to the Urdu PropBank and the issues which the annotators encountered while developing the PropBank. We also discuss how these challenges were addressed to successfully expand the PropBank corpus. While reporting the inter-annotator agreement between the two annotators, we show that the annotators share a similar understanding of the annotation guidelines and of the linguistic phenomena present in the language. The present size of this PropBank is around 180,000 tokens, double-propbanked by the two annotators for simple predicates. Another 100,000 tokens have been annotated for complex predicates of Urdu.


Linguistic features for Hindi light verb construction identification
Ashwini Vaidya | Sumeet Agarwal | Martha Palmer
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Light verb constructions (LVC) in Hindi are highly productive. If we can distinguish a case such as nirnay lenaa ‘decision take; decide’ from an ordinary verb-argument combination kaagaz lenaa ‘paper take; take (a) paper’, it has been shown to aid NLP applications such as parsing (Begum et al., 2011) and machine translation (Pal et al., 2011). In this paper, we propose an LVC identification system using language specific features for Hindi which shows an improvement over previous work (Begum et al., 2011). To build our system, we carry out a linguistic analysis of Hindi LVCs using Hindi Treebank annotations and propose two new features that are aimed at capturing the diversity of Hindi LVCs in the corpus. We find that our model performs robustly across a diverse range of LVCs and our results underscore the importance of semantic features, which is in keeping with the findings for English. Our error analysis also demonstrates that our classifier can be used to further refine LVC annotations in the Hindi Treebank and make them more consistent across the board.

This paper examines both the linguistic behavior and the practical implications of empty argument insertion in the Hindi PropBank. The Hindi PropBank is annotated on the Hindi Dependency Treebank, which contains some empty categories but not the empty arguments of verbs. In this paper, we analyze four kinds of empty arguments, *PRO*, *REL*, *GAP*, *pro*, and suggest effective ways of annotating these arguments. Empty arguments such as *PRO* and *REL* can be inserted deterministically; we present linguistically motivated rules that automatically insert these arguments with high accuracy. On the other hand, it is difficult to find deterministic rules to insert *GAP* and *pro*; for these arguments, we introduce a new annotation scheme that concurrently handles both semantic role labeling and empty category insertion, producing fast and high quality annotation. In addition, we present algorithms for finding antecedents of *REL* and *PRO*, and discuss why finding antecedents for some types of *PRO* is difficult.

2011


Analysis of the Hindi Proposition Bank using Dependency Structure
Ashwini Vaidya | Jinho Choi | Martha Palmer | Bhuvana Narasimhan
Proceedings of the 5th Linguistic Annotation Workshop

2010



We are in the process of creating a multi-representational and multi-layered treebank for Hindi/Urdu (Palmer et al., 2009), which has three main layers: dependency structure, predicate-argument structure (PropBank), and phrase structure. This paper discusses an important issue in treebank design which is often neglected: the use of empty categories (ECs). All three levels of representation make use of ECs. We make a high-level distinction between two types of ECs, trace and silent, on the basis of whether they are postulated to mark displacement or not. Each type is further refined into several subtypes based on the underlying linguistic phenomena which the ECs are introduced to handle. This paper discusses the stages at which we add ECs to the Hindi/Urdu treebank and why. We investigate methodically the different types of ECs and their role in our syntactic and semantic representations. We also examine our decisions whether or not to coindex each type of ECs with other elements in the representation.