Sumeet Agarwal


2024

Interference Predicts Locality: Evidence from an SOV language
Sidharth Ranjan | Sumeet Agarwal | Rajakrishnan Rajkumar
Proceedings of the Society for Computation in Linguistics 2024

2022

Discourse Context Predictability Effects in Hindi Word Order
Sidharth Ranjan | Marten van Schijndel | Sumeet Agarwal | Rajakrishnan Rajkumar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We test the hypothesis that discourse predictability influences Hindi syntactic choice. While prior work has shown that a number of factors (e.g., information status, dependency length, and syntactic surprisal) influence Hindi word order preferences, the role of discourse predictability is underexplored in the literature. Inspired by prior work on syntactic priming, we investigate how the words and syntactic structures in a sentence influence the word order of the following sentences. Specifically, we extract sentences from the Hindi-Urdu Treebank corpus (HUTB), permute the preverbal constituents of those sentences, and build a classifier to distinguish the sentences that actually occurred in the corpus from artificially generated distractors. The classifier uses a number of discourse-based and cognitive features to make its predictions, including dependency length, surprisal, and information status. We find that information status and LSTM-based discourse predictability influence word order choices, especially for non-canonical object-fronted orders. We conclude by situating our results within the broader syntactic priming literature.
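As a rough illustration of the setup this abstract describes, the sketch below (Python, with toy data) generates all orderings of a sentence's preverbal constituents and fits a classifier to separate the attested order from the distractors. The constituent strings, placeholder features, and classifier configuration are illustrative assumptions, not the paper's actual HUTB pipeline.

```python
import random
from itertools import permutations

from sklearn.linear_model import LogisticRegression

def preverbal_variants(constituents, verb):
    """Yield every ordering of the preverbal constituents, verb kept final."""
    for order in permutations(constituents):
        yield list(order) + [verb]

# Toy SOV sentence: [subject, object, adjunct] + verb.
constituents = ["raam-ne", "kitaab", "kal"]
verb = "padhii"
variants = list(preverbal_variants(constituents, verb))

# Stand-in feature vectors; the paper's features (dependency length,
# surprisal, information status) would go here.
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in variants]
# Label 1 for the attested corpus order, 0 for the generated distractors.
y = [int(v == constituents + [verb]) for v in variants]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # probability each variant is attested
```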

Dual Mechanism Priming Effects in Hindi Word Order
Sidharth Ranjan | Marten van Schijndel | Sumeet Agarwal | Rajakrishnan Rajkumar
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Word order choices during sentence production can be primed by preceding sentences. In this work, we test the DUAL MECHANISM hypothesis that priming is driven by multiple different sources. Using a Hindi corpus of text productions, we model lexical priming with an n-gram cache model, and we capture more abstract syntactic priming with an adaptive neural language model. We permute the preverbal constituents of corpus sentences and then use a logistic regression model to distinguish the sentences that actually occurred in the corpus from artificially generated meaning-equivalent variants. Our results indicate that lexical priming and lexically-independent syntactic priming affect complementary sets of verb classes. By showing that different priming influences are separable from one another, our results support the hypothesis that multiple different cognitive mechanisms underlie priming.
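The n-gram cache component mentioned above can be sketched as a smoothed bigram backbone interpolated with a recency cache that boosts recently produced words. The toy corpus, cache size, and mixing weight below are illustrative assumptions, not the paper's settings.

```python
from collections import Counter, deque

class CacheBigramLM:
    """Bigram model interpolated with a unigram cache of recent words."""

    def __init__(self, corpus, cache_size=100, lam=0.8):
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)
        self.cache = deque(maxlen=cache_size)
        self.lam = lam  # weight on the static bigram backbone

    def prob(self, word, prev):
        # Backbone bigram probability with add-one smoothing.
        v = len(self.unigrams)
        p_ngram = (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + v)
        # Cache probability: relative frequency in recent history.
        p_cache = self.cache.count(word) / len(self.cache) if self.cache else 0.0
        return self.lam * p_ngram + (1 - self.lam) * p_cache

    def observe(self, word):
        self.cache.append(word)

tokens = "the boy read the book the boy liked".split()
lm = CacheBigramLM(tokens)
for w in ["the", "boy"]:
    lm.observe(w)
print(lm.prob("boy", "the"))  # cache boosts recently used words
```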

Linguistic Complexity and Planning Effects on Word Duration in Hindi Read Aloud Speech
Sidharth Ranjan | Rajakrishnan Rajkumar | Sumeet Agarwal
Proceedings of the Society for Computation in Linguistics 2022

2021

Effects of Duration, Locality, and Surprisal in Speech Disfluency Prediction in English Spontaneous Speech
Samvit Dammalapati | Rajakrishnan Rajkumar | Sidharth Ranjan | Sumeet Agarwal
Proceedings of the Society for Computation in Linguistics 2021

Can RNNs trained on harder subject-verb agreement instances still perform well on easier ones?
Hritik Bansal | Gantavya Bhatt | Sumeet Agarwal
Proceedings of the Society for Computation in Linguistics 2021

2020

How much complexity does an RNN architecture need to learn syntax-sensitive dependencies?
Gantavya Bhatt | Hritik Bansal | Rishubh Singh | Sumeet Agarwal
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Long short-term memory (LSTM) networks and their variants are capable of encapsulating long-range dependencies, which is evident from their performance on a variety of linguistic tasks. On the other hand, simple recurrent networks (SRNs), which appear more biologically grounded in terms of synaptic connections, have generally been less successful at capturing long-range dependencies as well as the loci of grammatical errors in an unsupervised setting. In this paper, we seek to develop models that bridge the gap between biological plausibility and linguistic competence. We propose a new architecture, the Decay RNN, which incorporates the decaying nature of neuronal activations and models the excitatory and inhibitory connections in a population of neurons. Besides its biological inspiration, our model also shows competitive performance relative to LSTMs on subject-verb agreement, sentence grammaticality, and language modeling tasks. These results provide some pointers towards probing the nature of the inductive biases required for RNN architectures to model linguistic phenomena successfully.
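One hedged reading of the decay idea described above is a hidden state that is a convex combination of its decayed previous value and a fresh activation. The PyTorch sketch below implements that reading with an illustrative retention factor and sizes; it omits the paper's excitatory/inhibitory weight constraints, and the exact update rule in the paper may differ.

```python
import torch
import torch.nn as nn

class DecayRNNCell(nn.Module):
    """Recurrent cell whose state decays toward a new activation."""

    def __init__(self, input_size, hidden_size, alpha=0.9):
        super().__init__()
        self.alpha = alpha  # retention factor for the previous state
        self.w_ih = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h):
        # Decayed previous activation mixed with the new input drive.
        return self.alpha * h + (1 - self.alpha) * torch.relu(
            self.w_ih(x) + self.w_hh(h)
        )

cell = DecayRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)
for x in torch.randn(5, 1, 8):  # a toy 5-step input sequence
    h = cell(x, h)
print(h.shape)  # torch.Size([1, 16])
```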

2019

Expectation and Locality Effects in the Prediction of Disfluent Fillers and Repairs in English Speech
Samvit Dammalapati | Rajakrishnan Rajkumar | Sumeet Agarwal
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

This study examines the role of three influential theories of language processing, viz., Surprisal Theory, the Uniform Information Density (UID) hypothesis, and Dependency Locality Theory (DLT), in predicting disfluencies in speech production. To this end, we incorporate features based on lexical surprisal, word duration, and DLT integration and storage costs into logistic regression classifiers aimed at predicting disfluencies in the Switchboard corpus of English conversational speech. We find that disfluencies occur in the face of upcoming difficulties and that speakers tend to handle this by lessening cognitive load before disfluencies occur. Further, we see that reparanda behave differently from disfluent fillers, possibly because the lessening of cognitive load also happens in the word choice of the reparandum, i.e., in the disfluency itself. While the UID hypothesis does not seem to play a significant role in disfluency prediction, lexical surprisal and DLT costs do give promising results in explaining language production. Finally, we find that, as a means of lessening cognitive load before upcoming difficulties, speakers take more time on words preceding disfluencies, making duration a key element in understanding disfluencies.
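For concreteness, the sketch below shows one simplified way to compute a DLT-style integration cost of the kind used as a feature above: counting discourse referents (approximated here as nouns and verbs) that intervene between a word and the head it attaches to. The POS inventory and counting conventions are illustrative simplifications of DLT.

```python
def integration_cost(pos_tags, dep_index, head_index):
    """Count intervening discourse referents between dependent and head."""
    lo, hi = sorted((dep_index, head_index))
    referents = {"NOUN", "VERB"}
    return sum(1 for tag in pos_tags[lo + 1:hi] if tag in referents)

# "the reporter who the senator attacked admitted the error"
pos = ["DET", "NOUN", "PRON", "DET", "NOUN", "VERB", "VERB", "DET", "NOUN"]
# Integrating "admitted" (index 6) with its subject "reporter" (index 1)
# crosses "senator" and "attacked": cost 2.
print(integration_cost(pos, 1, 6))  # -> 2
```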

Surprisal and Interference Effects of Case Markers in Hindi Word Order
Sidharth Ranjan | Sumeet Agarwal | Rajakrishnan Rajkumar
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Based on the Production-Distribution-Comprehension (PDC) account of language processing, we formulate two distinct hypotheses about case marking, word order choices, and processing in Hindi. Our first hypothesis is that Hindi tends to optimize for processing efficiency at both lexical and syntactic levels. We quantify the role of case markers in this process. For the task of predicting the reference sentence occurring in a corpus (amidst meaning-equivalent grammatical variants) using a machine learning model, surprisal estimates from an artificial version of the language (i.e., Hindi without any case markers) result in lower prediction accuracy compared to natural Hindi. Our second hypothesis is that Hindi tends to minimize interference due to case markers while ordering preverbal constituents. We show that Hindi tends to avoid placing constituents whose heads are marked by identical case inflections next to each other. Our findings are consistent with PDC assumptions, and we discuss their implications for language production, learning, and universals.
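The interference measure implied by the second hypothesis can be sketched directly: count how many adjacent pairs of preverbal constituents carry identical case markers on their heads. The constituent representation and marker inventory below are illustrative assumptions.

```python
def identical_case_adjacencies(constituents):
    """constituents: list of (text, case_marker) pairs in surface order."""
    return sum(
        1
        for (_, m1), (_, m2) in zip(constituents, constituents[1:])
        if m1 is not None and m1 == m2
    )

# Toy example: the two adjacent -ko marked constituents interfere.
order = [("raam-ne", "ne"), ("mohan-ko", "ko"), ("kitaab-ko", "ko")]
print(identical_case_adjacencies(order))  # -> 1
```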

2018

SandhiKosh: A Benchmark Corpus for Evaluating Sanskrit Sandhi Tools
Shubham Bhardwaj | Neelamadhav Gantayat | Nikhil Chaturvedi | Rahul Garg | Sumeet Agarwal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Uniform Information Density Effects on Syntactic Choice in Hindi
Ayush Jain | Vishal Singh | Sidharth Ranjan | Rajakrishnan Rajkumar | Sumeet Agarwal
Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing

According to the UNIFORM INFORMATION DENSITY (UID) hypothesis (Levy and Jaeger, 2007; Jaeger, 2010), speakers tend to distribute information density uniformly across the signal while producing language. The prior works cited above studied syntactic reduction in language production at particular choice points in a sentence. In contrast, we use a variant of the above UID hypothesis to investigate the extent to which word order choices in Hindi are influenced by the drive to minimize the variance of information across entire sentences. To this end, we propose multiple lexical and syntactic measures (at both word and constituent levels) to capture the uniform spread of information across a sentence. Subsequently, we incorporate these measures into machine learning models aimed at distinguishing between a naturally occurring corpus sentence and its grammatical variants (expressing the same idea). Our results indicate that our UID measures are not a significant factor in predicting the corpus sentence in the presence of lexical surprisal, a competing control predictor. Finally, in light of other recent work, we conclude with a discussion of why UID may not be suitable as a theory of word order.
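One natural instantiation of a sentence-level UID measure as described above is the variance of per-word surprisal: lower variance means a more uniform spread of information. The sketch below uses made-up word probabilities purely for illustration.

```python
import math

def uid_variance(word_probs):
    """Variance of per-word surprisal (in bits) across a sentence."""
    surprisals = [-math.log2(p) for p in word_probs]
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

# Two hypothetical variants of one sentence: the flatter surprisal
# profile scores lower (more uniform information density).
print(uid_variance([0.25, 0.25, 0.25]))  # -> 0.0
print(uid_variance([0.5, 0.05, 0.8]))    # larger variance
```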

2016

Linguistic features for Hindi light verb construction identification
Ashwini Vaidya | Sumeet Agarwal | Martha Palmer
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Light verb constructions (LVCs) in Hindi are highly productive. Distinguishing a case such as nirnay lenaa ‘decision take; decide’ from an ordinary verb-argument combination such as kaagaz lenaa ‘paper take; take (a) paper’ has been shown to aid NLP applications such as parsing (Begum et al., 2011) and machine translation (Pal et al., 2011). In this paper, we propose an LVC identification system using language-specific features for Hindi which shows an improvement over previous work (Begum et al., 2011). To build our system, we carry out a linguistic analysis of Hindi LVCs using Hindi Treebank annotations and propose two new features aimed at capturing the diversity of Hindi LVCs in the corpus. We find that our model performs robustly across a diverse range of LVCs, and our results underscore the importance of semantic features, which is in keeping with the findings for English. Our error analysis also demonstrates that our classifier can be used to further refine LVC annotations in the Hindi Treebank and make them more consistent across the board.