Rupak Sarkar


2023

Natural Language Decompositions of Implicit Content Enable Better Text Representations
Alexander Hoyle | Rupak Sarkar | Pranav Goel | Philip Resnik
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

When people interpret text, they rely on inferences that go beyond the observed language itself. Inspired by this observation, we introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed, then validate the plausibility of the generated content via human judgments. Incorporating these explicit representations of implicit content proves useful in multiple problem settings that involve the human interpretation of utterances: assessing the similarity of arguments, making sense of a body of opinion data, and modeling legislative behavior. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP and particularly its applications to social science.

2022

Are Neural Topic Models Broken?
Alexander Miserlis Hoyle | Pranav Goel | Rupak Sarkar | Philip Resnik
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, the relationship between automated and human evaluation of topic models has been called into question. Method developers have staked the efficacy of new topic model variants on automated measures, and their failure to approximate human preferences places these models on uncertain ground. Moreover, existing evaluation paradigms are often divorced from real-world use. Motivated by content analysis as a dominant real-world use case for topic modeling, we analyze two related aspects of topic models that affect their effectiveness and trustworthiness in practice for that purpose: the stability of their estimates and the extent to which the model’s discovered categories align with human-determined categories in the data. We find that neural topic models fare worse in both respects compared to an established classical method. We take a step toward addressing both issues in tandem by demonstrating that a straightforward ensembling method can reliably outperform the members of the ensemble.

2021

Empathy and Hope: Resource Transfer to Model Inter-country Social Media Dynamics
Clay H. Yoo | Shriphani Palakodety | Rupak Sarkar | Ashiqur KhudaBukhsh
Proceedings of the 1st Workshop on NLP for Positive Impact

The ongoing COVID-19 pandemic resulted in significant ramifications for international relations, ranging from travel restrictions and global ceasefires to international vaccine production and sharing agreements. Amidst a wave of infections in India that resulted in a systemic breakdown of healthcare infrastructure, a social welfare organization based in Pakistan offered to procure medical-grade oxygen to assist India, a nation that was involved in four wars with Pakistan in the past few decades. In this paper, we focus on Pakistani Twitter users’ response to the ongoing healthcare crisis in India. While #IndiaNeedsOxygen and #PakistanStandsWithIndia featured among the top-trending hashtags in Pakistan, divisive hashtags such as #EndiaSaySorryToKashmir simultaneously started trending. Against the backdrop of a contentious history including four wars, divisive content of this nature, especially when a country is facing an unprecedented healthcare crisis, fuels further deterioration of relations. In this paper, we define a new task of detecting supportive content and demonstrate that existing NLP-for-social-impact tools can be effectively harnessed for such tasks within a quick turnaround time. We also release the first publicly available data set at the intersection of geopolitical relations and a raging pandemic in the context of India and Pakistan.

2020

Social Media Attributions in the Context of Water Crisis
Rupak Sarkar | Sayantan Mahinder | Hirak Sarkar | Ashiqur KhudaBukhsh
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Attribution of natural disasters/collective misfortune is a widely-studied political science problem. However, such studies typically rely on surveys, expert opinions, or external signals such as voting outcomes. In this paper, we explore the viability of using unstructured, noisy social media data to complement traditional surveys by automatically extracting attribution factors. We present a novel prediction task, attribution tie detection, which identifies the factors (e.g., poor city planning, exploding population, etc.) held responsible for the crisis in a social media document. We focus on the 2019 Chennai water crisis that rapidly escalated into a discussion topic with global importance following alarming water-crisis statistics. On a challenging data set constructed from YouTube comments (72,098 comments posted by 43,859 users on 623 videos relevant to the crisis), we present a neural baseline to identify attribution ties that achieves reasonable performance (accuracy: 87.34% on attribution detection and 81.37% on attribution resolution). We release the first annotated data set of 2,500 comments in this important domain.

The Non-native Speaker Aspect: Indian English in Social Media
Rupak Sarkar | Sayantan Mahinder | Ashiqur KhudaBukhsh
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

As the largest institutionalized second language variety of English, Indian English has received a sustained focus from linguists for decades. However, to the best of our knowledge, no prior study has contrasted web-expressions of Indian English in noisy social media with English generated by a social media user base that consists predominantly of native speakers. In this paper, we address this gap in the literature by conducting a comprehensive analysis considering multiple structural and semantic aspects. In addition, we propose a novel application of language models to perform automatic linguistic quality assessment.