Arjun Mukherjee

2025

Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection against LLM-Generated Threats
Sadat Shahriar | Navid Ayoobi | Arjun Mukherjee | Mostafa Musharrat | Sai Vishnu Vamsi Senagasetty
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, and show an improvement of up to 27%.

The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas after Iterative Paraphrasing?
Sadat Shahriar | Navid Ayoobi | Arjun Mukherjee
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs’ research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human vs LLM-generated *scientific ideas* remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.

2023

Tackling the Myriads of Collusion Scams on YouTube Comments of Cryptocurrency Videos
Sadat Shahriar | Arjun Mukherjee
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Despite repeated measures, YouTube’s comment section has been a fertile ground for scammers. With the growth of the cryptocurrency market and obscurity around it, a new form of scam, namely “Collusion Scam”, has emerged as a dominant force within YouTube’s comment space. Unlike typical scams and spams, collusion scams employ a cunning persuasion strategy, using the facade of genuine social interactions within comment threads to create an aura of trust and success to entrap innocent users. In this research, we collect 1,174 such collusion scam threads and perform a detailed analysis, which is tailored towards the successful detection of these scams. We find that utilization of the collusion dynamics can provide an accuracy of 96.67% and an F1-score of 93.04%. Furthermore, we demonstrate the robust predictive power of metadata associated with these threads and user channels, which act as compelling indicators of collusion scams. Finally, we show that modern LLMs, like ChatGPT, can effectively detect collusion scams without the need for any training.

Exploring Deceptive Domain Transfer Strategies: Mitigating the Differences among Deceptive Domains
Sadat Shahriar | Arjun Mukherjee | Omprakash Gnawali
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Deceptive text poses a significant threat to users, resulting in widespread misinformation and disorder. While researchers have created numerous cutting-edge techniques for detecting deception in domain-specific settings, whether there is a generic deception pattern so that deception-related knowledge in one domain can be transferred to another remains mostly unexplored. Moreover, the disparities in textual expression across these many mediums pose an additional obstacle for generalization. To this end, we present a Multi-Task Learning (MTL)-based deception generalization strategy to reduce the domain-specific noise and facilitate a better understanding of deception via generalized training. As deceptive domains, we use News (fake news), Tweets (rumors), and Reviews (fake reviews) and employ LSTM and BERT models to incorporate domain transfer techniques. Our proposed architecture for the combined approach of domain-independent and domain-specific training improves the deception detection performance by up to 5.28% in F1-score.

2022

COIN – an Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings
Andrew Schneider | Lihong He | Zhijia Chen | Arjun Mukherjee | Eduard Dragut
Proceedings of the 29th International Conference on Computational Linguistics

Social media is the ultimate challenge for many natural language processing tools. The constant emergence of linguistic constructs challenges even the most sophisticated NLP tools. Predicting word embeddings for out of vocabulary words is one of those challenges. Word embedding models only include terms that occur a sufficient number of times in their training corpora. Word embedding vector models are unable to directly provide any useful information about a word not in their vocabularies. We propose a fast method for predicting vectors for out of vocabulary terms that makes use of the surrounding terms of the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it surpasses several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is very small, retaining competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state-of-the-art on the word vector prediction problem.

2021

Claim Verification Using a Multi-GAN Based Model
Amartya Hatua | Arjun Mukherjee | Rakesh Verma
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This article describes research on claim verification carried out using a multiple GAN-based model. The proposed model consists of three pairs of generators and discriminators. The generator and discriminator pairs are responsible for generating synthetic data for supported and refuted claims and claim labels. A theoretical discussion about the proposed model is provided to validate the equilibrium state of the model. The proposed model is applied to the FEVER dataset, and a pre-trained language model is used for the input text data. The synthetically generated data helps to gain information that improves classification performance over state-of-the-art baselines. The respective F1 scores after applying the proposed method on FEVER 1.0 and FEVER 2.0 datasets are 0.65 ± 0.018 and 0.65 ± 0.051.

On the Usefulness of Personality Traits in Opinion-oriented Tasks
Marjan Hosseinia | Eduard Dragut | Dainis Boumber | Arjun Mukherjee
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We use a deep bidirectional transformer to extract the Myers-Briggs personality type from user-generated data in a multi-label and multi-class classification setting. Our dataset is large and made up of three available personality datasets of various social media platforms including Reddit, Twitter, and Personality Cafe forum. We induce personality embeddings from our transformer-based model and investigate if they can be used for downstream text classification tasks. Experimental evidence shows that personality embeddings are effective in three classification tasks including authorship verification, stance, and hyperpartisan detection. We also provide novel and interpretable analysis for the third task: hyperpartisan news classification.

A Domain-Independent Holistic Approach to Deception Detection
Sadat Shahriar | Arjun Mukherjee | Omprakash Gnawali
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Deception in text can take different forms in different domains, including fake news, rumor tweets, and spam emails. Irrespective of the domain, the main intent of deceptive text is to deceive the reader. Although domain-specific deception detection exists, domain-independent deception detection can provide a holistic picture, which can be crucial to understanding how deception occurs in text. In this paper, we detect deception in a domain-independent setting using deep learning architectures. Our method outperforms the state-of-the-art performance on most benchmark datasets with an overall accuracy of 93.42% and F1-Score of 93.22%. The domain-independent training allows us to capture subtler nuances of deceptive writing style. Furthermore, we analyze how much in-domain data may be helpful to accurately detect deception, especially for the cases where data may not be readily available to train. Our results and analysis indicate that there may be a universal pattern of deception lying within the text independent of the domain, which can create a novel area of research and open up new avenues in the field of deception detection.

pdf bib abs

Opinion prediction is an emerging research area with diverse real-world applications, such as market research and situational awareness. We identify two lines of approaches to the problem of opinion prediction. One uses topic-based sentiment analysis with time-series modeling, while the other uses static embedding of text. The latter approaches seek user-specific solutions by generating user fingerprints. Such approaches are useful in predicting a user’s reactions to unseen content. In this work, we propose a novel dynamic fingerprinting method that leverages contextual embedding of a user’s comments conditioned on the user’s relevant reading history. We integrate BERT variants with a recurrent neural network to generate predictions. The results show up to 13% improvement in micro F1-score compared to previous approaches. Experimental results show novel insights that were previously unknown, such as better predictions for an increase in dynamic history length and the impact of the nature of the article on performance, thereby laying the foundation for further research.

Improving Evidence Retrieval with Claim-Evidence Entailment
Fan Yang | Eduard Dragut | Arjun Mukherjee
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Claim verification is challenging because it requires first finding textual evidence and then applying claim-evidence entailment to verify a claim. Previous works evaluate the entailment step based on the retrieved evidence, whereas we hypothesize that the entailment prediction can provide useful signals for evidence retrieval, in the sense that if a sentence supports or refutes a claim, the sentence must be relevant. We propose a novel model that uses the entailment score to express the relevancy. Our experiments verify that leveraging entailment prediction improves the ranking of multiple pieces of evidence.

2020

Predicting Personal Opinion on Future Events with Fingerprints
Fan Yang | Eduard Dragut | Arjun Mukherjee
Proceedings of the 28th International Conference on Computational Linguistics

Predicting users’ opinions in their response to social events has important real-world applications, many of which have political and social impacts. Existing approaches derive a population’s opinion on an ongoing event from large scores of user-generated content. In certain scenarios, we may not be able to acquire such content and thus cannot infer an unbiased opinion on those emerging events. To address this problem, we propose to explore opinion on unseen articles based on one’s fingerprinting: the prior reading and commenting history. This work presents a focused study on modeling and leveraging fingerprinting techniques to predict a user’s future opinion. We introduce a recurrent neural network based model that integrates fingerprinting. We collect a large dataset that consists of event-comment pairs from six news websites. We evaluate the proposed model on this dataset. The results show substantial performance gains, demonstrating the effectiveness of our approach.

Stance Prediction for Contemporary Issues: Data and Experiments
Marjan Hosseinia | Eduard Dragut | Arjun Mukherjee
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

We investigate whether pre-trained bidirectional transformers with sentiment and emotion information improve stance detection in long discussions of contemporary issues. As a part of this work, we create a novel stance detection dataset covering 419 different controversial issues and their related pros and cons collected by procon.org in nonpartisan format. Experimental results show that a shallow recurrent neural network with sentiment or emotion information can reach competitive results compared to fine-tuned BERT with 20x fewer parameters. We also use a simple approach that explains which input phrases contribute to stance detection.

2018

Attending Sentences to detect Satirical Fake News
Sohan De Sarkar | Fan Yang | Arjun Mukherjee
Proceedings of the 27th International Conference on Computational Linguistics

Satirical news detection is important in order to prevent the spread of misinformation over the Internet. Existing approaches to capture news satire use machine learning models such as SVM and hierarchical neural networks along with hand-engineered features, but do not explore sentence and document difference. This paper proposes a robust, hierarchical deep neural network approach for satire detection, which is capable of capturing satire both at the sentence level and at the document level. The architecture incorporates pluggable generic neural networks like CNN, GRU, and LSTM. Experimental results on a real-world news satire dataset show substantial performance gains, demonstrating the effectiveness of our proposed approach. An inspection of the learned models reveals the existence of key sentences that control the presence of satire in news.
