Rudra Roy

2025

Enhancing Textual Understanding: Automated Claim Span Identification in English, Hindi, Bengali, and CodeMix
Rudra Roy | Pritam Pal | Dipankar Das | Saptarshi Ghosh | Biswajit Paul
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Claim span identification, a crucial task in Natural Language Processing (NLP), aims to extract specific claims from texts. Such claim spans can be further utilized in various critical NLP applications, such as claim verification, fact-checking, and opinion mining, among others. The present work proposes a multilingual claim span identification framework for handling social media data in English, Hindi, Bengali, and CodeMixed texts, leveraging the strengths and knowledge of transformer-based pre-trained models. Our proposed framework efficiently identifies the contextual relationships between words and precisely detects claim spans across all languages, achieving a high F1 score and Jaccard score. The source code and datasets are available at: https://github.com/pritampal98/claim-span-multilingual

2024

pdf bib abs

With the advancement of natural language processing (NLP) and sophisticated Large Language Models (LLMs), distinguishing between human-written texts and machine-generated texts is quite difficult nowadays. This paper presents a systematic approach to classifying machine-generated text from human-written text with a combination of the transformer-based model and textual feature-based post-processing technique. We extracted five textual features: readability score, stop word score, spelling and grammatical error count, unique word score and human phrase count from both human-written and machine-generated texts separately and trained three machine learning models (SVM, Random Forest and XGBoost) with these scores. Along with exploring traditional machine-learning models, we explored the BiLSTM and transformer-based distilBERT models to enhance the classification performance. By training and evaluating with a large dataset containing both human-written and machine-generated text, our best-performing framework achieves an accuracy of 87.5%.

Co-authors

Urwah Jawaid 1

Biswajit Paul 1

Venues

icon1
ranlp1

Fix author