Rashi Goel

2025

pdf bib abs
Studying the Effect of Hindi Tokenizer Performance on Downstream Tasks
Rashi Goel | Fatiha Sadat
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

This paper deals with a study on the effect of training data size and tokenizer performance for Hindi language on the eventual downstream model performance and comprehension. Multiple monolingual Hindi tokenizers are trained for large language models such as BERT and intrinsic and extrinsic evaluations are performed on multiple Hindi datasets. The objective of this study is to understand the precise effects of tokenizer performance on downstream task performance to gain insight on how to develop better models for low-resource languages.

2023

pdf bib abs
RSM-NLP at BLP-2023 Task 2: Bangla Sentiment Analysis using Weighted and Majority Voted Fine-Tuned Transformers
Pratinav Seth | Rashi Goel | Komal Mathur | Swetha Vemulapalli
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This paper describes our approach to submissions made at Shared Task 2 at BLP Workshop - Sentiment Analysis of Bangla Social Media Posts. Sentiment Analysis is an action research area in the digital age. With the rapid and constant growth of online social media sites and services and the increasing amount of textual data, the application of automatic Sentiment Analysis is on the rise. However, most of the research in this domain is based on the English language. Despite being the world’s sixth most widely spoken language, little work has been done in Bangla. This task aims to promote work on Bangla Sentiment Analysis while identifying the polarity of social media content by determining whether the sentiment expressed in the text is Positive, Negative, or Neutral. Our approach consists of experimenting and finetuning various multilingual and pre-trained BERT-based models on our downstream tasks and using a Majority Voting and Weighted ensemble model that outperforms individual baseline model scores. Our system scored 0.711 for the multiclass classification task and scored 10th place among the participants on the leaderboard for the shared task. Our code is available at https://github.com/ptnv-s/RSM-NLP-BLP-Task2 .

Co-authors

Venues

Fix data