Gaurish Thakkar


2024

pdf bib
Revealing Public Opinion Sentiment Landscape: Eurovision Song Contest Sentiment Analysis
Klara Kozolic | Gaurish Thakkar | Nives Mikelic Preradovic
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2

pdf bib
FZZG at WILDRE-7: Fine-tuning Pre-trained Models for Code-mixed, Less-resourced Sentiment Analysis
Gaurish Thakkar | Marko Tadić | Nives Mikelic Preradovic
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

This paper describes our system used for a shared task on code-mixed, less-resourced sentiment analysis for Indo-Aryan languages. We are using the large language models (LLMs) since they have demonstrated excellent performance on classification tasks. In our participation in all tracks, we use unsloth/mistral-7b-bnb-4bit LLM for the task of code-mixed sentiment analysis. For track 1, we used a simple fine-tuning strategy on PLMs by combining data from multiple phases. Our trained systems secured first place in four phases out of five. In addition, we present the results achieved using several PLMs for each language.

pdf bib
M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets
Gaurish Thakkar | Sherzod Hakimov | Marko Tadić
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In recent years, multimodal natural language processing, aimed at learning from diverse data types, has garnered significant attention. However, there needs to be more clarity when it comes to analysing multimodal tasks in multi-lingual contexts. While prior studies on sentiment analysis of tweets have predominantly focused on the English language, this paper addresses this gap by transforming an existing textual Twitter sentiment dataset into a multimodal format through a straightforward curation process. Our work opens up new avenues for sentiment-related research within the research community. Additionally, we conduct baseline experiments utilising this augmented dataset and report the findings. Notably, our evaluations reveal that when comparing unimodal and multimodal configurations, using a sentiment-tuned large language model as a text encoder performs exceptionally well.

2023

pdf bib
Konkani ASR
Swapnil Fadte | Gaurish Thakkar | Jyoti D. Pawar
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Konkani is a resource-scarce language, mainly spoken on the west coast of India. The lack of resources directly impacts the development of language technology tools and services. Therefore, the development of digital resources is required to aid in the improvement of this situation. This paper describes the work on the Automatic Speech Recognition (ASR) System for Konkani language. We have created the ASR by fine-tuning the whisper-small ASR model with 100 hours of Konkani speech corpus data. The baseline model showed a word error rate (WER) of 17, which serves as evidence for the efficacy of the fine-tuning procedure in establishing ASR accuracy for Konkani language.

pdf bib
Croatian Film Review Dataset (Cro-FiReDa): A Sentiment Annotated Dataset of Film Reviews
Gaurish Thakkar | Nives Mikelic Preradovic | Marko Tadić
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

This paper introduces Cro-FiReDa, a sentiment-annotated dataset for Croatian in the domain of movie reviews. The dataset, which contains over 10,000 sentences, has been annotated at the sentence level. In addition to presentingthe overall annotation process, we also present benchmark results based on the transformer-based fine-tuning approach.

2020

pdf bib
Evaluating Language Tools for Fifteen EU-official Under-resourced Languages
Diego Alves | Gaurish Thakkar | Marko Tadić
Proceedings of the Twelfth Language Resources and Evaluation Conference

This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign, we concentrated on three existing NLP platforms (Stanford CoreNLP, NLP Cube, UDPipe) that all provide models for under-resourced languages and in this first run we covered 15 under-resourced languages for which the models were available. We present the design of the evaluation campaign and present the results as well as discuss them. We considered the difference between reported and our tested results within a single percentage point as being within the limits of acceptable tolerance and thus consider this result as reproducible. However, for a number of languages, the results are below what was reported in the literature, and in some cases, our testing results are even better than the ones reported previously. Particularly problematic was the evaluation of NERC systems. One of the reasons is the absence of universally or cross-lingually applicable named entities classification scheme that would serve the NERC task in different languages analogous to the Universal Dependency scheme in parsing task. To build such a scheme has become one of our the future research directions.

pdf bib
Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages
Diego Alves | Gaurish Thakkar | Marko Tadić
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This article presents the strategy for developing a platform containing Language Processing Chains for European Union languages, consisting of Tokenization to Parsing, also including Named Entity recognition and with addition of Sentiment Analysis. These chains are part of the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about major events that can cause an impact in Europe and the rest of the world. Due to the differences in terms of availability of language resources for each language, we have built this strategy in three steps, starting with processing chains for the well-resourced languages and finishing with the development of new modules for the under-resourced ones. In order to classify all European Union official languages in terms of resources, we have analysed the size of annotated corpora as well as the existence of pre-trained models in mainstream Language Processing tools, and we have combined this information with the proposed classification published at META-NET whitepaper series.

2017

pdf bib
Towards Normalising Konkani-English Code-Mixed Social Media Text
Akshata Phadte | Gaurish Thakkar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)