2024
pdf
bib
abs
Innovators at SemEval-2024 Task 10: Revolutionizing Emotion Recognition and Flip Analysis in Code-Mixed Texts
Abhay Shanbhag
|
Suramya Jadhav
|
Shashank Rathi
|
Siddhesh Pande
|
Dipali Kadam
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In this paper, we introduce our system for all three tracks of the SemEval 2024 EDiReF Shared Task 10, which focuses on Emotion Recognition in Conversation (ERC) and Emotion Flip Reasoning (EFR) within the domain of conversational analysis. Task-Track 1 (ERC) aims to assign an emotion to each utterance in the Hinglish language from a predefined set of possible emotions. Tracks 2 (EFR) and 3 (EFR) aim to identify the trigger utterance(s) for an emotion flip in a multi-party conversation dialogue in Hinglish and English text, respectively. For Track 1, our study spans both traditional machine learning ensemble techniques, including Decision Trees, SVM, Logistic Regression, and Multinomial NB models, as well as advanced transformer-based models like XLM-Roberta (XLMR), DistilRoberta, and T5 from Hugging Face’s transformer library. In the EFR competition, we developed and proposed two innovative algorithms to tackle the challenges presented in Tracks 2 and 3. Specifically, our team, Innovators, developed a standout algorithm that propelled us to secure the 2nd rank in Track 2, achieving an impressive F1 score of 0.79, and the 7th rank in Track 3, with an F1 score of 0.68.
pdf
bib
abs
CAILMD-23 at SemEval-2024 Task 1: Multilingual Evaluation of Semantic Textual Relatedness
Srushti Sonavane
|
Sharvi Endait
|
Ridhima Sinare
|
Pritika Rohera
|
Advait Naik
|
Dipali Kadam
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages. Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Despite its pivotal role, prior NLP research has predominantly focused on English, limiting its applicability across languages. Addressing this gap, our paper dives into capturing deeper connections between sentences beyond simple word overlap. Going beyond English-centric NLP research, we explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more. Leveraging the SemEval-2024 shared task, we explore various language models across three learning paradigms: supervised, unsupervised, and cross-lingual. Our comprehensive methodology gains promising results, demonstrating the effectiveness of our approach. This work aims to not only showcase our achievements but also inspire further research in multilingual STR, particularly for low-resourced languages.
2023
pdf
bib
abs
Trinity at SemEval-2023 Task 12: Sentiment Analysis for Low-resource African Languages using Twitter Dataset
Shashank Rathi
|
Siddhesh Pande
|
Harshwardhan Atkare
|
Rahul Tangsali
|
Aditya Vyawahare
|
Dipali Kadam
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
In this paper, we have performed sentiment analysis on three African languages (Hausa, Swahili, and Yoruba). We used various deep learning and traditional models paired with a vectorizer for classification and data -preprocessing. We have also used a few data oversampling methods to handle the imbalanced text data. Thus, we could analyze the performance of those models in all the languages by using weighted and macro F1 scores as evaluation metrics.
pdf
bib
abs
Team Converge at ProbSum 2023: Abstractive Text Summarization of Patient Progress Notes
Gaurav Kolhatkar
|
Aditya Paranjape
|
Omkar Gokhale
|
Dipali Kadam
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
In this paper, we elaborate on our approach for the shared task 1A issued by BioNLP Workshop 2023 titled Problem List Summarization. With an increase in the digitization of health records, a need arises for quick and precise summarization of large amounts of records. With the help of summarization, medical professionals can sieve through multiple records in a short span of time without overlooking any crucial point. We use abstractive text summarization for this task and experiment with multiple state-of-the-art models like Pegasus, BART, and T5, along with various pre-processing and data augmentation techniques to generate summaries from patients’ progress notes. For this task, the metric used was the ROUGE-L score. From our experiments, we conclude that Pegasus is the best-performing model on the dataset, achieving a ROUGE-L F1 score of 0.2744 on the test dataset (3rd rank on the leaderboard).
pdf
bib
abs
CASM - Context and Something More in Lexical Simplification
Atharva Kumbhar
|
Sheetal Sonawane
|
Dipali Kadam
|
Prathamesh Mulay
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Lexical Simplification is a challenging task that aims to improve the readability of text for nonnative people, people with dyslexia, and any linguistic impairments. It consists of 3 components: 1) Complex Word Identification 2) Substitute Generation 3) Substitute Ranking. Current methods use contextual information as a primary source in all three stages of the simplification pipeline. We argue that while context is an important measure, it alone is not sufficient in the process. In the complex word identification step, contextual information is inadequate, moreover, heavy feature engineering is required to use additional linguistic features. This paper presents a novel architecture for complex word identification that uses a pre-trained transformer model’s information flow through its hidden layers as a feature representation that implicitly encodes all the features required for identification. We portray how database methods and masked language modeling can be complementary to one another in substitute generation and ranking process that is built on the foundational pillars of Simplicity, Grammatical and Semantic correctness, and context preservation. We show that our proposed model generalizes well and outperforms the current state-of-the-art on wellknown datasets.
2022
pdf
bib
abs
PICT@WAT 2022: Neural Machine Translation Systems for Indic Languages
Anupam Patil
|
Isha Joshi
|
Dipali Kadam
Proceedings of the 9th Workshop on Asian Translation
Translation entails more than simply translating words from one language to another. It is vitally essential for effective cross-cultural communication, thus making good translation systems an important requirement. We describe our systems in this paper, which were submitted to the WAT 2022 translation shared tasks. As part of the Multi-modal translation tasks’ text-only translation sub-tasks, we submitted three Neural Machine Translation systems based on Transformer models for English to Malayalam, English to Bengali, and English to Hindi text translation. We found significant results on the leaderboard for English-Indic (en-xx) systems utilizing BLEU and RIBES scores as comparative metrics in our studies. For the respective translations of English to Malayalam, Bengali, and Hindi, we obtained BLEU scores of 19.50, 32.90, and 41.80 for the challenge subset and 30.60, 39.80, and 42.90 on the benchmark evaluation subset data.
pdf
bib
To Train or Not to Train: Predicting the Performance of Massively Multilingual Models
Shantanu Patankar
|
Omkar Gokhale
|
Onkar Litake
|
Aditya Mandke
|
Dipali Kadam
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation
pdf
bib
abs
PICT@DravidianLangTech-ACL2022: Neural Machine Translation On Dravidian Languages
Aditya Vyawahare
|
Rahul Tangsali
|
Aditya Mandke
|
Onkar Litake
|
Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
This paper presents a summary of the findings that we obtained based on the shared task on machine translation of Dravidian languages. As a part of this shared task, we carried out neural machine translations for the following five language pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Malayalam, Kannada to Sanskrit, and Kannada to Tulu. The datasets for each of the five language pairs were used to train various translation models, including Seq2Seq models such as LSTM, bidirectional LSTM, Conv Seq2Seq, and training state-of-the-art as transformers from scratch, and fine-tuning already pre-trained models. For some models involving monolingual corpora, we implemented backtranslation as well. These models’ accuracy was later tested with a part of the same dataset using BLEU score as an evaluation metric.
pdf
bib
abs
Optimize_Prime@DravidianLangTech-ACL2022: Emotion Analysis in Tamil
Omkar Gokhale
|
Shantanu Patankar
|
Onkar Litake
|
Aditya Mandke
|
Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
This paper aims to perform an emotion analysis of social media comments in Tamil. Emotion analysis is the process of identifying the emotional context of the text. In this paper, we present the findings obtained by Team Optimize_Prime in the ACL 2022 shared task “Emotion Analysis in Tamil.” The task aimed to classify social media comments into categories of emotion like Joy, Anger, Trust, Disgust, etc. The task was further divided into two subtasks, one with 11 broad categories of emotions and the other with 31 specific categories of emotion. We implemented three different approaches to tackle this problem: transformer-based models, Recurrent Neural Networks (RNNs), and Ensemble models. XLM-RoBERTa performed the best on the first task with a macro-averaged f1 score of 0.27, while MuRIL provided the best results on the second task with a macro-averaged f1 score of 0.13.
pdf
bib
abs
Optimize_Prime@DravidianLangTech-ACL2022: Abusive Comment Detection in Tamil
Shantanu Patankar
|
Omkar Gokhale
|
Onkar Litake
|
Aditya Mandke
|
Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
This paper tries to address the problem of abusive comment detection in low-resource indic languages. Abusive comments are statements that are offensive to a person or a group of people. These comments are targeted toward individuals belonging to specific ethnicities, genders, caste, race, sexuality, etc. Abusive Comment Detection is a significant problem, especially with the recent rise in social media users. This paper presents the approach used by our team — Optimize_Prime, in the ACL 2022 shared task “Abusive Comment Detection in Tamil.” This task detects and classifies YouTube comments in Tamil and Tamil-English Codemixed format into multiple categories. We have used three methods to optimize our results: Ensemble models, Recurrent Neural Networks, and Transformers. In the Tamil data, MuRIL and XLM-RoBERTA were our best performing models with a macro-averaged f1 score of 0.43. Furthermore, for the Code-mixed data, MuRIL and M-BERT provided sublime results, with a macro-averaged f1 score of 0.45.
pdf
bib
abs
Unsupervised and Very-Low Resource Supervised Translation on German and Sorbian Variant Languages
Rahul Tangsali
|
Aditya Vyawahare
|
Aditya Mandke
|
Onkar Litake
|
Dipali Kadam
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents the work of team PICT-NLP for the shared task on unsupervised and very low-resource supervised machine translation, organized by the Workshop on Machine Translation, a workshop in collocation with the Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). The paper delineates the approaches we implemented for supervised and unsupervised translation between the following 6 language pairs: German-Lower Sorbian (de-dsb), Lower Sorbian-German (dsb-de), Lower Sorbian-Upper Sorbian (dsb-hsb), Upper Sorbian-Lower Sorbian (hsb-dsb), German-Upper Sorbian (de-hsb), and Upper Sorbian-German (hsb-de). For supervised learning, we implemented the transformer architecture from scratch using the Fairseq library. Whereas for unsupervised learning, we implemented Facebook’s XLM masked language modeling approach. We discuss the training details for the models we used, and the results obtained from our approaches. We used the BLEU and chrF metrics for evaluating the accuracies of the generated translations on our systems.