Grigori Sidorov

2025

pdf bib abs
CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding
Tadesse Destaw Belay | Ahmed Haj Ahmed | Alvin Grissom II | Iqra Ameer | Grigori Sidorov | Olga Kolesnikova | Seid Muhie Yimam
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer fromtwo major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required fordeeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designedto evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmocomprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLMs performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code is available.

Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored in LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely, Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive experiments with an additional English multi-label emotion dataset from SemEval 2018 Task 1. Our evaluation includes encoder-only, encoder-decoder, and decoder-only language models. We compare zero and few-shot approaches of LLMs to fine-tuning smaller language models. The results show that accurate multi-label emotion classification is still insufficient even for high-resource languages such as English, and there is a large gap between the performance of high-resource and low-resource languages. The results also show varying performance levels depending on the language and model type. EthioEmo is available publicly to further improve the understanding of emotions in language models and how people convey emotions through various languages.

pdf bib abs
Girma@DravidianLangTech 2025: Detecting AI Generated Product Reviews
Girma Yohannis Bade | Muhammad Tayyab Zamir | Olga Kolesnikova | José Luis Oropeza | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The increasing prevalence of AI-generated content, including fake product reviews, poses significant challenges in maintaining authenticity and trust in e-commerce systems. While much work has focused on detecting such reviews in high-resource languages, limited attention has been given to low-resource languages like Malayalam and Tamil. This study aims to address this gap by developing a robust framework to identify AI-generated product reviews in these languages. We explore a BERT-based approach for this task. Our methodology involves fine-tuning a BERT-based model specifically on Malayalam and Tamil datasets. The experiments are conducted using labeled datasets that contain a mix of human-written and AI-generated reviews. Performance is evaluated using the macro F1 score. The results show that the BERT-based model achieved a macro F1 score of 0.6394 for Tamil and 0.8849 for Malayalam. Preliminary results indicate that the BERT-based model performs significantly better for Malayalam than for Tamil in terms of the average Macro F1 score, leveraging its ability to capture the complex linguistic features of these languages. Finally, we open the source code of the implementation in the GitHub repository: AI-Generated-Product-Review-Code

pdf bib abs
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Tewodros Achamaleh | Tolulope Olalekan Abiola | Lemlem Eyob Kawo | Mikiyas Mebraihtu | Grigori Sidorov
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

AI-generated text now matches human writing so well that telling them apart is very difficult. Our CIC-NLP team submits results for the DravidianLangTech@NAACL 2025 shared task to reveal AI-generated product reviews in Dravidian languages. We performed a binary classification task with XLM-RoBERTa-Base using the DravidianLangTech@NAACL 2025 datasets offered by the event organizers. Through training the model correctly, our tests could tell between human and AI-generated reviews with scores of 0.96 for Tamil and 0.88 for Malayalam in the evaluation test set. This paper presents detailed information about preprocessing, model architecture, hyperparameter fine-tuning settings, the experimental process, and the results. The source code is available on GitHub1.

pdf bib abs
CIC-NLP@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Tewodros Achamaleh | Nida Hafeez | Mikiyas Mebraihtu | Fatima Uroosa | Grigori Sidorov
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Misinformation is a growing problem for technologycompanies and for society. Although there exists a large body of related work on identifying fake news in predominantlyresource languages, there is unfortunately a lack of such studies in low-resource languages (LRLs). Because corpora and annotated data are scarce in LRLs, the identification of false information remains at an exploratory stage. Fake news detection is critical in this digital era to avoid spreading misleading information. This research work presents an approach to Detect Fake News in Dravidian Languages. Our team CIC-NLP work primarily targets Task 1 which involves identifying whether a given social platform news is original or fake. For fake news detection (FND) problem, we used mBERT model and utilized the dataset that was provided by the organizers of the workshop. In this work, we describe our findings and the results of the proposed method. Our mBERT model achieved an F1 score of 0.853.

pdf bib abs
TechExperts(IPN) at GenAI Detection Task 1: Detecting AI-Generated Text in English and Multilingual Contexts
Gull Mehak | Amna Qasim | Abdul Gafar Manuel Meque | Nisar Hussain | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)

The ever-increasing spread of AI-generated text, driven by the considerable progress in large language models, entails a real problem for all digital platforms: how to ensure con tent authenticity. The team TechExperts(IPN) presents a method for detecting AI-generated content in English and multilingual contexts, using the google/gemma-2b model fine-tuned for COLING 2025 shared task 1 for English and multilingual. Training results show peak F1 scores of 97.63% for English and 97.87% for multilingual detection, highlighting the model’s effectiveness in supporting content integrity across platforms.

pdf bib abs
CIC-NLP at GenAI Detection Task 1: Advancing Multilingual Machine-Generated Text Detection
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Fatima Uroosa | Nida Hafeez | Grigori Sidorov | Olga Kolesnikova | Olumide Ebenezer Ojo
Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)

Machine-written texts are gradually becoming indistinguishable from human-generated texts, leading to the need to use sophisticated methods to detect them. Team CIC-NLP presents work in the Gen-AI Content Detection Task 1 at COLING 2025 Workshop: the focus of our work is on Subtask B of Task 1, which is the classification of text written by machines and human authors, with particular attention paid to identifying multilingual binary classification problem. Usng mBERT, we addressed the binary classification task using the dataset provided by the GenAI Detection Task team. mBERT acchieved a macro-average F1-score of 0.72 as well as an accuracy score of 0.73.

pdf bib abs
CIC-NLP at GenAI Detection Task 1: Leveraging DistilBERT for Detecting Machine-Generated Text in English
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Oluwatobi Joseph Abiola | Temitope Olasunkanmi Oladepo | Olumide Ebenezer Ojo | Grigori Sidorov | Olga Kolesnikova
Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)

As machine-generated texts (MGT) become increasingly similar to human writing, these dis- tinctions are harder to identify. In this paper, we as the CIC-NLP team present our submission to the Gen-AI Content Detection Workshop at COLING 2025 for Task 1 Subtask A, which involves distinguishing between text generated by LLMs and text authored by humans, with an emphasis on detecting English-only MGT. We applied the DistilBERT model to this binary classification task using the dataset provided by the organizers. Fine-tuning the model effectively differentiated between the classes, resulting in a micro-average F1-score of 0.70 on the evaluation test set. We provide a detailed explanation of the fine-tuning parameters and steps involved in our analysis.

pdf bib abs
A Hybrid Multilingual Approach to Sentiment Analysis for Uralic and Low-Resource Languages: Combining Extractive and Abstractive Techniques
Mikhail Krasitskii | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

This paper introduces a novel hybrid architecture for multilingual sentiment analysis specifically designed for morphologically complex Uralic languages. Our approach synergistically combines extractive and abstractive summarization with specialized morphological processing for agglutinative structures. The proposed model integrates dynamic thresholding mechanisms and culturally-aware attention layers, achieving statistically significant improvements of 12% accuracy for Uralic languages (p < 0.01) while outperforming state-of-the-art alternatives in summarization quality (ROUGE 1: 0.60 vs. 0.52). Key innovations include language-specific stemmers for Finno-Ugric languages and cross-Uralic transfer learning, yielding 15.7% improvement in recall while maintaining 98.2% precision. Comprehensive evaluations across multiple datasets demonstrate consistent superiority over contemporary baselines, with particular emphasis on addressing Uralic language processing challenges.

pdf bib abs
EM-26@LT-EDI 2025: Detecting Racial Hoaxes in Code-Mixed Social Media Data
Tewodros Achamaleh | Fatima Uroosa | Nida Hafeez | Tolulope Olalekan Abiola | Mikiyas Mebraihtu | Sara Getachew | Grigori Sidorov | Rolando Quintero
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Social media platforms and user-generated content, such as tweets, comments, and blog posts often contain offensive language, including racial hate speech, personal attacks, and sexual harassment. Detecting such inappropriate language is essential to ensure user safety and to prevent the spread of hateful behavior and online aggression. Approaches base on conventional machine learning and deep learning have shown robust results for high-resource languages like English and find it hard to deal with code-mixed text, which is common in bilingual communication. We participated in the shared task “LT-EDI@LDK 2025” organized by DravidianLangTech, applying the BERT-base multilingual cased model and achieving an F1 score of 0.63. These results demonstrate how our model effectively processes and interprets the unique linguistic features of code-mixed content. The source code is available on GitHub.1

pdf bib abs
EM-26@LT-EDI 2025: Caste and Migration Hate Speech Detection in Tamil-English Code-Mixed Social Media Texts
Tewodros Achamaleh | Tolulope Olalekan Abiola | Mikiyas Mebraihtu | Sara Getachew | Grigori Sidorov
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, we describe the system developed by Team EM-26 for the Shared Task on Caste and Migration Hate Speech Detection at LTEDI@LDK 2025. The task addresses the challenge of recognizing caste-based and migration related hate speech in Tamil social media text, a language that is both nuanced and under resourced for machine learning. To tackle this, we fine-tuned the multilingual transformer XLM-RoBERTa-Large on the provided training data, leveraging its cross-lingual strengths to detect both explicit and implicit hate speech. To improve performance, we applied social media focused preprocessing techniques, including Tamil text normalization and noise removal. Our model achieved a macro F1-score of 0.6567 on the test set, highlighting the effectiveness of multilingual transformers for low resource hate speech detection. Additionally, we discuss key challenges and errors in Tamil hate speech classification, which may guide future work toward building more ethical and inclusive AI systems. The source code is available on GitHub.1

pdf bib abs
Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions
Mikhail Krasitskii | Olga Kolesnikova | Liliana Chanona Hernandez | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study examines sentiment analysis in Tamil-English code-mixed texts using advanced transformer-based architectures. The unique linguistic challenges, including mixed grammar, orthographic variability, and phonetic inconsistencies, are addressed. Data limitations and annotation gaps are discussed, highlighting the need for larger datasets. The performance of models such as XLM-RoBERTa, mT5, IndicBERT, and RemBERT is evaluated, with insights into their optimization for low-resource, code-mixed environments.

pdf bib abs
Amado at SemEval-2025 Task 11: Multi-label Emotion Detection in Amharic and English Data
Girma Yohannis Bade | Olga Kolesnikova | José Luis Oropeza | Grigori Sidorov | Mesay Gemeda Yigezu
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Recently, social media has become a platform for different human emotions. Although most existing works treat the user’s opinions into a single emotion, the reality is that one user can have more than one emotion at a time, representing multiple emotions at the same time. Multi-label emotion detection is a more advanced and realistic approach, as it acknowledges the complexity of human emotions and their overlapping nature. This paper presents multi-label emotion detection in Amharic and English data. The work is part of SemEval2025 shared task 11, where tasks and datasets are offered by task organizers. To accomplish the aim of the given task, we fine-tune transformers base BERT model, passing through all different workflow pipelines. On unseen test data, the model evaluation achieved 0.6300 and 0.7025 an average macro F1-score for Amharic and English, respectively.

pdf bib abs
Tewodros at SemEval-2025 Task 11: Multilingual Emotion Intensity Detection using Small Language Models
Mikiyas Eyasu | Wendmnew Sitot Abebaw | Nida Hafeez | Fatima Uroosa | Tewodros Achamaleh Bizuneh | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Emotions play a fundamental role in the decision-making process, shaping human actions across diverse disciplines. The extensive usage of emotion intensity detection approaches has generated substantial research interest during the last few years. Efficient multi-label emotion intensity detection remains unsatisfactory even for high-resource languages, with a substantial performance gap among well-resourced and under-resourced languages. Team {textbf{Tewodros}} participated in SemEval-2025 Task 11, Track B, focusing on detecting text-based emotion intensity. Our work involved multi-label emotion intensity detection across three languages: Amharic, English, and Spanish, using the (afro-xlmr-large-76L), (DeBERTa-v3-base), and (BERT-base-Spanish-wwm-uncased) models. The models achieved an average F1 score of 0.6503 for Amharic, 0.5943 for English, and an accuracy score of 0.6228 for Spanish. These results demonstrate the effectiveness of our models in capturing emotion intensity across multiple languages.

pdf bib abs
CIC-IPN at SemEval-2025 Task 11: Transformer-Based Approach to Multi-Class Emotion Detection
Tolulope Abiola | Olumide Ebenezer Ojo | Grigori Sidorov | Olga Kolesnikova | Hiram Calvo
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents a multi-step approach for multi-label emotion classification as our system description paper for the SEMEVAL-2025 workshop Task A using machine learning and deep learning models. We test our methodology on English, Spanish, and low-resource Yoruba datasets, with each dataset labeled with five emotion categories: anger, fear, joy, sadness, and surprise. Our preprocessing involves text cleaning and feature extraction using bigrams and TF-IDF. We employ logistic regression for baseline classification and fine-tune Transformer models, such as BERT and XLM-RoBERTa, for improved performance. The Transformer-based models outperformed the logistic regression model, achieving micro-F1 scores of 0.7061, 0.7321, and 0.2825 for English, Spanish, and Yoruba, respectively. Notably, our Yoruba fine-tuned model outperformed the baseline model of the task organizers with micro-F1 score of 0.092, demonstrating the effectiveness of Transformer models in handling emotion classification tasks across diverse languages.

2024

pdf bib abs
Social Media Fake News Classification Using Machine Learning Algorithm
Girma Yohannis Bade | Olga Kolesnikova | Grigori Sidorov | José Luis Oropeza
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rise of social media has facilitated easier communication, information sharing, and current affairs updates. However, the prevalence of misleading and deceptive content, commonly referred to as fake news, poses a significant challenge. This paper focuses on the classification of fake news in Malayalam, a Dravidian language, utilizing natural language processing (NLP) techniques. To develop a model, we employed a random forest machine learning method on a dataset provided by a shared task(DravidianLangTech@EACL 2024)1. When evaluated by the separate test dataset, our developed model achieved a 0.71 macro F1 measure.

pdf bib abs
Selam@DravidianLangTech 2024:Identifying Hate Speech and Offensive Language
Selam Abitte Kanta | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Social media has transformed into a powerful tool for sharing information while upholding the principle of free expression. However, this open platform has given rise to significant issues like hate speech, cyberbullying, aggression, and offensive language, negatively impacting societal well-being. These problems can even lead to severe consequences such as suicidal thoughts, affecting the mental health of the victims. Our primary goal is to develop an automated system for the rapid detection of offensive content on social media, facilitating timely interventions and moderation. This research employs various machine learning classifiers, utilizing character N-gram TF-IDF features. Additionally, we introduce SVM, RL, and Convolutional Neural Network (CNN) models specifically designed for hate speech detection. SVM utilizes character Ngram TF-IDF features, while CNN employs word embedding features. Through extensive experiments, we achieved optimal results, with a weighted F1-score of 0.77 in identifying hate speech and offensive language.

pdf bib abs
Tewodros@DravidianLangTech 2024: Hate Speech Recognition in Telugu Codemixed Text
Tewodros Achamaleh | Lemlem Kawo | Ildar Batyrshini | Grigori Sidorov
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This study goes into our team’s active participation in the Hate and Offensive Language Detection in Telugu Codemixed Text (HOLDTelugu) shared task, which is an essential component of the DravidianLangTech@EACL 2024 workshop. The ultimate goal of this collaborative work is to push the bounds of hate speech recognition, especially tackling the issues given by codemixed text in Telugu, where English blends smoothly. Our inquiry offers a complete evaluation of the task’s aims, the technique used, and the precise achievements obtained by our team, providing a full insight into our contributions to this crucial linguistic and technical undertaking.

pdf bib abs
Lidoma@DravidianLangTech 2024: Identifying Hate Speech in Telugu Code-Mixed: A BERT Multilingual
Muhammad Zamir | Moein Tash | Zahra Ahani | Alexander Gelbukh | Grigori Sidorov
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Over the past few years, research on hate speech and offensive content identification on social media has been ongoing. Since most people in the world are not native English speakers, unapproved messages are typically sent in code-mixed language. We accomplished collaborative work to identify the language of code-mixed text on social media in order to address the difficulties associated with it in the Telugu language scenario. Specifically, we participated in the shared task on the provided dataset by the Dravidian- LangTech Organizer for the purpose of identifying hate and non-hate content. The assignment is to classify each sentence in the provided text into two predetermined groups: hate or non-hate. We developed a model in Python and selected a BERT multilingual to do the given task. Using a train-development data set, we developed a model, which we then tested on test data sets. An average macro F1 score metric was used to measure the model’s performance. For the task, the model reported an average macro F1 of 0.6151.

pdf bib abs
Habesha@DravidianLangTech 2024: Detecting Fake News Detection in Dravidian Languages using Deep Learning
Mesay Gemeda Yigezu | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This research tackles the issue of fake news by utilizing the RNN-LSTM deep learning method with optimized hyperparameters identified through grid search. The model’s performance in multi-label classification is hindered by unbalanced data, despite its success in binary classification. We achieved a score of 0.82 in the binary classification task, whereas in the multi-class task, the score was 0.32. We suggest incorporating data balancing techniques for researchers who aim to further this task, aiming to improve results in managing a variety of information.

pdf bib abs
Social Media Hate and Offensive Speech Detection Using Machine Learning method
Girma Yohannis Bade | Olga Kolesnikova | Grigori Sidorov | José Luis Oropeza
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Even though the improper use of social media is increasing nowadays, there is also technology that brings solutions. Here, improperness is posting hate and offensive speech that might harm an individual or group. Hate speech refers to an insult toward an individual or group based on their identities. Spreading it on social media platforms is a serious problem for society. The solution, on the other hand, is the availability of natural language processing(NLP) technology that is capable to detect and handle such problems. This paper presents the detection of social media’s hate and offensive speech in the code-mixed Telugu language. For this, the task and golden standard dataset were provided for us by the shared task organizer (DravidianLangTech@ EACL 2024)1. To this end, we have employed the TF-IDF technique for numeric feature extraction and used a random forest algorithm for modeling hate speech detection. Finally, the developed model was evaluated on the test dataset and achieved 0.492 macro-F1.

pdf bib abs
Multilingual Approaches to Sentiment Analysis of Texts in Linguistically Diverse Languages: A Case Study of Finnish, Hungarian, and Bulgarian
Mikhail Krasitskii | Olga Kolesnikova | Liliana Chanona Hernandez | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

This article is dedicated to the study of multilingual approaches to sentiment analysis of texts in Finnish, Hungarian, and Bulgarian. For Finnish and Hungarian, which are characterized by complex morphology and agglutinative grammar, an analysis was conducted using both traditional rule-based methods and modern machine learning techniques. In the study, BERT, XLM-R, and mBERT models were used for sentiment analysis, demonstrating high accuracy in sentiment classification. The inclusion of Bulgarian was motivated by the opportunity to compare results across languages with varying degrees of morphological complexity, which allowed for a better understanding of how these models can adapt to different linguistic structures. Datasets such as the Hungarian Emotion Corpus, FinnSentiment, and SentiFi were used to evaluate model performance. The results showed that transformer-based models, particularly BERT, XLM-R, and mBERT, significantly outperformed traditional methods, achieving high accuracy in sentiment classification tasks for all the languages studied.

pdf bib abs
Pinealai_StressIdent_LT-EDI@EACL2024: Minimal configurations for Stress Identification in Tamil and Telugu
Anvi Alex Eponon | Ildar Batyrshin | Grigori Sidorov
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

This paper introduces an approach to stress identification in Tamil and Telugu, leveraging traditional machine learning models—Fasttext for Tamil and Naive Bayes for Telugu—yielding commendable results. The study highlights the scarcity of annotated data and recognizes limitations in phonetic features relevant to these languages, impacting precise information extraction. Our models achieved a macro F1 score of 0.77 for Tamil and 0.72 for Telugu with Fasttext and Naive Bayes, respectively. While the Telugu model secured the second rank in shared tasks, ongoing research is crucial to unlocking the full potential of stress identification in these languages, necessitating the exploration of additional features and advanced techniques specified in the discussions and limitations section.

2023

pdf bib abs
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
Atnafu Lambebo Tonja | Christian Maldonado-sifuentes | David Alejandro Mendoza Castillo | Olga Kolesnikova | Noé Castro-Sánchez | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

In this paper, we present a parallel Spanish- Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook m2m100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The results indicate that translation performance is influenced by the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) and is more effective when indigenous languages are used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings.

pdf bib abs
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
Atnafu Lambebo Tonja | Hellina Hailu Nigatu | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh | Jugal Kalita
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes CIC NLP’s submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) — Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages from America and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages.

pdf bib abs
Selam@DravidianLangTech:Sentiment Analysis of Code-Mixed Dravidian Texts using SVM Classification
Selam Kanta | Grigori Sidorov
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment analysis in code-mixed text written in Dravidian languages. Specifically, Tamil- English and Tulu-English. This paper describes the system paper of the RANLP-2023 shared task. The goal of this shared task is to develop systems that accurately classify the sentiment polarity of code-mixed comments and posts. be provided with development, training, and test data sets containing code-mixed text in Tamil- English and Tulu-English. The task involves message-level polarity classification, to classify YouTube comments into positive, negative, neutral, or mixed emotions. This Code- Mix was compiled by RANLP-2023 organizers from posts on social media. We use classification techniques SVM and achieve an F1 score of 0.147 for Tamil-English and 0.518 for Tulu- English.

pdf bib abs
LIDOMA@DravidianLangTech: Convolutional Neural Networks for Studying Correlation Between Lexical Features and Sentiment Polarity in Tamil and Tulu Languages
Moein Tash | Jesus Armenta-Segura | Zahra Ahani | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

With the prevalence of code-mixing among speakers of Dravidian languages, DravidianLangTech proposed the shared task on Sentiment Analysis in Tamil and Tulu at RANLP 2023. This paper presents the submission of LIDOMA, which proposes a methodology that combines lexical features and Convolutional Neural Networks (CNNs) to address the challenge. A fine-tuned 6-layered CNN model is employed, achieving macro F1 scores of 0.542 and 0.199 for Tulu and Tamil, respectively

pdf bib abs
Habesha@DravidianLangTech: Utilizing Deep and Transfer Learning Approaches for Sentiment Analysis.
Mesay Gemeda Yigezu | Tadesse Kebede | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This research paper focuses on sentiment analysis of Tamil and Tulu texts using a BERT model and an RNN model. The BERT model, which was pretrained, achieved satisfactory performance for the Tulu language, with a Macro F1 score of 0.352. On the other hand, the RNN model showed good performance for Tamil language sentiment analysis, obtaining a Macro F1 score of 0.208. As future work, the researchers aim to fine-tune the models to further improve their results after the training process.

pdf bib abs
Habesha@DravidianLangTech: Abusive Comment Detection using Deep Learning Approach
Mesay Gemeda Yigezu | Selam Kanta | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This research focuses on identifying abusive language in comments. The study utilizes deep learning models, including Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs), to analyze linguistic patterns. Specifically, the LSTM model, a type of RNN, is used to understand the context by capturing long-term dependencies and intricate patterns in the input sequences. The LSTM model achieves better accuracy and is enhanced through the addition of a dropout layer and early stopping. For detecting abusive language in Telugu and Tamil-English, an LSTM model is employed, while in Tamil abusive language detection, a word-level RNN is developed to identify abusive words. These models process text sequentially, considering overall content and capturing contextual dependencies.

2022

pdf bib abs
MUCIC@TamilNLP-ACL2022: Abusive Comment Detection in Tamil Language using 1D Conv-LSTM
Fazlourrahman Balouchzahi | Anusha Gowda | Hosahalli Shashirekha | Grigori Sidorov
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Abusive language content such as hate speech, profanity, and cyberbullying etc., which is common in online platforms is creating lot of problems to the users as well as policy makers. Hence, detection of such abusive language in user-generated online content has become increasingly important over the past few years. Online platforms strive hard to moderate the abusive content to reduce societal harm, comply with laws, and create a more inclusive environment for their users. In spite of various methods to automatically detect abusive languages in online platforms, the problem still persists. To address the automatic detection of abusive languages in online platforms, this paper describes the models submitted by our team - MUCIC to the shared task on “Abusive Comment Detection in Tamil-ACL 2022”. This shared task addresses the abusive comment detection in native Tamil script texts and code-mixed Tamil texts. To address this challenge, two models: i) n-gram-Multilayer Perceptron (n-gram-MLP) model utilizing MLP classifier fed with char-n gram features and ii) 1D Convolutional Long Short-Term Memory (1D Conv-LSTM) model, were submitted. The n-gram-MLP model fared well among these two models with weighted F1-scores of 0.560 and 0.430 for code-mixed Tamil and native Tamil script texts, respectively. This work may be reproduced using the code available in https://github.com/anushamdgowda/abusive-detection.

pdf bib abs
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
Atnafu Lambebo Tonja | Mesay Gemeda Yigezu | Olga Kolesnikova | Moein Shahiki Tash | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts

Language Identification at the Word Level in Kannada-English Texts. This paper describes the system paper of CoLI-Kanglish 2022 shared task. The goal of this task is to identify the different languages used in CoLI-Kanglish 2022. This dataset is distributed into different categories including Kannada, English, Mixed-Language, Location, Name, and Others. This Code-Mix was compiled by CoLI-Kanglish 2022 organizers from posts on social media. We use two classification techniques, KNN and SVM, and achieve an F1-score of 0.58 and place third out of nine competitors.

pdf bib abs
Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach
Mesay Gemeda Yigezu | Atnafu Lambebo Tonja | Olga Kolesnikova | Moein Shahiki Tash | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts

The goal of code-mixed language identification (LID) is to determine which language is spoken or written in a given segment of a speech, word, sentence, or document. Our task is to identify English, Kannada, and mixed language from the provided data. To train a model we used the CoLI-Kenglish dataset, which contains English, Kannada, and mixed-language words. In our work, we conducted several experiments in order to obtain the best performing model. Then, we implemented the best model by using Bidirectional Long Short Term Memory (Bi-LSTM), which outperformed the other trained models with an F1-score of 0.61%.

pdf bib abs
Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022
F. Balouchzahi | S. Butt | A. Hegde | N. Ashraf | H.l. Shashirekha | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts

The task of Language Identification (LI) in text processing refers to automatically identifying the languages used in a text document. LI task is usually been studied at the document level and often in high-resource languages while giving less importance to low-resource languages. However, with the recent advance- ment in technologies, in a multilingual country like India, many low-resource language users post their comments using English and one or more language(s) in the form of code-mixed texts. Combination of Kannada and English is one such code-mixed text of mixing Kannada and English languages at various levels. To address the word level LI in code-mixed text, in CoLI-Kanglish shared task, we have focused on open-sourcing a Kannada-English code-mixed dataset for word level LI of Kannada, English and mixed-language words written in Roman script. The task includes classifying each word in the given text into one of six predefined categories, namely: Kannada (kn), English (en), Kannada-English (kn-en), Name (name), Lo-cation (location), and Other (other). Among the models submitted by all the participants, the best performing model obtained averaged-weighted and averaged-macro F1 scores of 0.86 and 0.62 respectively.

pdf bib abs
MUCIC@LT-EDI-ACL2022: Hope Speech Detection using Data Re-Sampling and 1D Conv-LSTM
Anusha Gowda | Fazlourrahman Balouchzahi | Hosahalli Shashirekha | Grigori Sidorov
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Spreading positive vibes or hope content on social media may help many people to get motivated in their life. To address Hope Speech detection in YouTube comments, this paper presents the description of the models submitted by our team - MUCIC, to the Hope Speech Detection for Equality, Diversity, and Inclusion (HopeEDI) shared task at Association for Computational Linguistics (ACL) 2022. This shared task consists of texts in five languages, namely: English, Spanish (in Latin scripts), and Tamil, Malayalam, and Kannada (in code-mixed native and Roman scripts) with the aim of classifying the YouTube comment into “Hope”, “Not-Hope” or “Not-Intended” categories. The proposed methodology uses the re-sampling technique to deal with imbalanced data in the corpus and obtained 1st rank for English language with a macro-averaged F1-score of 0.550 and weighted-averaged F1-score of 0.860. The code to reproduce this work is available in GitHub.

pdf bib abs
CIC@LT-EDI-ACL2022: Are transformers the only hope? Hope speech detection for Spanish and English comments
Fazlourrahman Balouchzahi | Sabur Butt | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Hope is an inherent part of human life and essential for improving the quality of life. Hope increases happiness and reduces stress and feelings of helplessness. Hope speech is the desired outcome for better and can be studied using text from various online sources where people express their desires and outcomes. In this paper, we address a deep-learning approach with a combination of linguistic and psycho-linguistic features for hope-speech detection. We report our best results submitted to LT-EDI-2022 which ranked 2nd and 3rd in English and Spanish respectively.

pdf bib abs
CIC NLP at SMM4H 2022: a BERT-based approach for classification of social media forum posts
Atnafu Lambebo Tonja | Olumide Ebenezer Ojo | Mohammed Arif Khan | Abdul Gafar Manuel Meque | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our submissions for the Social Media Mining for Health (SMM4H) 2022 shared tasks. We participated in 2 tasks: a) Task 4: Classification of Tweets self-reporting exact age and b) Task 9: Classification of Reddit posts self-reporting exact age. We evaluated the two( BERT and RoBERTa) transformer based models for both tasks. For Task 4 RoBERTa-Large achieved an F1 score of 0.846 on the test set and BERT-Large achieved an F1 score of 0.865 on the test set for Task 9.

2021

pdf bib abs
MUCIC at ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Using N-grams and Multilingual Sentence Encoders
Fazlourrahman Balouchzahi | Oxana Vitman | Hosahalli Lakshmaiah Shashirekha | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

Social media analytics are widely being explored by researchers for various applications. Prominent among them are identifying and blocking abusive contents especially targeting individuals and communities, for various reasons. The increasing abusive contents and the increasing number of users on social media demands automated tools to detect and filter the abusive contents as it is highly impossible to handle this manually. To address the challenges of detecting abusive contents, this paper describes the approaches proposed by our team MUCIC for Multilingual Gender Biased and Communal Language Identification shared task (ComMA@ICON) at International Conference on Natural Language Processing (ICON) 2021. This shared task dataset consists of code-mixed multi-script texts in Meitei, Bangla, Hindi as well as in Multilingual (a combination of Meitei, Bangla, Hindi, and English). The shared task is modeled as a multi-label Text Classification (TC) task combining word and char n-grams with vectors obtained from Multilingual Sentence Encoder (MSE) to train the Machine Learning (ML) classifiers using Pre-aggregation and Post-aggregation of labels. These approaches obtained the highest performance in the shared task for Meitei, Bangla, and Multilingual texts with instance-F1 scores of 0.350, 0.412, and 0.380 respectively using Pre-aggregation of labels.

2020

pdf bib abs
Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language
Maaz Amjad | Grigori Sidorov | Alisa Zhila
Proceedings of the Twelfth Language Resources and Evaluation Conference

The task of fake news detection is to distinguish legitimate news articles that describe real facts from those which convey deceiving and fictitious information. As the fake news phenomenon is omnipresent across all languages, it is crucial to be able to efficiently solve this problem for languages other than English. A common approach to this task is supervised classification using features of various complexity. Yet supervised machine learning requires substantial amount of annotated data. For English and a small number of other languages, annotated data availability is much higher, whereas for the vast majority of languages, it is almost scarce. We investigate whether machine translation at its present state could be successfully used as an automated technique for annotated corpora creation and augmentation for fake news detection focusing on the English-Urdu language pair. We train a fake news classifier for Urdu on (1) the manually annotated dataset originally in Urdu and (2) the machine-translated version of an existing annotated fake news dataset originally in English. We show that at the present state of machine translation quality for the English-Urdu language pair, the fully automated data augmentation through machine translation did not provide improvement for fake news detection in Urdu.

pdf bib abs
The IPN-CIC team system submission for the WMT 2020 similar language task
Luis A. Menéndez-Salazar | Grigori Sidorov | Marta R. Costa-Jussà
Proceedings of the Fifth Conference on Machine Translation

This paper describes the participation of the NLP research team of the IPN Computer Research center in the WMT 2020 Similar Language Translation Task. We have submitted systems for the Spanish-Portuguese language pair (in both directions). The three submitted systems are based on the Transformer architecture and used fine tuning for domain Adaptation.

2019

pdf bib abs
CIC at SemEval-2019 Task 5: Simple Yet Very Efficient Approach to Hate Speech Detection, Aggressive Behavior Detection, and Target Classification in Twitter
Iqra Ameer | Muhammad Hammad Fahim Siddiqui | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 13th International Workshop on Semantic Evaluation

In recent years, the use of social media has in-creased incredibly. Social media permits Inter-net users a friendly platform to express their views and opinions. Along with these nice and distinct communication chances, it also allows bad things like usage of hate speech. Online automatic hate speech detection in various aspects is a significant scientific problem. This paper presents the Instituto Politécnico Nacional (Mexico) approach for the Semeval 2019 Task-5 [Hateval 2019] (Basile et al., 2019) competition for Multilingual Detection of Hate Speech on Twitter. The goal of this paper is to detect (A) Hate speech against immigrants and women, (B) Aggressive behavior and target classification, both for English and Spanish. In the proposed approach, we used a bag of words model with preprocessing (stem-ming and stop words removal). We submitted two different systems with names: (i) CIC-1 and (ii) CIC-2 for Hateval 2019 shared task. We used TF values in the first system and TF-IDF for the second system. The first system, CIC-1 got 2nd rank in subtask B for both English and Spanish languages with EMR score of 0.568 for English and 0.675 for Spanish. The second system, CIC-2 was ranked 4th in sub-task A and 1st in subtask B for Spanish language with a macro-F1 score of 0.727 and EMR score of 0.705 respectively.

2018

pdf bib abs
The Role of Emotions in Native Language Identification
Ilia Markov | Vivi Nastase | Carlo Strapparava | Grigori Sidorov
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We explore the hypothesis that emotion is one of the dimensions of language that surfaces from the native language into a second language. To check the role of emotions in native language identification (NLI), we model emotion information through polarity and emotion load features, and use document representations using these features to classify the native language of the author. The results indicate that emotion is relevant for NLI, even for high proficiency levels and across topics.

2017

pdf bib abs
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.

pdf bib abs
CIC-FBK Approach to Native Language Identification
Ilia Markov | Lingzhen Chen | Carlo Strapparava | Grigori Sidorov
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are novel in this task, such as typed character n-grams, and syntactic n-grams of words and of syntactic relation tags. We use log-entropy weighting scheme and perform classification using the Support Vector Machines (SVM) algorithm. Our system achieved 0.8808 macro-averaged F1-score and shared the 1st rank in the NLI Shared Task 2017 scoring.

The paper presents an approach for constructing a weighted bilingual dictionary of inflectional forms using as input data a traditional bilingual dictionary, and not parallel corpora. An algorithm is developed that generates all possible morphological (inflectional) forms and weights them using information on distribution of corresponding grammar sets (grammar information) in large corpora for each language. The algorithm also takes into account the compatibility of grammar sets in a language pair; for example, verb in past tense in language L normally is expected to be translated by verb in past tense in Language L'. We consider that the developed method is universal, i.e. can be applied to any pair of languages. The obtained dictionary is freely available. It can be used in several NLP tasks, for example, statistical machine translation.

2001

pdf bib abs
Word Sense Disambiguation in a Spanish Explanatory Dictionary
Grigori Sidorov | Alexander Gelbukh
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

We apply word sense disambiguation to the definitions in a Spanish explanatory dictionary. To calculate the scores of word senses basing on the context (which in our case is the dictionary definition), we use a modification of Lesk’s algorithm. The algorithm relies on a comparison between two words. In the original Lesk’s algorithm, the comparison is trivial: two words are either the same lexeme or not; our modification consists in fuzzy (weighted) comparison using a large synonym dictionary and a simple derivational morphology system. Application of disambiguation to dictionary definitions (in contrast to usual texts) allows for some simplifications of the algorithm, e.g., we do not have to care of context window size.