2023
pdf
bib
abs
Using Ensemble Learning in Language Variety Identification
Mihaela Gaman
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
The present work describes the solutions pro- posed by the UnibucNLP team to address the closed format of the DSL-TL task featured in the tenth VarDial Evaluation Campaign. The DSL-TL organizers provided approximately 11 thousand sentences written in three different languages and manually tagged with one of 9 classes. Out of these, 3 tags are considered common label and the remaining 6 tags are variety-specific. The DSL-TL task features 2 subtasks: Track 1 - a three-way and Track 2 - a two-way classification per language. In Track 2 only the variety-specific labels are used for scoring, whereas in Track 1 the common label is considered as well. Our team participated in both tracks, with three ensemble-based sub- missions for each. The meta-learner used for Track 1 is XGBoost and for Track 2, Logis- tic Regression. With each submission, we are gradually increasing the complexity of the en- semble, starting with two shallow, string-kernel based methods. To the first ensemble, we add a convolutional neural network for our second submission. The third ensemble submitted adds a fine-tuned BERT model to the second one. In Track 1, ensemble three is our highest ranked, with an F1 − score of 53.18%; 5.36% less than the leader. Surprisingly, in Track 2 the en- semble of shallow methods surpasses the other two, more complex ensembles, achieving an F 1 − score of 69.35%.
2022
pdf
bib
abs
Findings of the VarDial Evaluation Campaign 2022
Noëmi Aepli
|
Antonios Anastasopoulos
|
Adrian-Gabriel Chifu
|
William Domingues
|
Fahim Faisal
|
Mihaela Gaman
|
Radu Tudor Ionescu
|
Yves Scherrer
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part of the ninth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2022. Three separate shared tasks were included this year: Identification of Languages and Dialects of Italy (ITDI), French Cross-Domain Dialect Identification (FDI), and Dialectal Extractive Question Answering (DialQA). All three tasks were organized for the first time this year.
2021
pdf
bib
abs
SaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles
Ana-Cristina Rogoz
|
Mihaela Găman
|
Radu Tudor Ionescu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73% on the test set) compared to the human-level accuracy (87%), leaving enough room for improvement in future research.
pdf
bib
abs
Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa - A Large Romanian Sentiment Data Set
Anca Maria Tache
|
Mihaela Găman
|
Radu Tudor Ionescu
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf’s law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic.
pdf
bib
abs
Findings of the VarDial Evaluation Campaign 2021
Bharathi Raja Chakravarthi
|
Mihaela Găman
|
Radu Tudor Ionescu
|
Heidi Jauhiainen
|
Tommi Jauhiainen
|
Krister Lindén
|
Nikola Ljubešić
|
Niko Partanen
|
Ruba Priyadharshini
|
Christoph Purschke
|
Eswari Rajagopal
|
Yves Scherrer
|
Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.
pdf
bib
abs
UnibucKernel: Geolocating Swiss German Jodels Using Ensemble Learning
Mihaela Găman
|
Sebastian Cojocariu
|
Radu Tudor Ionescu
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
In this work, we describe our approach addressing the Social Media Variety Geolocation task featured in the 2021 VarDial Evaluation Campaign. We focus on the second subtask, which is based on a data set formed of approximately 30 thousand Swiss German Jodels. The dialect identification task is about accurately predicting the latitude and longitude of test samples. We frame the task as a double regression problem, employing an XGBoost meta-learner with the combined power of a variety of machine learning approaches to predict both latitude and longitude. The models included in our ensemble range from simple regression techniques, such as Support Vector Regression, to deep neural models, such as a hybrid neural network and a neural transformer. To minimize the prediction error, we approach the problem from a few different perspectives and consider various types of features, from low-level character n-grams to high-level BERT embeddings. The XGBoost ensemble resulted from combining the power of the aforementioned methods achieves a median distance of 23.6 km on the test data, which places us on the third place in the ranking, at a difference of 6.05 km and 2.9 km from the submissions on the first and second places, respectively.
2020
pdf
bib
abs
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman
|
Dirk Hovy
|
Radu Tudor Ionescu
|
Heidi Jauhiainen
|
Tommi Jauhiainen
|
Krister Lindén
|
Nikola Ljubešić
|
Niko Partanen
|
Christoph Purschke
|
Yves Scherrer
|
Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.
pdf
bib
abs
Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets
Mihaela Gaman
|
Radu Tudor Ionescu
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
In this work, we introduce the methods proposed by the UnibucKernel team in solving the Social Media Variety Geolocation task featured in the 2020 VarDial Evaluation Campaign. We address only the second subtask, which targets a data set composed of nearly 30 thousand Swiss German Jodels. The dialect identification task is about accurately predicting the latitude and longitude of test samples. We frame the task as a double regression problem, employing a variety of machine learning approaches to predict both latitude and longitude. From simple models for regression, such as Support Vector Regression, to deep neural networks, such as Long Short-Term Memory networks and character-level convolutional neural networks, and, finally, to ensemble models based on meta-learners, such as XGBoost, our interest is focused on approaching the problem from a few different perspectives, in an attempt to minimize the prediction error. With the same goal in mind, we also considered many types of features, from high-level features, such as BERT embeddings, to low-level features, such as characters n-grams, which are known to provide good results in dialect identification. Our empirical results indicate that the handcrafted model based on string kernels outperforms the deep learning approaches. Nevertheless, our best performance is given by the ensemble model that combines both handcrafted and deep learning models.
2019
pdf
bib
abs
Automatically Identifying Complaints in Social Media
Daniel Preoţiuc-Pietro
|
Mihaela Gaman
|
Nikolaos Aletras
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Complaining is a basic speech act regularly used in human and computer mediated communication to express a negative mismatch between reality and expectations in a particular situation. Automatically identifying complaints in social media is of utmost importance for organizations or brands to improve the customer experience or in developing dialogue systems for handling and responding to complaints. In this paper, we introduce the first systematic analysis of complaints in computational linguistics. We collect a new annotated data set of written complaints expressed on Twitter. We present an extensive linguistic analysis of complaining as a speech act in social media and train strong feature-based and neural models of complaints across nine domains achieving a predictive performance of up to 79 F1 using distant supervision.