Raluca-Andreea Gînga
2024
SciTechBaitRO: ClickBait Detection for Romanian Science and Technology News
Raluca-Andreea Gînga | Ana Sabina Uban
Proceedings of the Third Workshop on NLP for Positive Impact
In this paper, we introduce SciTechBaitRO, a new annotated corpus of clickbait news in a low-resource language, Romanian, and a rarely covered domain, science and technology news. To our knowledge, it is one of the first corpora of annotated clickbait texts for the Romanian language, the largest (almost 11,000 examples), and the first to focus on the sci-tech domain. We evaluate whether clickbait can be detected automatically through a series of data analysis and machine learning experiments with varied features and models, including a range of linguistic features, classical machine learning models, deep learning models, and pre-trained models. We compare the performance of models using different kinds of features and show that BERT models give the best results, with F1 scores of up to 89%. We additionally evaluate the models in a cross-domain setting on news from other categories (i.e., politics, sports, entertainment) and demonstrate their capacity to generalize by detecting out-of-domain clickbait news with high F1 scores.
2022
University of Bucharest Team at Semeval-2022 Task4: Detection and Classification of Patronizing and Condescending Language
Tudor Dumitrascu | Raluca-Andreea Gînga | Bogdan Dobre | Bogdan Radu Silviu Sielecki
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
This paper details our implementations for detecting Patronizing and Condescending Language in texts, as part of SemEval-2022 Task 4. We used a variety of methods, from simple machine learning algorithms applied to bag-of-words features all the way to BERT models, to solve both the binary classification and the multi-label multi-class classification.