Vivian Stamou


2024

The Corpus AIKIA: Using Ranking Annotation for Offensive Language Detection in Modern Greek
Stella Markantonatou | Vivian Stamou | Christina Christodoulou | Georgia Apostolopoulou | Antonis Balas | George Ioannakis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce a new corpus, named AIKIA, for Offensive Language Detection (OLD) in Modern Greek (EL). EL is a less-resourced language regarding OLD. AIKIA offers free access to annotated data drawn from EL Twitter and fiction texts using the lexicon of offensive terms, ERIS, that originates from HurtLex. AIKIA has been annotated for offensive values with the Best-Worst Scaling (BWS) method, which is designed to avoid problems of categorical and scalar annotation methods. BWS assigns continuous offensive scores in the form of floating point numbers instead of binary arithmetical or categorical values. AIKIA’s performance in OLD was tested by fine-tuning a variety of pre-trained language models in a binary classification task. Experimentation with a number of thresholds showed that the best mapping of the continuous values to binary labels occurs in the range [0.5-0.6] of BWS values and that the models pre-trained on EL data achieved the highest Macro-F1 scores. Greek-Media-BERT outperformed all models with a threshold of 0.6, obtaining a Macro-F1 score of 0.92.
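As a rough illustration of the BWS-to-binary mapping described in this abstract, the sketch below derives a continuous score per item with the common counting procedure (share of times chosen as best minus share of times chosen as worst) and then applies a threshold. The rescaling to [0, 1], the `>=` comparison, the function names, and the toy tweet identifiers are assumptions made for the example; the abstract does not specify how AIKIA's scores were computed.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Counting-based BWS scores, rescaled to [0, 1] (assumed convention).

    annotations: iterable of (items, best, worst), where `items` is the tuple
    shown to the annotator and `best`/`worst` are the items picked as most
    and least offensive.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    # Raw score (best - worst) / appearances lies in [-1, 1]; map it to [0, 1].
    return {
        item: ((best[item] - worst[item]) / n + 1) / 2
        for item, n in seen.items()
    }


def to_binary(scores, threshold=0.6):
    """Map continuous scores to 1 (offensive) or 0 (not offensive)."""
    return {item: int(score >= threshold) for item, score in scores.items()}


# Hypothetical toy annotations over four tweets.
annotations = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t4"), "t1", "t3"),
    (("t1", "t2", "t3", "t4"), "t2", "t4"),
]
scores = bws_scores(annotations)
labels = to_binary(scores, threshold=0.6)
```

Sweeping `threshold` over several values and re-training a classifier on each resulting label set is one way to run the kind of threshold experiment the abstract reports.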

2023

Methodological issues regarding the semi-automatic UD treebank creation of under-resourced languages: the case of Pomak
Stella Markantonatou | Nicolaos Th. Constantinides | Vivian Stamou | Vasileios Arampatzakis | Panagiotis G. Krimpas | George Pavlidis
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)

Pomak is an endangered oral Slavic language of Thrace/Greece. We present a short description of its interesting morphological and syntactic features in the UD framework. Because the morphological annotation of the treebank takes advantage of existing resources, it requires a different methodological approach from the one adopted for syntactic annotation, which started from scratch. It also requires the option of obtaining morphological predictions/evaluation separately from the syntactic ones with state-of-the-art NLP tools. Active annotation is applied in various settings in order to identify the best model that would facilitate the ongoing syntactic annotation.

2022

Morphologically annotated corpora of Pomak
Ritván Jusúf Karahóǧa | Panagiotis G. Krimpas | Vivian Stamou | Vasileios Arampatzakis | Dimitrios Karamatskos | Vasileios Sevetlidis | Nikolaos Constantinides | Nikolaos Kokkas | George Pavlidis | Stella Markantonatou
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

The project XXXX is developing a platform to enable researchers of living languages to easily create and make available state-of-the-art spoken and textual annotated resources. As a case study we use Greek and Pomak, the latter being an endangered oral Slavic language of the Balkans (including Thrace/Greece). The linguistic documentation of Pomak is ongoing work by an interdisciplinary team in close cooperation with the Pomak community of Greece. We describe our experience in the development of a Latin-based orthography and morphologically annotated text corpora of Pomak with state-of-the-art NLP technology. These resources will be made openly available on the XXXX site and the gold annotated corpora of Pomak will be made available on the Universal Dependencies treebank repository.

Cleansing & expanding the HURTLEX(el) with a multidimensional categorization of offensive words
Vivian Stamou | Iakovi Alexiou | Antigone Klimi | Eleftheria Molou | Alexandra Saivanidou | Stella Markantonatou
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

We present a cleansed version of the multilingual lexicon HURTLEX-(EL) comprising 737 offensive words of Modern Greek. We worked bottom-up in two annotation rounds and developed detailed guidelines by cross-classifying words on three dimensions: context, reference, and thematic domain. Our classification reveals a wider spectrum of thematic domains concerning the study of offensive language than previously thought (Efthymiou et al., 2014) and reveals social and cultural aspects that are not included in the HURTLEX categories.

2020

VMWE discovery: a comparative analysis between Literature and Twitter Corpora
Vivian Stamou | Artemis Xylogianni | Marilena Malli | Penny Takorou | Stella Markantonatou
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We manually evaluate five lexical association measures for the discovery of Modern Greek verb multiword expressions with two or more lexicalised components using mwetoolkit3 (Ramisch et al., 2010). We use Twitter corpora and compare our findings with previous work on fiction corpora. The results of LL, MLE and T-score were found to overlap significantly in both the fiction and the Twitter corpora, while the results of PMI and Dice do not. We find that MWEs with two lexicalised components are more frequent in Twitter than in fiction corpora and that lean syntactic patterns help retrieve them more efficiently than richer ones. Our work (i) supports the enrichment of the lexicographical database for Modern Greek MWEs 'IDION' (Markantonatou et al., 2019) and (ii) highlights aspects of the usage of five association measures on specific text genres for best MWE discovery results.
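For readers unfamiliar with the five measures named in this abstract, the sketch below computes them for a single two-word candidate using their textbook definitions. It is not mwetoolkit3's implementation: the function name, the parameter names, and the toy counts are assumptions made for the example, and the log-likelihood shown is the standard 2x2 contingency-table form.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Textbook association scores for a word pair (x, y).

    f_xy: co-occurrence count of the pair
    f_x, f_y: individual word counts
    n: total number of observed pairs in the corpus
    """
    expected = f_x * f_y / n  # expected co-occurrences under independence

    # Observed 2x2 contingency table and its row/column totals.
    observed = [
        [f_xy, f_x - f_xy],
        [f_y - f_xy, n - f_x - f_y + f_xy],
    ]
    rows = [sum(r) for r in observed]
    cols = [observed[0][j] + observed[1][j] for j in range(2)]

    # Log-likelihood ratio: 2 * sum(O * ln(O / E)); empty cells contribute 0.
    ll = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:
                ll += o * math.log(o / e)
    ll *= 2

    return {
        "mle": f_xy / n,
        "pmi": math.log2(f_xy / expected),
        "dice": 2 * f_xy / (f_x + f_y),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "ll": ll,
    }


# Hypothetical counts: the pair occurs 20 times in a corpus of 1000 pairs.
m = association_measures(f_xy=20, f_x=50, f_y=40, n=1000)
```

Ranking the same candidate list by each measure and comparing the top-ranked items is one simple way to observe the kind of overlap between LL, MLE and T-score, and the divergence of PMI and Dice, that the abstract reports.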