John Pavlopoulos - ACL Anthology

2025

GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek
Lefteris Loukas | Nikolaos Smyrnioudis | Chrysa Dikonomaki | Spiros Barbakos | Anastasios Toumazatos | John Koutsikakis | Manolis Kyriakakis | Mary Georgiou | Stavros Vassos | John Pavlopoulos | Ion Androutsopoulos
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek. The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging, morphological tagging, dependency parsing, named entity recognition, and Greeklish-to-Greek transliteration. The toolkit is based on pre-trained Transformers, is freely available, and can be easily installed in Python (pip install gr-nlp-toolkit). It is also accessible through a demonstration platform on HuggingFace, along with a publicly available API for non-commercial use. We discuss the functionality provided for each task, the underlying methods, experiments against comparable open-source toolkits, and possible future enhancements. The toolkit is available at: https://github.com/nlpaueb/gr-nlp-toolkit

Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey
Katerina Korre | Dimitris Tsirmpas | Nikos Gkoumas | Emma Cabalé | Danai Myrtzani | Theodoros Evgeniou | Ion Androutsopoulos | John Pavlopoulos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present a survey of methods for assessing and enhancing the quality of online discussions, focusing on the potential of Large Language Models (LLMs). While online discourses aim, at least in theory, to foster mutual understanding, they often devolve into harmful exchanges, such as hate speech, threatening social cohesion and democratic values. Recent advancements in LLMs enable artificial facilitation agents to not only moderate content, but also actively improve the quality of interactions. Our survey synthesizes ideas from Natural Language Processing (NLP) and Social Sciences to provide (a) a new taxonomy on discussion quality evaluation, (b) an overview of intervention and facilitation strategies, (c) a new taxonomy of conversation facilitation datasets, and (d) an LLM-oriented roadmap of good practices and future research directions, from technological and societal perspectives.

Dialect Normalization using Large Language Models and Morphological Rules
Antonios Dimakis | John Pavlopoulos | Antonios Anastasopoulos
Findings of the Association for Computational Linguistics: ACL 2025

Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.

FoodSafeSum: Enabling Natural Language Processing Applications for Food Safety Document Summarization and Analysis
Juli Bakagianni | Korbinian Randl | Guido Rocchietti | Cosimo Rulli | Franco Maria Nardini | Salvatore Trani | Aron Henriksson | Anna Romanova | John Pavlopoulos
Findings of the Association for Computational Linguistics: EMNLP 2025

Food safety demands timely detection, regulation, and public communication, yet the lack of structured datasets hinders Natural Language Processing (NLP) research. We present and release a new dataset of human-written and Large Language Model (LLM)-generated summaries of food safety documents, plus food safety related metadata. We evaluate its utility on three NLP tasks directly reflecting food safety practices: multilabel classification for organizing documents into domain-specific categories; document retrieval for accessing regulatory and scientific evidence; and question answering via retrieval-augmented generation that improves factual accuracy. We show that LLM summaries perform comparably to or better than human ones across tasks. We also demonstrate clustering of summaries for event tracking and compliance monitoring. This dataset enables NLP applications that support core food safety practices, including the organization of regulatory and scientific evidence, monitoring of compliance issues, and communication of risks to the public.

Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition
Panagiotis Kaliosis | John Pavlopoulos
Findings of the Association for Computational Linguistics: EMNLP 2025

Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at https://github.com/pkaliosis/fada.
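
The distribution-alignment idea described above can be illustrated with a minimal sketch. It assumes the alphabet is treated as an ordered support with unit spacing, so that the 1-Wasserstein distance reduces to the summed absolute difference of the two cumulative distributions. The function names and the toy alphabet are hypothetical, and this is not the differentiable loss from the paper, only the quantity such a loss would penalise.

```python
from collections import Counter
from itertools import accumulate


def char_distribution(text, alphabet):
    """Relative frequency of each alphabet symbol in `text` (all zeros if none occur)."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[ch] / total for ch in alphabet]


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two histograms on the same ordered support.

    Assumption: neighbouring symbols are one unit apart, so the distance is
    the sum of absolute differences between the cumulative distributions.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Toy example with a hypothetical three-symbol alphabet.
ALPHABET = "abc"
target = char_distribution("aabbcc", ALPHABET)     # stands in for training-data statistics
predicted = char_distribution("aaaabc", ALPHABET)  # stands in for a model transcription
penalty = wasserstein_1d(predicted, target)
```

In a guided decoding setting, a quantity like `penalty` could be used to rerank candidate transcriptions, with lower values indicating closer agreement with the expected character statistics.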

KostasThesis2025 at SemEval-2025 Task 10 Subtask 2: A Continual Learning Approach to Propaganda Analysis in Online News
Konstantinos Eleftheriou | Panos Louridas | John Pavlopoulos
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In response to the growing challenge of propagandistic presence in online news, the increasing need for automated systems that are able to identify and classify narrative structures in multiple languages is evident. We present our approach to the SemEval-2025 Task 10 Subtask 2, focusing on the challenge of hierarchical multi-label, multi-class classification in multilingual news articles. We present methods to handle long articles with respect to how they are naturally structured in the dataset, propose a hierarchical classification neural network model with respect to the taxonomy, and introduce a continual learning training approach that leverages cross-lingual knowledge transfer.

SemEval-2025 Task 9: The Food Hazard Detection Challenge
Korbinian Randl | John Pavlopoulos | Aron Henriksson | Tony Lindgren | Juli Bakagianni
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In this challenge, we explored text-based food hazard prediction with long-tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we are gradually releasing (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

2024

Polarized Opinion Detection Improves the Detection of Toxic Language
John Pavlopoulos | Aristidis Likas
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Distance from unimodality (DFU) has been found to correlate well with human judgment for the assessment of polarized opinions. However, its un-normalized nature makes it less intuitive and somewhat difficult to exploit in machine learning (e.g., as a supervised signal). In this work, a normalized version of this measure, called nDFU, is proposed that leads to better assessment of the degree of polarization. Then, we propose a methodology for K-class text classification, based on nDFU, that exploits polarized texts in the dataset. Such polarized instances are assigned to a separate K+1 class, so that a K+1-class classifier is trained. An empirical analysis on three datasets for abusive language detection shows that nDFU can be used to model polarized annotations and prevent them from harming the classification performance. Finally, we further exploit nDFU to specify conditions that could explain polarization given a dimension and present text examples that polarized the annotators when the dimension was gender and race. Our code is available at https://github.com/ipavlopoulos/ndfu.
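
A rough sketch of how the normalisation and the extra class might be computed. It assumes that DFU is the largest violation of monotonic decrease when moving away from the mode of the rating histogram, and that nDFU divides this value by the frequency of the mode. This reading, the function names, and the 0.5 threshold are assumptions made for illustration; the repository linked above is the authoritative implementation.

```python
def dfu(hist):
    """Distance from unimodality of a rating histogram (assumed definition)."""
    mode = max(range(len(hist)), key=lambda i: hist[i])
    deviations = [0.0]  # a unimodal histogram yields no positive deviation
    # To the right of the mode, frequencies should not increase.
    for i in range(mode + 1, len(hist)):
        deviations.append(hist[i] - hist[i - 1])
    # To the left of the mode, frequencies should not increase towards the edge.
    for i in range(mode):
        deviations.append(hist[i] - hist[i + 1])
    return max(deviations)


def ndfu(hist):
    """DFU divided by the peak frequency, so the result lies in [0, 1]."""
    peak = max(hist)
    return dfu(hist) / peak if peak > 0 else 0.0


def assign_class(label, hist, k, threshold=0.5):
    """Move polarized instances to an extra class with index k (hypothetical threshold)."""
    return k if ndfu(hist) > threshold else label
```

For example, a histogram such as `[5, 0, 0, 0, 5]`, where annotators split between the two extremes, scores 1.0, whereas a single-peaked histogram such as `[1, 3, 5, 3, 1]` scores 0.0.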

Towards a Greek Proverb Atlas: Computational Spatial Exploration and Attribution of Greek Proverbs
John Pavlopoulos | Panos Louridas | Panagiotis Filos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Proverbs carry wisdom transferred orally from generation to generation. Based on the place they were recorded, this study introduces a publicly-available and machine-actionable dataset of more than one hundred thousand Greek proverb variants. By quantifying the spatial distribution of proverbs, we show that the most widespread proverbs come from the mainland while the least widespread proverbs come primarily from the islands. By focusing on the least dispersed proverbs, we present the most frequent tokens per location and undertake a benchmark in geographical attribution, using text classification and regression (text geocoding). Our results show that this is a challenging task for which specific locations can be attributed more successfully compared to others. The potential of our resource and benchmark is showcased by two novel applications. First, we extracted terms moving the regression prediction toward the four cardinal directions. Second, we leveraged conformal prediction to attribute 3,676 unregistered proverbs with statistically rigorous predictions of locations each of these proverbs was possibly registered in.

A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning
Panagiotis Kaliosis | John Pavlopoulos | Foivos Charalampakos | Georgios Moschovis | Ion Androutsopoulos
Findings of the Association for Computational Linguistics: ACL 2024

CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification
Korbinian Randl | John Pavlopoulos | Aron Henriksson | Tony Lindgren
Findings of the Association for Computational Linguistics: ACL 2024

Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a TF-IDF representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.
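
As an illustration of the Conformal Prediction component mentioned above, the following is a minimal sketch of split conformal prediction for classification. It assumes the nonconformity score is one minus the base classifier's probability for the true class. The routing rule in `needs_llm`, which consults an LLM only when the prediction set contains more than one candidate, is a hypothetical simplification and not necessarily the paper's exact framework.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data for target error rate `alpha`."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: keep every class
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


def needs_llm(pred_set):
    """Hypothetical routing rule: only ambiguous inputs are passed on to an LLM."""
    return len(pred_set) > 1


# Toy calibration data: class probabilities and true labels.
cal_probs = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.75, 0.25, 0.0]]
cal_labels = [0, 1, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)

confident = prediction_set([0.75, 0.25, 0.0], qhat)
ambiguous = prediction_set([0.5, 0.5, 0.0], qhat)
```

Restricting the LLM prompt to the classes in `ambiguous`, and skipping the LLM entirely for inputs like `confident`, is one way such a scheme could reduce the number and length of prompts.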

Deciphering Emotional Landscapes in the Iliad: A Novel French-Annotated Dataset for Emotion Recognition
Davide Picca | John Pavlopoulos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

One of the most significant pieces of ancient Greek literature, the Iliad, is part of humanity’s collective cultural heritage. This work aims to provide the scientific community with an emotion-labeled dataset for classical literature and Western mythology in particular. To model the emotions of the poem, we use a multi-variate time series. We also evaluated the dataset by means of two methods. We compare the manual classification against a dictionary-based benchmark as well as employ a state-of-the-art deep learning masked language model that has been tuned using our data. Both evaluations return encouraging results (MSE and MAE Macro Avg 0.101 and 0.188 respectively) and highlight some interesting phenomena.

HoLM: Analyzing the Linguistic Unexpectedness in Homeric Poetry
John Pavlopoulos | Ryan Sandell | Maria Konstantinidou | Chiara Bozzone
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The authorship of the Homeric poems has been a matter of debate for centuries. Computational approaches such as language modeling exist that can aid experts in making crucial headway. We observe, however, that such work has, thus far, only been carried out at the level of lengthier excerpts, but not individual verses, the level at which most suspected interpolations occur. We address this weakness by presenting a corpus of Homeric verses, each complemented with a score quantifying linguistic unexpectedness based on Perplexity. We assess the nature of these scores by exploring their correlation with named entities, the frequency of character n-grams, and (inverse) word frequency, revealing robust correlations with the latter two. This apparent bias can be partly overcome by simply dividing scores for unexpectedness by the maximum term frequency per verse.
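
The scoring and normalisation steps can be sketched as follows. The sketch assumes that Perplexity is the exponential of the negative mean token log-probability, and that "maximum term frequency per verse" means the highest corpus-wide count among the words of the verse. Both the latter interpretation and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities assigned by a language model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def normalised_unexpectedness(score, verse_tokens, corpus_counts):
    """Divide a verse score by the largest corpus frequency among its words (assumed reading)."""
    max_tf = max(corpus_counts[token] for token in verse_tokens)
    return score / max_tf


# Toy corpus of two verses, already tokenised (placeholder tokens).
corpus = [["a", "b"], ["a", "c"]]
corpus_counts = Counter(token for verse in corpus for token in verse)

raw = perplexity([math.log(0.5)] * 4)  # a model assigning probability 0.5 to each token
adjusted = normalised_unexpectedness(raw, corpus[0], corpus_counts)
```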

Still All Greeklish to Me: Greeklish to Greek Transliteration
Anastasios Toumazatos | John Pavlopoulos | Ion Androutsopoulos | Stavros Vassos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Modern Greek is normally written in the Greek alphabet. In informal online messages, however, Greek is often written using characters available on Latin-character keyboards, a form known as Greeklish. Originally used to bypass the lack of support for the Greek alphabet in older computers, Greeklish is now also used to avoid switching languages on multilingual keyboards, hide spelling mistakes, or as a form of slang. There is no consensus mapping, hence the same Greek word can be written in numerous different ways in Greeklish. Even native Greek speakers may struggle to understand (or be annoyed by) Greeklish, which requires paying careful attention to context to decipher. Greeklish may also be a problem for NLP models trained on Greek datasets written in the Greek alphabet. Experimenting with a range of statistical and deep learning models on both artificial and real-life Greeklish data, we find that: (i) prompting large language models (e.g., GPT-4) performs impressively well with few- or even zero-shot training, outperforming several fine-tuned encoder-decoder models; however (ii) a twenty-year-old statistical Greeklish transliteration model is still very competitive; and (iii) the problem is still far from having been solved; (iv) nevertheless, downstream Greek NLP systems that need to cope with Greeklish, such as moderation classifiers, can benefit significantly even with the current imperfect transliteration systems. We make all our code, models, and data available and suggest future improvements, based on an analysis of our experimental results.

Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
John Pavlopoulos | Thea Sommerschield | Yannis Assael | Shai Gordin | Kyunghyun Cho | Marco Passarotti | Rachele Sprugnoli | Yudong Liu | Bin Li | Adam Anderson
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

Challenging Error Correction in Recognised Byzantine Greek
John Pavlopoulos | Vasiliki Kougia | Esteban Garces Arias | Paraskevi Platanou | Stepan Shabalin | Konstantina Liagkou | Emmanouil Papadatos | Holger Essler | Jean-Baptiste Camps | Franz Fischer
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

Automatic correction of errors in Handwritten Text Recognition (HTR) output poses persistent challenges yet to be fully resolved. In this study, we introduce a shared task aimed at addressing this challenge, which attracted 271 submissions, yielding only a handful of promising approaches. This paper presents the datasets, the most effective methods, and an experimental analysis in error-correcting HTRed manuscripts and papyri in Byzantine Greek, the language that followed Classical and preceded Modern Greek. By using recognised and transcribed data from seven centuries, the two best-performing methods are compared, one based on a neural encoder-decoder architecture and the other based on engineered linguistic rules. We show that the recognition error rate can be reduced by both, up to 2.5 points at the level of characters and up to 15 at the level of words, while also elucidating their respective strengths and weaknesses.

Exploring intertextuality across the Homeric poems through language models
Maria Konstantinidou | John Pavlopoulos | Elton Barker
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

Past research has statistically modelled the language of the Homeric poems, assessing the degree of surprisal for each verse through diverse metrics and resulting in the HoLM resource. In this study we utilise the HoLM resource to explore cross-poem affinity at the verse level, looking at Iliadic verses and passages that are less surprising to the Odyssean model than to the Iliadic one and vice versa. Using the same tool, we investigate verses that evoke greater surprise when assessed by a local model trained solely on their source book, compared to a global model trained on the entire source poem. Investigating further the distribution of such verses across the Homeric poems, we employ machine learning text classification to analyse cross-poem affinity quantitatively in selected books.

Leveraging LLMs for Translating and Classifying Mental Health Data
Konstantinos Skianis | A. Seza Doğruöz | John Pavlopoulos
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

Large language models (LLMs) are increasingly used in medical fields. In mental health support, the early identification of linguistic markers associated with mental health conditions can provide valuable support to mental health professionals, and reduce long waiting times for patients. Despite the benefits of LLMs for mental health support, there is limited research on their application in mental health systems for languages other than English. Our study addresses this gap by focusing on the detection of depression severity in Greek through user-generated posts which are automatically translated from English. Our results show that GPT3.5-turbo is not very successful in identifying the severity of depression in English, and it has a varying performance in Greek as well. Our study underscores the necessity for further research, especially in languages with fewer resources. Also, careful implementation is necessary to ensure that LLMs are used effectively in mental health platforms, and human supervision remains crucial to avoid misdiagnosis.

Estimating the Emotion of Disgust in Greek Parliament Records
Vanessa Lislevand | John Pavlopoulos | Panos Louridas | Konstantina Dritsa
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

We present an analysis of the sentiment in Greek political speech, by focusing on the most frequently occurring emotion in electoral data, the emotion of “disgust”. We show that emotion classification is generally tough, but high accuracy can be achieved for that particular emotion. Using our best-performing model to classify political records of the Greek Parliament Corpus from 1989 to 2020, we studied the points in time when this emotion was frequently occurring and we ranked the Greek political parties based on their estimated score. We then devised an algorithm to investigate the emotional context shift of words that describe specific conditions and that can be used to stigmatise. Given that early detection of such word usage is essential for policy-making, we report two words we found being increasingly used in a negative emotional context, and one that is likely to be carrying stigma, in the studied parliamentary records.

2023

Dating Greek Papyri with Text Regression
John Pavlopoulos | Maria Konstantinidou | Isabelle Marthot-Santaniello | Holger Essler | Asimina Paparigopoulou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dating Greek papyri accurately is crucial not only to edit their texts but also to understand numerous other aspects of ancient writing, document and book production and circulation, as well as various other aspects of administration, everyday life and intellectual history of antiquity. Although a substantial number of Greek papyri documents bear a date or other conclusive data as to their chronological placement, an even larger number can only be dated tentatively or in approximation, due to the lack of decisive evidence. By creating a dataset of 389 transcriptions of documentary Greek papyri, we train 389 regression models and we predict a date for the papyri with an average MAE of 54 years and an MSE of 1.17, outperforming image classifiers and other baselines. Last, we release date estimations for 159 manuscripts, for which only the upper limit is known.

Machine Learning for Ancient Languages: A Survey
Thea Sommerschield | Yannis Assael | John Pavlopoulos | Vanessa Stefanak | Andrew Senior | Chris Dyer | John Bodel | Jonathan Prag | Ion Androutsopoulos | Nando de Freitas
Computational Linguistics, Volume 49, Issue 3 - September 2023

Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.

Annotating Homeric Emotions by a Domain-Specific Language
Federico Boschetti | Laura Chilla | Maria Konstantinidou | John Pavlopoulos
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Detecting Erroneously Recognized Handwritten Byzantine Text
John Pavlopoulos | Vasiliki Kougia | Paraskevi Platanou | Holger Essler
Findings of the Association for Computational Linguistics: EMNLP 2023

Handwritten text recognition (HTR) yields textual output that contains errors, which are considerably more numerous than those in recognised printed (OCRed) text. Post-correcting methods can eliminate such errors but may also introduce errors. In this study, we investigate the issues arising from this reality in Byzantine Greek. We investigate the properties of the texts that lead post-correction systems to this adversarial behaviour and we experiment with text classification systems that learn to detect incorrect recognition output. A large masked language model, pre-trained in modern and fine-tuned in Byzantine Greek, achieves an Average Precision score of 95%. The score improves to 97% when using a model that is pre-trained in modern and then in ancient Greek, the two language forms Byzantine Greek combines elements from. A century-based analysis shows that the advantage of the classifier that is further pre-trained in ancient Greek concerns texts of older centuries. The application of this classifier before a neural post-corrector on HTRed text significantly reduced the post-correction mistakes.

JUAGE at SemEval-2023 Task 10: Parameter Efficient Classification
Jeffrey Sorensen | Katerina Korre | John Pavlopoulos | Katrin Tomanek | Nithum Thain | Lucas Dixon | Léo Laugier
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Using pre-trained language models to implement classifiers from small to modest amounts of training data is an area of active research. The ability of large language models to generalize from few-shot examples and to produce strong classifiers is extended using the engineering approach of parameter-efficient tuning. Using the Explainable Detection of Online Sexism (EDOS) training data and a small number of trainable weights to create a tuned prompt vector, a competitive model for this task was built, which was top-ranked in Subtask B.

Harmful Language Datasets: An Assessment of Robustness
Katerina Korre | John Pavlopoulos | Jeffrey Sorensen | Léo Laugier | Ion Androutsopoulos | Lucas Dixon | Alberto Barrón-Cedeño
The 7th Workshop on Online Abuse and Harms (WOAH)

The automated detection of harmful language has been of great importance for the online world, especially with the growing importance of social media and, consequently, polarisation. There are many open challenges to high quality detection of harmful text, from dataset creation to generalisable application, thus calling for more systematic studies. In this paper, we explore re-annotation as a means of examining the robustness of already existing labelled datasets, showing that, despite using alternative definitions, the inter-annotator agreement remains very inconsistent, highlighting the intrinsically subjective and variable nature of the task. In addition, we build automatic toxicity detectors using the existing datasets, with their original labels, and we evaluate them on our multi-definition and multi-source datasets. Surprisingly, while other studies show that hate speech detection models perform better on data that are derived from the same distribution as the training set, our analysis demonstrates this is not necessarily true.

2022

From the Detection of Toxic Spans in Online Discussions to the Analysis of Toxic-to-Civil Transfer
John Pavlopoulos | Leo Laugier | Alexandros Xenos | Jeffrey Sorensen | Ion Androutsopoulos
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study the task of toxic spans detection, which concerns the detection of the spans that make a text toxic, when detecting such spans is possible. We introduce a dataset for this task, ToxicSpans, which we release publicly. By experimenting with several methods, we show that sequence labeling models perform best, but methods that add generic rationale extraction mechanisms on top of classifiers trained to predict if a post is toxic or not are also surprisingly promising. Finally, we use ToxicSpans and systems trained on it, to provide further analysis of state-of-the-art toxic to non-toxic transfer systems, as well as of human performance on that latter task. Our work highlights challenges in finer toxicity detection and mitigation.

Enriching Grammatical Error Correction Resources for Modern Greek
Katerina Korre | John Pavlopoulos
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Grammatical Error Correction (GEC), a task of Natural Language Processing (NLP), is challenging for underrepresented languages. This issue is most prominent in languages other than English. This paper addresses the issue of data and system sparsity for GEC purposes in the modern Greek Language. Following the most popular current approaches in GEC, we develop and test an MT5 multilingual text-to-text transformer for Greek. To our knowledge this is the first attempt to create a fully-fledged GEC model for Greek. Our evaluation shows that our system reaches up to 52.63% F0.5 score on part of the Greek Native Corpus (GNC), which is 16% below the winning system of the BEA-19 shared task on English GEC. In addition, we provide an extended version of the Greek Learner Corpus (GLC), on which our model reaches up to 22.76% F0.5. Previous versions did not include corrections with the annotations, which hindered the potential development of efficient GEC systems. For that reason we provide a new set of corrections. This new dataset facilitates an exploration of the generalisation abilities and robustness of our system, given that the assessment is conducted on learner data while the training is on native data.

A Study of Distant Viewing of ukiyo-e prints
Konstantina Liagkou | John Pavlopoulos | Ewa Machotka
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper contributes to studying relationships between Japanese topography and places featured in early modern landscape prints, so-called ukiyo-e or ‘pictures of the floating world’. The printed inscriptions on these images feature diverse place-names, both man-made and natural formations. However, due to the corpus’s richness and diversity, the precise nature of artistic mediation of the depicted places remains little understood. In this paper, we explored a new analytical approach based on the macroanalysis of images facilitated by Natural Language Processing technologies. This paper presents a small dataset with inscriptions on prints that have been annotated by an art historian for included place-name entities. Our dataset is released for public use. By fine-tuning and applying a Japanese BERT-based Named Entity Recogniser, we provide a use-case of a macroanalysis of a visual dataset that is hosted by the digital database of the Art Research Center at the Ritsumeikan University, Kyoto. Our work studies the relationship between topography and its visual renderings in early modern Japanese ukiyo-e landscape prints, demonstrating how an art historian’s work can be improved with Natural Language Processing toward distant viewing of visual datasets. We release our dataset and code for public use: https://github.com/connalia/ukiyo-e_meisho_nlp

Handwritten Paleographic Greek Text Recognition: A Century-Based Approach
Paraskevi Platanou | John Pavlopoulos | Georgios Papaioannou
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Today classicists are provided with a great number of digital tools which, in turn, offer possibilities for further study and new research goals. In this paper we explore the idea that old Greek handwriting can be machine-readable and consequently, researchers can study the target material fast and efficiently. Previous studies have shown that Handwritten Text Recognition (HTR) models are capable of attaining high accuracy rates. However, achieving high accuracy HTR results for Greek manuscripts is still considered to be a major challenge. The overall aim of this paper is to assess HTR for old Greek manuscripts. To address this statement, we study and use digitized images of the Oxford University Bodleian Library Greek manuscripts. By manually transcribing 77 images, we created and present here a new dataset for Handwritten Paleographic Greek Text Recognition. The dataset instances were organized by establishing as a leading factor the century to which the manuscript and hence the image belongs. Experimenting then with an HTR model we show that the error rate depends on the century of the image.

Sentiment Analysis of Homeric Text: The 1st Book of Iliad
John Pavlopoulos | Alexandros Xenos | Davide Picca
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Sentiment analysis studies are focused more on online customer reviews or social media, and less on literary studies. The problem is greater for ancient languages, where the linguistic expression of sentiments may diverge from modern linguistic forms. This work presents the outcome of a sentiment annotation task of the first Book of Iliad, an ancient Greek poem. The annotators were provided with verses translated into modern Greek and they annotated the perceived emotions and sentiments verse by verse. By estimating the fraction of annotators that found a verse as belonging to a specific sentiment class, we model the poem’s perceived sentiment as a multi-variate time series. By experimenting with a state of the art deep learning masked language model, pre-trained on modern Greek and fine-tuned to estimate the sentiment of our data, we registered a mean squared error of 0.063. This low error indicates that sentiment estimators built on our dataset can potentially be used as mechanical annotators, hence facilitating the distant reading of Homeric text. Our dataset is released for public use.

2021

Civil Rephrases Of Toxic Texts With Self-Supervised Transformers
Léo Laugier | John Pavlopoulos | Jeffrey Sorensen | Lucas Dixon
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Platforms that support online commentary, from social networks to news sites, are increasingly leveraging machine learning to assist their moderation efforts. But this process does not typically provide feedback to the author that would help them contribute according to the community guidelines. This is prohibitively time-consuming for human moderators to do, and computational approaches are still nascent. This work focuses on models that can help suggest rephrasings of toxic comments in a more civil manner. Inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called CAE-T5. CAE-T5 employs a pre-trained text-to-text transformer, which is fine tuned with a denoising and cyclic auto-encoder loss. Experimenting with the largest toxicity detection dataset to date (Civil Comments) our model generates sentences that are more fluent and better at preserving the initial content compared to earlier text style transfer systems which we compare with using several scoring systems and human evaluation.

ELERRANT: Automatic Grammatical Error Type Classification for Greek
Katerina Korre | Marita Chatzipanagiotou | John Pavlopoulos
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we introduce the Greek version of the automatic annotation tool ERRANT (Bryant et al., 2017), which we named ELERRANT. ERRANT functions as a rule-based error type classifier and was used as the main evaluation tool of the systems participating in the BEA-2019 (Bryant et al., 2019) shared task. Here, we discuss grammatical and morphological differences between English and Greek and how these differences affected the development of ELERRANT. We also introduce the first Greek Native Corpus (GNC) and the Greek WikiEdits Corpus (GWE), two new evaluation datasets with errors from native Greek learners and Wikipedia Talk Pages edits respectively. These two datasets are used for the evaluation of ELERRANT. This paper is a sole fragment of a bigger picture which illustrates the attempt to solve the problem of low-resource languages in NLP, in our case Greek.

SemEval-2021 Task 5: Toxic Spans Detection
John Pavlopoulos | Jeffrey Sorensen | Léo Laugier | Ion Androutsopoulos
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

The Toxic Spans Detection task of SemEval-2021 required participants to predict the spans of toxic posts that were responsible for the toxic label of the posts. The task could be addressed as supervised sequence labeling, using training data with gold toxic spans provided by the organisers. It could also be treated as rationale extraction, using classifiers trained on potentially larger external datasets of posts manually annotated as toxic or not, without toxic span annotations. For the supervised sequence labeling approach and evaluation purposes, posts previously labeled as toxic were crowd-annotated for toxic spans. Participants submitted their predicted spans for a held-out test set and were scored using character-based F1. This overview summarises the work of the 36 teams that provided system descriptions.
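The character-based F1 mentioned above can be sketched as follows. This is a minimal illustration rather than the organisers' scoring script, and it assumes each post's spans are given as collections of character offsets, with a post whose gold and predicted spans are both empty scoring 1. The names `char_f1` and `mean_char_f1` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def char_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between predicted and gold sets of toxic character offsets."""
    pred_set, gold_set = set(predicted), set(gold)
    if not gold_set:
        # Assumed convention: no gold span means only an empty prediction is right.
        return 1.0 if not pred_set else 0.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def mean_char_f1(pairs: Sequence[Tuple[Iterable[int], Iterable[int]]]) -> float:
    """Average the per-post scores over (predicted, gold) pairs."""
    if not pairs:
        return 0.0
    return sum(char_f1(pred, gold) for pred, gold in pairs) / len(pairs)
```

For example, `char_f1([0, 1, 2, 3], [2, 3, 4, 5])` gives 0.5, since two of the four predicted offsets are correct and two of the four gold offsets are found.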

Context Sensitivity Estimation in Toxicity Detection
Alexandros Xenos | John Pavlopoulos | Ion Androutsopoulos
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

User posts whose perceived toxicity depends on the conversational context are rare in current toxicity detection datasets. Hence, toxicity detectors trained on current datasets will also disregard context, making the detection of context-sensitive toxicity a lot harder when it occurs. We constructed and publicly release a dataset of 10k posts with two kinds of toxicity labels per post, obtained from annotators who considered (i) both the current post and the previous one as context, or (ii) only the current post. We introduce a new task, context-sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context (previous post) is also considered. Using the new dataset, we show that systems can be developed for this task. Such systems could be used to enhance toxicity detection datasets with more context-dependent posts or to suggest when moderators should consider the parent posts, which may not always be necessary and may introduce additional costs.

Multimodal or Text? Retrieval or BERT? Benchmarking Classifiers for the Shared Task on Hateful Memes
Vasiliki Kougia | John Pavlopoulos
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

The Shared Task on Hateful Memes is a challenge that aims at the detection of hateful content in memes by inviting the implementation of systems that understand memes, potentially by combining image and textual information. The challenge consists of three detection tasks: hate, protected category and attack type. The first is a binary classification task, while the other two are multi-label classification tasks. Our participation included a text-based BERT baseline (TxtBERT), the same but adding information from the image (ImgBERT), and neural retrieval approaches. We also experimented with retrieval augmented classification models. We found that an ensemble of TxtBERT and ImgBERT achieves the best performance in terms of ROC AUC score in two out of the three tasks on our development set.

2020

Toxicity Detection: Does Context Really Matter?
John Pavlopoulos | Jeffrey Sorensen | Lucas Dixon | Nithum Thain | Ion Androutsopoulos
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Moderation is crucial to promoting healthy online discussions. Although several ‘toxicity’ detection datasets and models have been published, most of them ignore the context of the posts, implicitly assuming that comments may be judged independently. We investigate this assumption by focusing on two questions: (a) does context affect the human judgement, and (b) does conditioning on context improve performance of toxicity detection systems? We experiment with Wikipedia conversations, limiting the notion of context to the previous post in the thread and the discussion title. We find that context can both amplify or mitigate the perceived toxicity of posts. Moreover, a small but significant subset of manually labeled posts (5% in one of our experiments) end up having the opposite toxicity labels if the annotators are not provided with context. Surprisingly, we also find no evidence that context actually improves the performance of toxicity classifiers, having tried a range of classifiers and mechanisms to make them context aware. This points to the need for larger datasets of comments annotated in context. We make our code and data publicly available.

ERRANT: Assessing and Improving Grammatical Error Type Classification
Katerina Korre | John Pavlopoulos
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Grammatical Error Correction (GEC) is the task of correcting different types of errors in written texts. To manage this task, large amounts of annotated data that contain erroneous sentences are required. This data, however, is usually annotated according to each annotator’s standards, making it difficult to manage multiple sets of data at the same time. The recently introduced Error Annotation Toolkit (ERRANT) tackled this problem by presenting a way to automatically annotate data that contain grammatical errors, while also providing a standardisation for annotation. ERRANT extracts the errors and classifies them into error types, in the form of an edit that can be used in the creation of GEC systems, as well as for grammatical error analysis. However, we observe that certain errors are falsely or ambiguously classified. This could obstruct any qualitative or quantitative grammatical error type analysis, as the results would be inaccurate. In this work, we use a sample of the FCE corpus (Yannakoudakis et al., 2011) for secondary error type annotation and we show that up to 39% of the annotations of the most frequent type should be re-classified. Our corrections will be publicly released, so that they can serve as the starting point of a broader, collaborative, ongoing correction process.

2019

ConvAI at SemEval-2019 Task 6: Offensive Language Identification and Categorization with Perspective and BERT
John Pavlopoulos | Nithum Thain | Lucas Dixon | Ion Androutsopoulos
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents the application of two strong baseline systems for toxicity detection and evaluates their performance in identifying and categorizing offensive language in social media. PERSPECTIVE is an API that serves multiple machine learning models for the improvement of conversations online, as well as a toxicity detection system, trained on a wide variety of comments from platforms across the Internet. BERT is a recently popular language representation model, fine tuned per task and achieving state of the art performance in multiple NLP tasks. PERSPECTIVE performed better than BERT in detecting toxicity, but BERT was much better in categorizing the offensive type. Both baselines were ranked surprisingly high in the SEMEVAL-2019 OFFENSEVAL competition, PERSPECTIVE in detecting an offensive post (12th) and BERT in categorizing it (11th). The main contribution of this paper is the assessment of two strong baselines for the identification (PERSPECTIVE) and the categorization (BERT) of offensive language with little or no additional training data.

A Survey on Biomedical Image Captioning
John Pavlopoulos | Vasiliki Kougia | Ion Androutsopoulos
Proceedings of the Second Workshop on Shortcomings in Vision and Language

Image captioning applied to biomedical images can assist and accelerate the diagnosis process followed by clinicians. This article is the first survey of biomedical image captioning, discussing datasets, evaluation measures, and state of the art methods. Additionally, we suggest two baselines, a weak and a stronger one; the latter outperforms all current state of the art systems on one of the datasets.

2017

Deeper Attention to Abusive User Content Moderation
John Pavlopoulos | Prodromos Malakasiotis | Ion Androutsopoulos
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Experimenting with a new dataset of 1.6M user comments from a news portal and an existing dataset of 115K Wikipedia talk page comments, we show that an RNN operating on word embeddings outperforms the previous state of the art in moderation, which used logistic regression or an MLP classifier with character or word n-grams. We also compare against a CNN operating on word embeddings, and a word-list baseline. A novel, deep, classification-specific attention mechanism improves the performance of the RNN further, and can also highlight suspicious words for free, without including highlighted words in the training data. We consider both fully automatic and semi-automatic moderation.

Deep Learning for User Comment Moderation
John Pavlopoulos | Prodromos Malakasiotis | Ion Androutsopoulos
Proceedings of the First Workshop on Abusive Language Online

Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism improves further the overall performance of the RNN. We also compare against a CNN and a word-list baseline, considering both fully automatic and semi-automatic moderation.

Improved Abusive Comment Moderation with User Embeddings
John Pavlopoulos | Prodromos Malakasiotis | Juli Bakagianni | Ion Androutsopoulos
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Experimenting with a dataset of approximately 1.6M user comments from a Greek news sports portal, we explore how a state of the art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observe improvements in all cases, with user embeddings leading to the biggest performance gains.

2016

aueb.twitter.sentiment at SemEval-2016 Task 4: A Weighted Ensemble of SVMs for Twitter Sentiment Analysis
Stavros Giorgis | Apostolos Rousas | John Pavlopoulos | Prodromos Malakasiotis | Ion Androutsopoulos
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

AUEB-ABSA at SemEval-2016 Task 5: Ensembles of Classifiers and Embeddings for Aspect Based Sentiment Analysis
Dionysios Xenos | Panagiotis Theodorakakos | John Pavlopoulos | Prodromos Malakasiotis | Ion Androutsopoulos
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2014

Multi-Granular Aspect Aggregation in Aspect-Based Sentiment Analysis
John Pavlopoulos | Ion Androutsopoulos
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

A Vague Sense Classifier for Detecting Vague Definitions in Ontologies
Panos Alexopoulos | John Pavlopoulos
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

SemEval-2014 Task 4: Aspect Based Sentiment Analysis
Maria Pontiki | Dimitris Galanis | John Pavlopoulos | Harris Papageorgiou | Ion Androutsopoulos | Suresh Manandhar
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

AUEB: Two Stage Sentiment Analysis of Social Network Messages
Rafael Michael Karampatsis | John Pavlopoulos | Prodromos Malakasiotis
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Aspect Term Extraction for Sentiment Analysis: New Datasets, New Evaluation Measures and an Improved Unsupervised Method
John Pavlopoulos | Ion Androutsopoulos
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)

2013

nlp.cs.aueb.gr: Two Stage Sentiment Analysis
Prodromos Malakasiotis | Rafael Michael Karampatsis | Konstantina Makrynioti | John Pavlopoulos
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Co-authors

Maria Konstantinidou 4

Vasiliki Kougia 4

Juli Bakagianni 3

Holger Essler 3

Aron Henriksson 3

Panos Louridas 3

Paraskevi Platanou 3

Korbinian Randl 3

Alexandros Xenos 3

Yannis Assael 2

Panagiotis Kaliosis 2

Rafael - Michael Karampatsis 2

Konstantina Liagkou 2

Tony Lindgren 2

Thea Sommerschield 2

Anastasios Toumazatos 2

Stavros Vassos 2

Panos Alexopoulos 1

Antonios Anastasopoulos 1

Adam Anderson 1

Spiros Barbakos 1

Alberto Barrón-Cedeño 1

Federico Boschetti 1

Chiara Bozzone 1

Jean-Baptiste Camps 1

Foivos Charalampakos 1

Marita Chatzipanagiotou 1

Kyunghyun Cho 1

Chrysa Dikonomaki 1

Antonios Dimakis 1

A. Seza Doğruöz 1

Konstantina Dritsa 1

Konstantinos Eleftheriou 1

Theodoros Evgeniou 1

Panagiotis Filos 1

Franz Fischer 1

Dimitrios Galanis 1

Esteban Garces Arias 1

Mary Georgiou 1

Stavros Giorgis 1

Nikos Gkoumas 1

John Koutsikakis 1

Manolis Kyriakakis 1

Aristidis Likas 1

Vanessa Lislevand 1

Lefteris Loukas 1

Konstantina Makrynioti 1

Suresh Manandhar 1

Isabelle Marthot-Santaniello 1

Georgios Moschovis 1

Danai Myrtzani 1

Franco Maria Nardini 1

Emmanouil Papadatos 1

Harris Papageorgiou 1

Georgios Papaioannou 1

Asimina Paparigopoulou 1

Marco Passarotti 1

Maria Pontiki 1

Jonathan Prag 1

Guido Rocchietti 1

Anna Romanova 1

Apostolos Rousas 1

Andrew Senior 1

Stepan Shabalin 1

Konstantinos Skianis 1

Nikolaos Smyrnioudis 1

Rachele Sprugnoli 1

Vanessa Stefanak 1

Panagiotis Theodorakakos 1

Katrin Tomanek 1

Salvatore Trani 1

Dimitris Tsirmpas 1

Dionysios Xenos 1

Nando de Freitas 1

Venues