Grigorios Tsoumakas

2024

Topic-Controllable Summarization: Topic-Aware Evaluation and Transformer Methods
Tatiana Passali | Grigorios Tsoumakas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. For example, the majority of existing methods are built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, and they also require modifications to the model’s architecture for controlling the topic. At the same time, there is currently no established evaluation metric designed specifically for topic-controllable summarization. This work proposes a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. The reliability of the proposed measure is demonstrated through appropriately designed human evaluation. In addition, we adapt topic embeddings to work with powerful Transformer architectures and propose a novel and efficient approach for guiding the summary generation through control tokens. Experimental results reveal that control tokens can achieve better performance compared to more complicated embedding-based approaches while also being significantly faster.

Plain Language Summarization of Clinical Trials
Polydoros Giannouris | Theodoros Myridis | Tatiana Passali | Grigorios Tsoumakas
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

Plain language summarization, or lay summarization, is an emerging natural language processing task, aiming to make scientific articles accessible to an audience of non-scientific backgrounds. The healthcare domain can greatly benefit from applications of automatic plain language summarization, as results that concern a large portion of the population are reported in large documents with complex terminology. However, existing corpora for this task are limited in scope, usually covering conference or journal article abstracts. In this paper, we introduce the task of automated generation of plain language summaries for clinical trials, and construct CARES (Clinical Abstractive Result Extraction and Simplification), the first corresponding dataset. CARES consists of publicly available, human-written summaries of clinical trials conducted by Pfizer. Source text is identified from documents released throughout the life-cycle of the trial, and steps are taken to remove noise and select the appropriate sections. Experiments show that state-of-the-art models achieve satisfactory results in most evaluation metrics.

2022

Should We Trust This Summary? Bayesian Abstractive Summarization to The Rescue
Alexios Gidiotis | Grigorios Tsoumakas
Findings of the Association for Computational Linguistics: ACL 2022

We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference we are able to effectively quantify uncertainty at prediction time. Having a reliable uncertainty measure, we can improve the experience of the end user by filtering out generated summaries of high uncertainty. Furthermore, uncertainty estimation could be used as a criterion for selecting samples for annotation, and can be paired nicely with active learning and human-in-the-loop approaches. Finally, Bayesian inference enables us to find a Bayesian summary which performs better than a deterministic one and is more robust to uncertainty. In practice, we show that our Variational Bayesian equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets.
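The Monte Carlo dropout idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy linear scorer rather than BART or PEGASUS, and the names `stochastic_forward` and `mc_dropout_predict`, the weights, and the dropout rate are all hypothetical. The point is only the mechanism: keep dropout active at prediction time, run several stochastic forward passes, and read the spread of the outputs as an uncertainty estimate.

```python
import random
import statistics


def stochastic_forward(weights, features, drop_prob, rng):
    """One forward pass of a toy linear scorer with dropout left active."""
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for w, x in zip(weights, features):
        # Drop each input independently and rescale the survivors
        # (inverted dropout), so the expected output is unchanged.
        if rng.random() < keep_prob:
            total += w * x / keep_prob
    return total


def mc_dropout_predict(weights, features, drop_prob=0.1, passes=200, seed=0):
    """Return the mean and standard deviation over several stochastic passes."""
    rng = random.Random(seed)
    samples = [
        stochastic_forward(weights, features, drop_prob, rng)
        for _ in range(passes)
    ]
    return statistics.mean(samples), statistics.pstdev(samples)


# Hypothetical toy inputs; a larger spread signals a less certain prediction.
mean, spread = mc_dropout_predict([0.5, -1.2, 2.0], [1.0, 0.3, 0.8])
print(f"mean={mean:.3f} spread={spread:.3f}")
```

In a summarization setting, the same spread could serve as the filtering criterion the abstract mentions, for example by discarding outputs whose spread exceeds a chosen threshold. The threshold itself would be an application-specific assumption.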

LARD: Large-scale Artificial Disfluency Generation
Tatiana Passali | Thanassis Mavropoulos | Grigorios Tsoumakas | Georgios Meditskos | Stefanos Vrochidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Disfluency detection is a critical task in real-time dialogue systems. However, despite its importance, it remains a relatively unexplored field, mainly due to the lack of appropriate datasets. At the same time, existing datasets suffer from various issues, including class imbalance, which can significantly affect the performance of the model on rare classes, as demonstrated in this paper. To this end, we propose LARD, a method for generating complex and realistic artificial disfluencies with little effort. The proposed method can handle three of the most common types of disfluencies: repetitions, replacements, and restarts. In addition, we release a new large-scale dataset with disfluencies that can be used on four different tasks: disfluency detection, classification, extraction, and correction. Experimental results on the LARD dataset demonstrate that the data produced by the proposed method can be effectively used for detecting and removing disfluencies, while also addressing limitations of existing datasets.
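As a rough illustration of what an artificial repetition might look like, the sketch below duplicates a random span of a fluent token sequence and labels the first copy as disfluent. This is a toy under stated assumptions, not the LARD algorithm: the function name `add_repetition`, the `span` parameter, and the 0/1 labeling scheme are hypothetical, and replacements and restarts are not covered.

```python
import random


def add_repetition(tokens, span=1, seed=None):
    """Insert a repetition disfluency into a fluent token list.

    Returns the disfluent tokens and parallel labels, where 1 marks the
    first copy of the repeated span (the part a correction step would
    remove) and 0 marks fluent tokens.
    """
    if span < 1 or span > len(tokens):
        raise ValueError("span must be between 1 and len(tokens)")
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span + 1)
    repeated = tokens[start:start + span]
    # Keep everything up to the end of the span, then say the span again.
    disfluent = tokens[:start + span] + repeated + tokens[start + span:]
    labels = [0] * start + [1] * span + [0] * (len(tokens) - start)
    return disfluent, labels


fluent = "i would like to book a flight".split()
disfluent, labels = add_repetition(fluent, span=2, seed=3)
print(list(zip(disfluent, labels)))
```

Dropping every token labeled 1 recovers the original fluent sequence, which is what makes pairs like this usable for both detection and correction.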

2021

The concept of nation in nineteenth-century Greek fiction through computational literary analysis
Fotini Koidaki | Despina Christou | Katerina Tiktopoulou | Grigorios Tsoumakas
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

How may the construction of national consciousness be captured in the literary production of a whole century? What can the macro-analysis of 19th-century prose fiction reveal about the formation of the concept of the nation-state of Greece? How could the concept of nationality be detected in literary writing and then interpreted? These are the questions addressed by the research presented in this paper, which focuses on exploring how the concept of the nation is figured and shaped in 19th-century Greek prose fiction. This paper proposes a methodological approach that combines well-known text mining techniques with computational close reading methods in order to retrieve the nation-related passages and to analyze them linguistically and semantically. The main objective of the paper is to map the frequency and the phraseology of the nation-related references, as well as to explore the phrase patterns in relation to the topic modeling results.

Keyphrase Extraction from Scientific Articles via Extractive Summarization
Chrysovalantis Giorgos Kontoulis | Eirini Papagiannopoulou | Grigorios Tsoumakas
Proceedings of the Second Workshop on Scholarly Document Processing

Automatically extracting keyphrases from scholarly documents leads to a valuable concise representation that humans can understand and machines can process for tasks such as information retrieval, article clustering and article classification. This paper is concerned with the parts of a scientific article that should be given as input to keyphrase extraction methods. Recent deep learning methods take titles and abstracts as input due to the increased computational complexity in processing long sequences, whereas traditional approaches can also work with full-texts. Titles and abstracts are dense in keyphrases, but often miss important aspects of the articles, while full-texts, on the other hand, are richer in keyphrases but much noisier. To address this trade-off, we propose the use of extractive summarization models on the full-texts of scholarly documents. Our empirical study on 3 article collections using 3 keyphrase extraction methods shows promising results.

Towards Human-Centered Summarization: A Case Study on Financial News
Tatiana Passali | Alexios Gidiotis | Efstathios Chatzikyriakidis | Grigorios Tsoumakas
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Recent Deep Learning (DL) summarization models greatly outperform traditional summarization methodologies, generating high-quality summaries. Despite their success, there are still important open issues, such as the limited engagement and trust of users in the whole process. In order to overcome these issues, we reconsider the task of summarization from a human-centered perspective. We propose to integrate a user interface with an underlying DL model, instead of tackling summarization as a task isolated from the end user. We present a novel system, where the user can actively participate in the whole summarization process. We also enable the user to gather insights into the causative factors that drive the model’s behavior, exploiting the self-attention mechanism. We focus on the financial domain, in order to demonstrate the efficiency of generic DL models for domain-specific applications. Our work takes a first step towards a model-interface co-design approach, where DL models evolve along user needs, paving the way towards human-computer text summarization interfaces.

Keyword Extraction Using Unsupervised Learning on the Document’s Adjacency Matrix
Eirini Papagiannopoulou | Grigorios Tsoumakas | Apostolos Papadopoulos
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

This work revisits the information given by the graph-of-words and its typical utilization through graph-based ranking approaches in the context of keyword extraction. Recent, well-known graph-based approaches typically employ the knowledge from word vector representations during the ranking process via popular centrality measures (e.g., PageRank) without giving the primary role to vectors’ distribution. We consider the adjacency matrix that corresponds to the graph-of-words of a target text document as the vector representation of its vocabulary. We propose the distribution-based modeling of this adjacency matrix using unsupervised (learning) algorithms. The efficacy of the distribution-based modeling approaches compared to state-of-the-art graph-based methods is confirmed by an extensive experimental study according to the F1 score. Our code is available on GitHub.
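To make the central object of the abstract above concrete, the following minimal sketch builds the adjacency matrix of a graph-of-words from a token list, so that each row can be read as the vector representation of one vocabulary word. The co-occurrence window, the undirected weighting, and the function name `graph_of_words_adjacency` are assumptions for illustration; the abstract does not specify these details, and the unsupervised modeling step applied to the matrix is not shown.

```python
def graph_of_words_adjacency(tokens, window=1):
    """Build a symmetric co-occurrence adjacency matrix for a token list.

    Two words are linked each time one appears within `window` positions
    after the other. Row i is the vector for vocab[i].
    """
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other == word:
                continue  # skip self-loops
            a, b = index[word], index[other]
            matrix[a][b] += 1
            matrix[b][a] += 1
    return vocab, matrix


tokens = "keyword extraction with graph based keyword ranking".split()
vocab, matrix = graph_of_words_adjacency(tokens, window=2)
for word, row in zip(vocab, matrix):
    print(f"{word:<12}{row}")
```

Any unsupervised algorithm that operates on row vectors could then be applied to `matrix`; which algorithms the authors actually use is left to the paper.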

2020

AUTH @ CLSciSumm 20, LaySumm 20, LongSumm 20
Alexios Gidiotis | Stefanos Stefanidis | Grigorios Tsoumakas
Proceedings of the First Workshop on Scholarly Document Processing

We present the systems we submitted for the shared tasks of the Workshop on Scholarly Document Processing at EMNLP 2020. Our approaches to the tasks are focused on exploiting large Transformer models pre-trained on huge corpora and adapting them to the different shared tasks. For tasks 1A and 1B of CL-SciSumm we use different variants of the BERT model to tackle the tasks of “cited text span” and “facet” identification. For the summarization tasks, namely task 2 of CL-SciSumm, LaySumm and LongSumm, we make use of different variants of the PEGASUS model, with and without fine-tuning, adapted to the nuances of each one of those particular tasks.

2016
