Michalis Vazirgiannis - ACL Anthology

Michalis Vazirgiannis

2026

Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Yang Zhang | Amr Mohamed | Hadi Abdine | Guokan Shang | Michalis Vazirgiannis
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Curriculum learning—organizing training data from easy to hard—has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18-45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering—orthogonal to existing data selection methods—provides a practical mechanism for more efficient LLM pretraining.

2025

Bias in the Mirror : Are LLMs opinions robust to their own adversarial attacks
Virgile Rennard | Christos Xypolopoulos | Michalis Vazirgiannis
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.

LLM as a Broken Telephone: Iterative Generation Distorts Information
Amr Mohamed | Mingmeng Geng | Michalis Vazirgiannis | Guokan Shang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As large language models are increasingly responsible for online content, concerns arise about the impact of repeatedly processing their own outputs.Inspired by the “broken telephone” effect in chained human communication, this study investigates whether LLMs similarly distort information through iterative generation.Through translation-based experiments, we find that distortion accumulates over time, influenced by language choice and chain complexity. While degradation is inevitable, it can be mitigated through strategic prompting techniques. These findings contribute to discussions on the long-term effects of AI-mediated information propagation, raising important questions about the reliability of LLM-generated content in iterative workflows.

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Guokan Shang | Hadi Abdine | Yousef Khoubrane | Amr Mohamed | Yassine Abbahaddou | Sofiane Ennadir | Imane Momayiz | Xuguang Ren | Eric Moulines | Preslav Nakov | Michalis Vazirgiannis | Eric Xing
Proceedings of the First Workshop on Language Models for Low-Resource Languages

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.

Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
Guokan Shang | Hadi Abdine | Ahmad Chamma | Amr Mohamed | Mohamed Anwar | Abdelaziz Bounhar | Omar El Herraoui | Preslav Nakov | Michalis Vazirgiannis | Eric P. Xing
Proceedings of The Third Arabic Natural Language Processing Conference

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts, into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model delivers a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to a single language with dual-script usage, addressing an often overlooked aspect in contemporary LLM development.

2024

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Yanzhu Guo | Guokan Shang | Michalis Vazirgiannis | Chloé Clavel
Findings of the Association for Computational Linguistics: NAACL 2024

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.

GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Iakovos Evdaimon | Hadi Abdine | Christos Xypolopoulos | Stamatis Outsios | Michalis Vazirgiannis | Giorgos Stamou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Specifically, Natural Language Processing tasks have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.

2023

FREDSum: A Dialogue Summarization Corpus for French Political Debates
Virgile Rennard | Guokan Shang | Damien Grari | Julie Hunter | Michalis Vazirgiannis
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. While the majority of research has focused on written documents, we have observed an increasing interest in the summarization of dialogues and multi-party conversations over the past few years. In this paper, we present a dataset of French political debates for the purpose of enhancing resources for multi-lingual dialogue summarization. Our dataset consists of manually transcribed and annotated political debates, covering a range of topics and perspectives. We highlight the importance of high-quality transcription and annotations for training accurate and effective dialogue summarization models, and emphasize the need for multilingual resources to support dialogue summarization in non-English languages. We also provide baseline experiments using state-of-the-art methods, and encourage further research in this area to advance the field of dialogue summarization. Our dataset will be made publicly available for use by the research community, enabling further advances in multilingual dialogue summarization.

Automatic Analysis of Substantiation in Scientific Peer Reviews
Yanzhu Guo | Guokan Shang | Virgile Rennard | Michalis Vazirgiannis | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2023

With the increasing amount of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation — one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence — and provide a solution automatizing this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights on peer reviewing quality in NLP conferences over recent years. The dataset is available at https://github.com/YanzhuGuo/SubstanReview.

DATScore: Evaluating Translation with Data Augmented Translations
Moussa Kamal Eddine | Guokan Shang | Michalis Vazirgiannis
Findings of the Association for Computational Linguistics: EACL 2023

The rapid development of large pretrained language models has revolutionized not only the field of Natural Language Generation (NLG) but also its evaluation. Inspired by the recent work of BARTScore: a metric leveraging the BART language model to evaluate the quality of generated text from various aspects, we introduce DATScore. DATScore uses data augmentation techniques to improve the evaluation of machine translation. Our main finding is that introducing data augmented translations of the source and reference texts is greatly helpful in evaluating the quality of the generated translation. We also propose two novel score averaging and term weighting strategies to improve the original score computing process of BARTScore. Experimental results on WMT show that DATScore correlates better with human meta-evaluations than the other recent state-of-the-art metrics, especially for low-resource languages. Ablation studies demonstrate the value added by our new scoring strategies. Moreover, we report in our extended experiments the performance of DATScore on 3 NLG tasks other than translation.

Abstractive Meeting Summarization: A Survey
Virgile Rennard | Guokan Shang | Julie Hunter | Michalis Vazirgiannis
Transactions of the Association for Computational Linguistics, Volume 11

A system that could reliably identify and sum up the most important points of a conversation would be valuable in a wide variety of real-world contexts, from business meetings to medical consultations to customer service calls. Recent advances in deep learning, and especially the invention of encoder-decoder architectures, has significantly improved language generation systems, opening the door to improved forms of abstractive summarization—a form of summarization particularly well-suited for multi-party conversation. In this paper, we provide an overview of the challenges raised by the task of abstractive meeting summarization and of the data sets, models, and evaluation metrics that have been used to tackle the problems.

2022

Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency
Yanzhu Guo | Chloé Clavel | Moussa Kamal Eddine | Michalis Vazirgiannis
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The topic of summarization evaluation has recently attracted a surge of attention due to the rapid development of abstractive summarization systems. However, the formulation of the task is rather ambiguous, neither the linguistic nor the natural language processing communities have succeeded in giving a mutually agreed-upon definition. Due to this lack of well-defined formulation, a large number of popular abstractive summarization datasets are constructed in a manner that neither guarantees validity nor meets one of the most essential criteria of summarization: factual consistency. In this paper, we address this issue by combining state-of-the-art factual consistency models to identify the problematic instances present in popular summarization datasets. We release SummFC, a filtered summarization dataset with improved factual consistency, and demonstrate that models trained on this dataset achieve improved performance in nearly all quality aspects. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.

Political Communities on Twitter: Case Study of the 2022 French Presidential Election
Hadi Abdine | Yanzhu Guo | Virgile Rennard | Michalis Vazirgiannis
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

With the significant increase in users on social media platforms, a new means of political campaigning has appeared. Twitter and Facebook are now notable campaigning tools during elections. Indeed, the candidates and their parties now take to the internet to interact and spread their ideas. In this paper, we aim to identify political communities formed on Twitter during the 2022 French presidential election and analyze each respective community. We create a large-scale Twitter dataset containing 1.2 million users and 62.6 million tweets that mention keywords relevant to the election. We perform community detection on a retweet graph of users and propose an in-depth analysis of the stance of each community. Finally, we attempt to detect offensive tweets and automatic bots, comparing across communities in order to gain insight into each candidate’s supporter demographics and online campaign strategy.

AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization
Moussa Kamal Eddine | Nadi Tomeh | Nizar Habash | Joseph Le Roux | Michalis Vazirgiannis
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Like most natural language understanding and generation tasks, state-of-the-art models for summarization are transformer-based sequence-to-sequence architectures that are pretrained on large corpora. While most existing models focus on English, Arabic remains understudied. In this paper we propose AraBART, the first Arabic model in which the encoder and the decoder are pretrained end-to-end, based on BART. We show that AraBART achieves the best performance on multiple abstractive summarization datasets, outperforming strong baselines including a pretrained Arabic BERT-based model, multilingual BART, Arabic T5, and a multilingual T5 model. AraBART is publicly available.

FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation
Moussa Kamal Eddine | Guokan Shang | Antoine Tixier | Michalis Vazirgiannis
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more reliable, but require significant computational resources. In this paper, we propose FrugalScore, an approach to learn a fixed, low cost version of any expensive NLG metric, while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude less parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times less parameters than the original metrics. We make our trained metrics publicly available, to benefit the entire NLP community and in particular researchers and practitioners with limited resources.

2021

JuriBERT: A Masked-Language Model Adaptation for French Legal Text
Stella Douka | Hadi Abdine | Michalis Vazirgiannis | Rajaa El Hamdani | David Restrepo Amariles
Proceedings of the Natural Legal Language Processing Workshop 2021

Language models have proven to be very useful when adapted to specific domains. Nonetheless, little research has been done on the adaptation of domain-specific BERT models in the French language. In this paper, we focus on creating a language model adapted to French legal text with the goal of helping law professionals. We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data. We explore the use of smaller architectures in domain-specific sub-languages and their benefits for French legal text. We prove that domain-specific pre-trained models can perform better than their equivalent generalised ones in the legal domain. Finally, we release JuriBERT, a new set of BERT models adapted to the French legal domain.

Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
Christos Xypolopoulos | Antoine Tixier | Michalis Vazirgiannis
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available https://github.com/ksipos/polysemy-assessment .

BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
Moussa Kamal Eddine | Antoine Tixier | Michalis Vazirgiannis
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez’ corpus, and show our resulting model, mBARThez, to significantly boost BARThez’ generative performance.

BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets
Yanzhu Guo | Virgile Rennard | Christos Xypolopoulos | Michalis Vazirgiannis
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised using a general-domain French language model CamemBERT which follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks of offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated by our team, filling in the gap of such analytic datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.

2020

Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding
Guokan Shang | Antoine Tixier | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Abstractive community detection is an important spoken language understanding task, whose goal is to group utterances in a conversation according to whether they can be jointly summarized by a common abstractive sentence. This paper provides a novel approach to this task. We first introduce a neural contextual utterance encoder featuring three types of self-attention mechanisms. We then train it using the siamese and triplet energy-based meta-architectures. Experiments on the AMI corpus show that our system outperforms multiple energy-based and non-energy based baselines from the state-of-the-art. Code and data are publicly available.

An Ensemble Method for Producing Word Representations focusing on the Greek Language
Michalis Lioudakis | Stamatis Outsios | Michalis Vazirgiannis
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

In this paper we present a new ensemble method, Continuous Bag-of-Skip-grams (CBOS), that produces high-quality word representations putting emphasis on the Greek language. The CBOS method combines the pioneering approaches for learning word representations: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram. These methods are compared through intrinsic and extrinsic evaluation tasks on three different sources of data: the English Wikipedia corpus, the Greek Wikipedia corpus, and the Greek Web Content corpus. By comparing these methods across different tasks and datasets, it is evident that the CBOS method achieves state-of-the-art performance.

Speaker-change Aware CRF for Dialogue Act Classification
Guokan Shang | Antoine Tixier | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 28th International Conference on Computational Linguistics

Recent work in Dialogue Act (DA) classification approaches the task as a sequence labeling problem, using neural network models coupled with a Conditional Random Field (CRF) as the last layer. CRF models the conditional probability of the target DA label sequence given the input utterance sequence. However, the task involves another important input sequence, that of speakers, which is ignored by previous work. To address this limitation, this paper proposes a simple modification of the CRF layer that takes speaker-change into account. Experiments on the SwDA corpus show that our modified CRF layer outperforms the original one, with very wide margins for some DA labels. Further, visualizations demonstrate that our CRF layer can learn meaningful, sophisticated transition patterns between DA label pairs conditioned on speaker-change in an end-to-end way. Code is publicly available.

Evaluation of Greek Word Embeddings
Stamatis Outsios | Christos Karatsalos | Konstantinos Skianis | Michalis Vazirgiannis
Proceedings of the Twelfth Language Resources and Evaluation Conference

Since word embeddings have been the most popular input for many NLP tasks, evaluating their quality is critical. Most research efforts are focusing on English word embeddings. This paper addresses the problem of training and evaluating such models for the Greek language. We present a new word analogy test set considering the original English Word2vec analogy test set and some specific linguistic aspects of the Greek language as well. Moreover, we create a Greek version of WordSim353 test collection for a basic evaluation of word similarities. Produced resources are available for download. We test seven word vector models and our evaluation shows that we are able to create meaningful representations. Last, we discover that the morphological complexity of the Greek language and polysemy can influence the quality of the resulting word embeddings.

2019

Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Dmitry Ustalov | Swapna Somasundaran | Peter Jansen | Goran Glavaš | Martin Riedl | Mihai Surdeanu | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

Scalable graph-based method for individual named entity identification
Sammy Khalife | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

In this paper, we consider the named entity linking (NEL) problem. We assume a set of queries, named entities, that have to be identified within a knowledge base. This knowledge base is represented by a text database paired with a semantic graph, endowed with a classification of entities (ontology). We present state-of-the-art methods in NEL, and propose a new method for individual identification requiring few annotated data samples. We demonstrate its scalability and performance over standard datasets, for several ontology configurations. Our approach is well-motivated for integration in real systems. Indeed, recent deep learning methods, despite their capacity to improve experimental precision, require lots of parameter tuning along with large volume of annotated data.

2018

Orthogonal Matching Pursuit for Text Classification
Konstantinos Skianis | Nikolaos Tziortziotis | Michalis Vazirgiannis
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential. Although classic regularizers provide sparsity, they fail to return highly accurate models. On the contrary, state-of-the-art group-lasso regularizers provide better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, called Orthogonal Matching Pursuit, for the text classification task. We also extend standard group OMP by introducing overlapping Group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and very sparse models. Code and data are available online.

Fusing Document, Collection and Label Graph-based Representations with Word Embeddings for Text Classification
Konstantinos Skianis | Fragkiskos Malliaros | Michalis Vazirgiannis
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Contrary to the traditional Bag-of-Words approach, we consider the Graph-of-Words(GoW) model in which each document is represented by a graph that encodes relationships between the different terms. Based on this formulation, the importance of a term is determined by weighting the corresponding node in the document, collection and label graphs, using node centrality criteria. We also introduce novel graph-based weighting schemes by enriching graphs with word-embedding similarities, in order to reward or penalize semantic relationships. Our methods produce more discriminative feature weights for text categorization, outperforming existing frequency-based criteria.

Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization
Guokan Shang | Wensi Ding | Zekun Zhang | Antoine Tixier | Polykarpos Meladianos | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a novel graph-based framework for abstractive meeting speech summarization that is fully unsupervised and does not rely on any annotations. Our work combines the strengths of multiple recent approaches while addressing their weaknesses. Moreover, we leverage recent advances in word embeddings and graph degeneracy applied to NLP to take exterior semantic knowledge into account, and to design custom diversity and informativeness measures. Experiments on the AMI and ICSI corpus show that our system improves on the state-of-the-art. Code and data are publicly available, and our system can be interactively tested.

2017

Graph-based Text Representations: Boosting Text Mining, NLP and Information Retrieval with Graphs
Fragkiskos D. Malliaros | Michalis Vazirgiannis
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Graphs or networks have been widely used as modeling tools in Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Traditionally, the unigram bag-of-words representation is applied; that way, a document is represented as a multiset of its terms, disregarding dependencies between the terms. Although several variants and extensions of this modeling approach have been proposed (e.g., the n-gram model), the main weakness comes from the underlying term independence assumption. The order of the terms within a document is completely disregarded and any relationship between terms is not taken into account in the final task (e.g., text categorization). Nevertheless, as the heterogeneity of text collections is increasing (especially with respect to document length and vocabulary), the research community has started exploring different document representations aiming to capture more fine-grained contexts of co-occurrence between different terms, challenging the well-established unigram bag-of-words model. To this direction, graphs constitute a well-developed model that has been adopted for text representation. The goal of this tutorial is to offer a comprehensive presentation of recent methods that rely on graph-based text representations to deal with various tasks in NLP and IR. We will describe basic as well as novel graph theoretic concepts and we will examine how they can be applied in a wide range of text-related application domains.All the material associated to the tutorial will be available at: http://fragkiskosm.github.io/projects/graph_text_tutorial

Real-Time Keyword Extraction from Conversations
Polykarpos Meladianos | Antoine Tixier | Ioannis Nikolentzos | Michalis Vazirgiannis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We introduce a novel method to extract keywords from meeting speech in real-time. Our approach builds on the graph-of-words representation of text and leverages the k-core decomposition algorithm and properties of submodular functions. We outperform multiple baselines in a real-time scenario emulated from the AMI and ICSI meeting corpora. Evaluation is conducted against both extractive and abstractive gold standard using two standard performance metrics and a newer one based on word embeddings.

Shortest-Path Graph Kernels for Document Similarity
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by using a modified shortest-path graph kernel. We evaluate our approach on two tasks and compare it against several baseline approaches using various performance metrics such as DET curves and macro-average F1-score. Experimental results on a range of datasets showed that our proposed approach outperforms traditional techniques and is capable of measuring more accurately the similarity between two documents.

Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Recently, there has been a lot of activity in learning distributed representations of words in vector spaces. Although there are models capable of learning high-quality distributed representations of words, how to generate vector representations of the same quality for phrases or documents still remains a challenge. In this paper, we propose to model each document as a multivariate Gaussian distribution based on the distributed representations of its words. We then measure the similarity between two documents based on the similarity of their distributions. Experiments on eight standard text categorization datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art methods.

Combining Graph Degeneracy and Submodularity for Unsupervised Extractive Summarization
Antoine Tixier | Polykarpos Meladianos | Michalis Vazirgiannis
Proceedings of the Workshop on New Frontiers in Summarization

We present a fully unsupervised, extractive text summarization system that leverages a submodularity framework introduced by past research. The framework allows summaries to be generated in a greedy way while preserving near-optimal performance guarantees. Our main contribution is the novel coverage reward term of the objective function optimized by the greedy algorithm. This component builds on the graph-of-words representation of text and the k-core decomposition algorithm to assign meaningful scores to words. We evaluate our approach on the AMI and ICSI meeting speech corpora, and on the DUC2001 news corpus. We reach state-of-the-art performance on all datasets. Results indicate that our method is particularly well-suited to the meeting domain.

2016

A Graph Degeneracy-based Approach to Keyword Extraction
Antoine Tixier | Fragkiskos Malliaros | Michalis Vazirgiannis
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

GoWvis: A Web Application for Graph-of-Words-based Text Visualization and Summarization
Antoine Tixier | Konstantinos Skianis | Michalis Vazirgiannis
Proceedings of ACL-2016 System Demonstrations

Regularizing Text Categorization with Clusters of Words
Konstantinos Skianis | François Rousseau | Michalis Vazirgiannis
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization
Jonghoon Kim | François Rousseau | Michalis Vazirgiannis
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Text Categorization as a Graph Classification Problem
François Rousseau | Emmanouil Kiagias | Michalis Vazirgiannis
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Co-authors

Polykarpos Meladianos 5

François Rousseau 5

Konstantinos Skianis 5

Christos Xypolopoulos 4

Chloé Clavel 3

Jean-Pierre Lorré 3

Stamatis Outsios 3

Fragkiskos Malliaros 2

Preslav Nakov 2

Giannis Nikolentzos 2

Yannis Stavrakas 2

Yassine Abbahaddou 1

Mohamed Anwar 1

Abdelaziz Bounhar 1

Rajaa El Hamdani 1

Omar El Herraoui 1

Sofiane Ennadir 1

Iakovos Evdaimon 1

Mingmeng Geng 1

Goran Glavaš 1

Christos Karatsalos 1

Sammy Khalife 1

Yousef Khoubrane 1

Emmanouil Kiagias 1

Joseph Le Roux 1

Michalis Lioudakis 1

Fragkiskos D. Malliaros 1

Imane Momayiz 1

Eric Moulines 1

Ioannis Nikolentzos 1

David Restrepo Amariles 1

Swapna Somasundaran 1

Giorgos Stamou 1

Mihai Surdeanu 1

Nikolaos Tziortziotis 1

Dmitry Ustalov 1

Venues