Michalis Vazirgiannis


2021

BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
Moussa Kamal Eddine | Antoine Tixier | Michalis Vazirgiannis
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez’ corpus, and show our resulting model, mBARThez, to significantly boost BARThez’ generative performance.

BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets
Yanzhu Guo | Virgile Rennard | Christos Xypolopoulos | Michalis Vazirgiannis
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised from CamemBERT, a general-domain French language model that follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness identification task was created and annotated by our team, filling a gap in such analytic datasets for French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.

Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
Christos Xypolopoulos | Antoine Tixier | Michalis Vazirgiannis
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment.

JuriBERT: A Masked-Language Model Adaptation for French Legal Text
Stella Douka | Hadi Abdine | Michalis Vazirgiannis | Rajaa El Hamdani | David Restrepo Amariles
Proceedings of the Natural Legal Language Processing Workshop 2021

Language models have proven to be very useful when adapted to specific domains. Nonetheless, little research has been done on the adaptation of domain-specific BERT models in the French language. In this paper, we focus on creating a language model adapted to French legal text with the goal of helping law professionals. We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data. We explore the use of smaller architectures in domain-specific sub-languages and their benefits for French legal text. We prove that domain-specific pre-trained models can perform better than their equivalent generalised ones in the legal domain. Finally, we release JuriBERT, a new set of BERT models adapted to the French legal domain.

2020

An Ensemble Method for Producing Word Representations focusing on the Greek Language
Michalis Lioudakis | Stamatis Outsios | Michalis Vazirgiannis
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

In this paper we present a new ensemble method, Continuous Bag-of-Skip-grams (CBOS), that produces high-quality word representations putting emphasis on the Greek language. The CBOS method combines the pioneering approaches for learning word representations: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram. These methods are compared through intrinsic and extrinsic evaluation tasks on three different sources of data: the English Wikipedia corpus, the Greek Wikipedia corpus, and the Greek Web Content corpus. By comparing these methods across different tasks and datasets, it is evident that the CBOS method achieves state-of-the-art performance.

Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding
Guokan Shang | Antoine Tixier | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Abstractive community detection is an important spoken language understanding task, whose goal is to group utterances in a conversation according to whether they can be jointly summarized by a common abstractive sentence. This paper provides a novel approach to this task. We first introduce a neural contextual utterance encoder featuring three types of self-attention mechanisms. We then train it using the siamese and triplet energy-based meta-architectures. Experiments on the AMI corpus show that our system outperforms multiple energy-based and non-energy-based baselines from the state of the art. Code and data are publicly available.

Speaker-change Aware CRF for Dialogue Act Classification
Guokan Shang | Antoine Tixier | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 28th International Conference on Computational Linguistics

Recent work in Dialogue Act (DA) classification approaches the task as a sequence labeling problem, using neural network models coupled with a Conditional Random Field (CRF) as the last layer. CRF models the conditional probability of the target DA label sequence given the input utterance sequence. However, the task involves another important input sequence, that of speakers, which is ignored by previous work. To address this limitation, this paper proposes a simple modification of the CRF layer that takes speaker-change into account. Experiments on the SwDA corpus show that our modified CRF layer outperforms the original one, with very wide margins for some DA labels. Further, visualizations demonstrate that our CRF layer can learn meaningful, sophisticated transition patterns between DA label pairs conditioned on speaker-change in an end-to-end way. Code is publicly available.
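As a rough illustration of what conditioning CRF transitions on speaker change can look like, the sketch below scores one label sequence using two transition tables, one for turns where the speaker stays the same and one for turns where it changes. The two-table design and the names `path_score`, `trans_same`, and `trans_change` are assumptions made for this example, not details taken from the paper.

```python
def path_score(emissions, labels, speakers, trans_same, trans_change):
    """Unnormalised CRF score of one label sequence.

    emissions[t][y]  : score of label y for utterance t
    trans_same[a][b] : score of moving from label a to b, same speaker (assumed)
    trans_change[a][b]: score of moving from label a to b, new speaker (assumed)
    """
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        # Pick the transition table according to whether the speaker changed.
        changed = speakers[t] != speakers[t - 1]
        trans = trans_change if changed else trans_same
        score += trans[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


# Toy setup: label 0 = question, label 1 = answer; emissions are neutral.
emissions = [[0, 0], [0, 0]]
trans_same = [[0, -1], [0, 0]]    # same speaker answering their own question: penalised
trans_change = [[0, 2], [0, 0]]   # another speaker answering: rewarded

print(path_score(emissions, [0, 1], ["A", "B"], trans_same, trans_change))
print(path_score(emissions, [0, 1], ["A", "A"], trans_same, trans_change))
```

In a trained model both tables would be learned jointly with the encoder; here they are hand-set only to show how the same label pair can receive a different score depending on who speaks.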

Evaluation of Greek Word Embeddings
Stamatis Outsios | Christos Karatsalos | Konstantinos Skianis | Michalis Vazirgiannis
Proceedings of the 12th Language Resources and Evaluation Conference

Since word embeddings have been the most popular input for many NLP tasks, evaluating their quality is critical. Most research efforts are focusing on English word embeddings. This paper addresses the problem of training and evaluating such models for the Greek language. We present a new word analogy test set considering the original English Word2vec analogy test set and some specific linguistic aspects of the Greek language as well. Moreover, we create a Greek version of WordSim353 test collection for a basic evaluation of word similarities. Produced resources are available for download. We test seven word vector models and our evaluation shows that we are able to create meaningful representations. Last, we discover that the morphological complexity of the Greek language and polysemy can influence the quality of the resulting word embeddings.

2019

Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Dmitry Ustalov | Swapna Somasundaran | Peter Jansen | Goran Glavaš | Martin Riedl | Mihai Surdeanu | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

Scalable graph-based method for individual named entity identification
Sammy Khalife | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

In this paper, we consider the named entity linking (NEL) problem. We assume a set of queries, named entities, that have to be identified within a knowledge base. This knowledge base is represented by a text database paired with a semantic graph, endowed with a classification of entities (ontology). We present state-of-the-art methods in NEL, and propose a new method for individual identification requiring few annotated data samples. We demonstrate its scalability and performance over standard datasets, for several ontology configurations. Our approach is well-motivated for integration in real systems. Indeed, recent deep learning methods, despite their capacity to improve experimental precision, require extensive parameter tuning along with large volumes of annotated data.

2018

Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization
Guokan Shang | Wensi Ding | Zekun Zhang | Antoine Tixier | Polykarpos Meladianos | Michalis Vazirgiannis | Jean-Pierre Lorré
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a novel graph-based framework for abstractive meeting speech summarization that is fully unsupervised and does not rely on any annotations. Our work combines the strengths of multiple recent approaches while addressing their weaknesses. Moreover, we leverage recent advances in word embeddings and graph degeneracy applied to NLP to take exterior semantic knowledge into account, and to design custom diversity and informativeness measures. Experiments on the AMI and ICSI corpora show that our system improves on the state-of-the-art. Code and data are publicly available, and our system can be interactively tested.

Fusing Document, Collection and Label Graph-based Representations with Word Embeddings for Text Classification
Konstantinos Skianis | Fragkiskos Malliaros | Michalis Vazirgiannis
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Contrary to the traditional Bag-of-Words approach, we consider the Graph-of-Words (GoW) model in which each document is represented by a graph that encodes relationships between the different terms. Based on this formulation, the importance of a term is determined by weighting the corresponding node in the document, collection and label graphs, using node centrality criteria. We also introduce novel graph-based weighting schemes by enriching graphs with word-embedding similarities, in order to reward or penalize semantic relationships. Our methods produce more discriminative feature weights for text categorization, outperforming existing frequency-based criteria.
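To make the idea of centrality-based term weighting concrete, here is a minimal sketch that builds an undirected graph-of-words with a sliding window and weights each term by its degree centrality. The window size, the choice of degree centrality, and the function names are assumptions for illustration; the abstract does not say which centrality criteria or window settings the authors use.

```python
from collections import defaultdict


def graph_of_words(tokens, window=3):
    """Undirected co-occurrence graph: terms within `window` tokens are linked."""
    neighbours = defaultdict(set)
    for i, term in enumerate(tokens):
        neighbours[term]  # make sure every term gets a node, even if isolated
        for other in tokens[i + 1:i + window]:
            if other != term:
                neighbours[term].add(other)
                neighbours[other].add(term)
    return neighbours


def degree_centrality(neighbours):
    """Term weight = number of distinct neighbours / (number of nodes - 1)."""
    n = len(neighbours)
    if n < 2:
        return {term: 0.0 for term in neighbours}
    return {term: len(adj) / (n - 1) for term, adj in neighbours.items()}


tokens = "graph models encode term relationships".split()
weights = degree_centrality(graph_of_words(tokens, window=2))
print(weights)
```

Unlike a raw term frequency, the weight of a term here depends on how many distinct terms it co-occurs with, so a word repeated next to the same neighbour gains nothing.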

Orthogonal Matching Pursuit for Text Classification
Konstantinos Skianis | Nikolaos Tziortziotis | Michalis Vazirgiannis
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential. Although classic regularizers provide sparsity, they fail to return highly accurate models. On the contrary, state-of-the-art group-lasso regularizers provide better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, called Orthogonal Matching Pursuit, for the text classification task. We also extend standard group OMP by introducing overlapping Group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and very sparse models. Code and data are available online.
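For readers unfamiliar with Orthogonal Matching Pursuit, the sketch below shows the generic greedy loop on a least-squares problem: pick the feature most correlated with the current residual, refit on all selected features, and repeat. It is plain OMP for regression, not the classification setup or the overlapping group variant described in the abstract, and the helper names (`dot`, `solve`, `omp`) are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def omp(columns, y, k):
    """Select k feature columns greedily; return their indices and weights.

    Assumes the columns have comparable norms and that the selected
    columns are linearly independent (otherwise the refit is singular).
    """
    residual = list(y)
    selected, weights = [], []
    for _ in range(k):
        # Greedy step: feature with the largest absolute correlation.
        scores = {j: abs(dot(col, residual))
                  for j, col in enumerate(columns) if j not in selected}
        selected.append(max(scores, key=scores.get))
        # Orthogonal step: least-squares refit on every selected feature.
        gram = [[dot(columns[a], columns[b]) for b in selected] for a in selected]
        rhs = [dot(columns[a], y) for a in selected]
        weights = solve(gram, rhs)
        residual = [y[i] - sum(w * columns[j][i] for w, j in zip(weights, selected))
                    for i in range(len(y))]
    return selected, weights


columns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
y = [0.0, 3.0, 0.5, 0.0]
print(omp(columns, y, k=2))
```

The budget `k` directly bounds the number of non-zero weights, which is where the sparsity of the resulting model comes from.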

2017

Shortest-Path Graph Kernels for Document Similarity
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by using a modified shortest-path graph kernel. We evaluate our approach on two tasks and compare it against several baseline approaches using various performance metrics such as DET curves and macro-average F1-score. Experimental results on a range of datasets showed that our proposed approach outperforms traditional techniques and is capable of measuring more accurately the similarity between two documents.

Graph-based Text Representations: Boosting Text Mining, NLP and Information Retrieval with Graphs
Fragkiskos D. Malliaros | Michalis Vazirgiannis
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Graphs or networks have been widely used as modeling tools in Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Traditionally, the unigram bag-of-words representation is applied; that way, a document is represented as a multiset of its terms, disregarding dependencies between the terms. Although several variants and extensions of this modeling approach have been proposed (e.g., the n-gram model), the main weakness comes from the underlying term independence assumption. The order of the terms within a document is completely disregarded and any relationship between terms is not taken into account in the final task (e.g., text categorization). Nevertheless, as the heterogeneity of text collections is increasing (especially with respect to document length and vocabulary), the research community has started exploring different document representations aiming to capture more fine-grained contexts of co-occurrence between different terms, challenging the well-established unigram bag-of-words model. In this direction, graphs constitute a well-developed model that has been adopted for text representation. The goal of this tutorial is to offer a comprehensive presentation of recent methods that rely on graph-based text representations to deal with various tasks in NLP and IR. We will describe basic as well as novel graph theoretic concepts and we will examine how they can be applied in a wide range of text-related application domains. All the material associated with the tutorial will be available at: http://fragkiskosm.github.io/projects/graph_text_tutorial
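The term independence weakness described above is easy to see in a few lines. The sketch below is an illustration written for this listing, not tutorial material: it represents two sentences with opposite meanings as unigram multisets and, for contrast, as sets of directed adjacent-word edges. The function names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Unigram bag-of-words: a multiset of lowercased whitespace tokens."""
    return Counter(text.lower().split())


def directed_edges(text):
    """Directed edges between adjacent tokens, the simplest word graph."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


a = "the dog bit the man"
b = "the man bit the dog"

print(bag_of_words(a) == bag_of_words(b))      # True: word order is lost
print(directed_edges(a) == directed_edges(b))  # False: the edges differ
```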

Combining Graph Degeneracy and Submodularity for Unsupervised Extractive Summarization
Antoine Tixier | Polykarpos Meladianos | Michalis Vazirgiannis
Proceedings of the Workshop on New Frontiers in Summarization

We present a fully unsupervised, extractive text summarization system that leverages a submodularity framework introduced by past research. The framework allows summaries to be generated in a greedy way while preserving near-optimal performance guarantees. Our main contribution is the novel coverage reward term of the objective function optimized by the greedy algorithm. This component builds on the graph-of-words representation of text and the k-core decomposition algorithm to assign meaningful scores to words. We evaluate our approach on the AMI and ICSI meeting speech corpora, and on the DUC2001 news corpus. We reach state-of-the-art performance on all datasets. Results indicate that our method is particularly well-suited to the meeting domain.
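The k-core decomposition mentioned above can be sketched with the standard peeling procedure: repeatedly remove a node of minimum degree, and record the largest minimum degree seen so far as that node's core number. Using the core number directly as a word score is an assumption made here for illustration; the abstract does not specify how scores are derived from the decomposition, and the toy graph is hypothetical.

```python
def core_numbers(adjacency):
    """Core number of every node in an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    adj = {node: set(nbrs) for node, nbrs in adjacency.items()}  # working copy
    core, k = {}, 0
    while adj:
        # Peel off a node of minimum remaining degree.
        node = min(adj, key=lambda n: len(adj[n]))
        k = max(k, len(adj[node]))
        core[node] = k
        for nbr in adj[node]:
            adj[nbr].discard(node)
        del adj[node]
    return core


# Toy graph-of-words: a triangle of densely connected terms plus one
# peripheral term attached to it.
graph = {
    "budget": {"meeting", "design", "okay"},
    "meeting": {"budget", "design"},
    "design": {"budget", "meeting"},
    "okay": {"budget"},
}
print(core_numbers(graph))
```

Terms in the dense triangle end up with a higher core number than the peripheral term, even though "budget" and "okay" are directly connected, which is the kind of cohesion signal that plain degree does not capture.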

Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization
Giannis Nikolentzos | Polykarpos Meladianos | François Rousseau | Yannis Stavrakas | Michalis Vazirgiannis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Recently, there has been a lot of activity in learning distributed representations of words in vector spaces. Although there are models capable of learning high-quality distributed representations of words, how to generate vector representations of the same quality for phrases or documents still remains a challenge. In this paper, we propose to model each document as a multivariate Gaussian distribution based on the distributed representations of its words. We then measure the similarity between two documents based on the similarity of their distributions. Experiments on eight standard text categorization datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art methods.
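A minimal sketch of the idea, under stated assumptions: each document is summarised by the mean vector and covariance matrix of its word vectors, and two documents are compared by mixing the cosine similarity of their means with the cosine similarity of their flattened covariances. The mixing weight `alpha`, the specific similarity, and the toy two-dimensional vectors are assumptions for this example; the abstract only states that similarity is measured between the distributions.

```python
import math


def gaussian_of(vectors):
    """Mean vector and (maximum-likelihood) covariance matrix of word vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / n
            for j in range(d)] for i in range(d)]
    return mean, cov


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def flatten(matrix):
    return [x for row in matrix for x in row]


def similarity(doc_a, doc_b, alpha=0.5):
    """Mix of mean similarity and covariance similarity (alpha is assumed)."""
    mean_a, cov_a = gaussian_of(doc_a)
    mean_b, cov_b = gaussian_of(doc_b)
    return (alpha * cosine(mean_a, mean_b)
            + (1 - alpha) * cosine(flatten(cov_a), flatten(cov_b)))


# Hypothetical 2-d word vectors; real embeddings have hundreds of dimensions.
doc_1 = [[1.0, 0.0], [0.0, 1.0]]
doc_2 = [[1.0, 0.1], [0.1, 1.0], [0.5, 0.5]]
print(similarity(doc_1, doc_2))
```

Compared with averaging word vectors alone, the covariance term lets two documents with the same centroid but different spread be told apart.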

Real-Time Keyword Extraction from Conversations
Polykarpos Meladianos | Antoine Tixier | Ioannis Nikolentzos | Michalis Vazirgiannis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We introduce a novel method to extract keywords from meeting speech in real-time. Our approach builds on the graph-of-words representation of text and leverages the k-core decomposition algorithm and properties of submodular functions. We outperform multiple baselines in a real-time scenario emulated from the AMI and ICSI meeting corpora. Evaluation is conducted against both extractive and abstractive gold standard using two standard performance metrics and a newer one based on word embeddings.

2016

Regularizing Text Categorization with Clusters of Words
Konstantinos Skianis | François Rousseau | Michalis Vazirgiannis
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

A Graph Degeneracy-based Approach to Keyword Extraction
Antoine Tixier | Fragkiskos Malliaros | Michalis Vazirgiannis
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

GoWvis: A Web Application for Graph-of-Words-based Text Visualization and Summarization
Antoine Tixier | Konstantinos Skianis | Michalis Vazirgiannis
Proceedings of ACL-2016 System Demonstrations

2015

Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization
Jonghoon Kim | François Rousseau | Michalis Vazirgiannis
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Text Categorization as a Graph Classification Problem
François Rousseau | Emmanouil Kiagias | Michalis Vazirgiannis
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)