Thomas Scialom


2022

pdf bib
Choisir le bon co-équipier pour la génération coopérative de texte (Choosing The Right Teammate For Cooperative Text Generation)
Antoine Chaffin | Vincent Claveau | Ewa Kijak | Sylvain Lamprier | Benjamin Piwowarski | Thomas Scialom | Jacopo Staiano
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Les modèles de langue génèrent des textes en prédisant successivement des distributions de probabilité pour les prochains tokens en fonction des tokens précédents. Pour générer des textes avec des propriétés souhaitées (par ex. être plus naturels, non toxiques ou avoir un style d’écriture spécifique), une solution — le décodage coopératif — consiste à utiliser un classifieur lors de la génération pour guider l’échantillonnage de la distribution du modèle de langue vers des textes ayant la propriété attendue. Dans cet article, nous examinons trois familles de discriminateurs (basés sur des transformers) pour cette tâche de décodage coopératif : les discriminateurs bidirectionnels, unidirectionnels (de gauche à droite) et génératifs. Nous évaluons leurs avantages et inconvénients, en explorant leur précision respective sur des tâches de classification, ainsi que leur impact sur la génération coopérative et leur coût de calcul, dans le cadre d’une stratégie de décodage état de l’art, basée sur une recherche arborescente de Monte-Carlo (MCTS). Nous fournissons également l’implémentation (batchée) utilisée pour nos expériences.

pdf bib
TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear.In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better methods.

pdf bib
TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear.In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.

2021

pdf bib
QuestEval: Summarization Asks for Fact-based Evaluation
Thomas Scialom | Paul-Alexis Dray | Sylvain Lamprier | Benjamin Piwowarski | Jacopo Staiano | Alex Wang | Patrick Gallinari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Summarization evaluation remains an open research problem: current metrics such as ROUGE are known to be limited and to correlate poorly with human judgments. To alleviate this issue, recent work has proposed evaluation metrics which rely on question answering models to assess whether a summary contains all the relevant information in its source document. Though promising, the proposed approaches have so far failed to correlate better than ROUGE with human judgments. In this paper, we extend previous approaches and propose a unified framework, named QuestEval. In contrast to established metrics such as ROUGE or BERTScore, QuestEval does not require any ground-truth reference. Nonetheless, QuestEval substantially improves the correlation with human judgments over four evaluation dimensions (consistency, coherence, fluency, and relevance), as shown in extensive experiments.

pdf bib
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
Arij Riabi | Thomas Scialom | Rachel Keraron | Benoît Sagot | Djamé Seddah | Jacopo Staiano
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performances of state-of-the-art multilingual models are significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve the Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method allows to significantly outperform the baselines trained on English data only. We report a new state-of-the-art on four datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

pdf bib
Data-QuestEval: A Referenceless Metric for Data-to-Text Semantic Evaluation
Clement Rebuffel | Thomas Scialom | Laure Soulier | Benjamin Piwowarski | Sylvain Lamprier | Jacopo Staiano | Geoffrey Scoutheeten | Patrick Gallinari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

QuestEval is a reference-less metric used in text-to-text tasks, that compares the generated summaries directly to the source text, by automatically asking and answering questions. Its adaptation to Data-to-Text tasks is not straightforward, as it requires multimodal Question Generation and Answering systems on the considered tasks, which are seldom available. To this purpose, we propose a method to build synthetic multimodal corpora enabling to train multimodal components for a data-QuestEval metric. The resulting metric is reference-less and multimodal; it obtains state-of-the-art correlations with human judgment on the WebNLG and WikiBio benchmarks. We make data-QuestEval’s code and models available for reproducibility purpose, as part of the QuestEval project.

pdf bib
Skim-Attention: Learning to Focus via Document Layout
Laura Nguyen | Thomas Scialom | Jacopo Staiano | Benjamin Piwowarski
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformer-based pre-training techniques of text and layout have proven effective in a number of document understanding tasks. Despite this success, multimodal pre-training models suffer from very high computational and memory costs. Motivated by human reading strategies, this paper presents Skim-Attention, a new attention mechanism that takes advantage of the structure of the document and its layout. Skim-Attention only attends to the 2-dimensional position of the words in a document. Our experiments show that Skim-Attention obtains a lower perplexity than prior works, while being more computationally efficient. Skim-Attention can be further combined with long-range Transformers to efficiently process long documents. We also show how Skim-Attention can be used off-the-shelf as a mask for any Pre-trained Language Model, allowing to improve their performance while restricting attention. Finally, we show the emergence of a document structure representation in Skim-Attention.

pdf bib
QACE: Asking Questions to Evaluate an Image Caption
Hwanhee Lee | Thomas Scialom | Seunghyun Yoon | Franck Dernoncourt | Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we propose QACE, a new metric based on Question Answering for Caption Evaluation to evaluate image captioning based on Question Generation(QG) and Question Answering(QA) systems. QACE generates questions on the evaluated caption and check its content by asking the questions on either the reference caption or the source image. We first develop QACE_Ref that compares the answers of the evaluated caption to its reference, and report competitive results with the state-of-the-art metrics. To go further, we propose QACE_Img, that asks the questions directly on the image, instead of reference. A Visual-QA system is necessary for QACE_Img. Unfortunately, the standard VQA models are actually framed a classification among only few thousands categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE_Img is multi-modal, reference-less and explainable. Our experiments show that QACE_Img compares favorably w.r.t. other reference-less metrics.

2020

pdf bib
Project PIAF: Building a Native French Question-Answering Dataset
Rachel Keraron | Guillaume Lancrenon | Mathilde Bras | Frédéric Allary | Gilles Moyse | Thomas Scialom | Edmundo-Pavel Soriano-Morales | Jacopo Staiano
Proceedings of the Twelfth Language Resources and Evaluation Conference

Motivated by the lack of data for non-English languages, in particular for the evaluation of downstream tasks such as Question Answering, we present a participatory effort to collect a native French Question Answering Dataset. Furthermore, we describe and publicly release the annotation tool developed for our collection effort, along with the data obtained and preliminary baselines.

pdf bib
Ask to Learn: A Study on Curiosity-driven Question Generation
Thomas Scialom | Jacopo Staiano
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel text generation task, namely Curiosity-driven Question Generation. We start from the observation that the Question Generation task has traditionally been considered as the dual problem of Question Answering, hence tackling the problem of generating a question given the text that contains its answer. Such questions can be used to evaluate machine reading comprehension. However, in real life, and especially in conversational settings, humans tend to ask questions with the goal of enriching their knowledge and/or clarifying aspects of previously gathered information. We refer to these inquisitive questions as Curiosity-driven: these questions are generated with the goal of obtaining new information (the answer) which is not present in the input text. In this work, we experiment on this new task using a conversational Question Answering (QA) dataset; further, since the majority of QA dataset are not built in a conversational manner, we describe a methodology to derive data for this novel task from non-conversational QA data. We investigate several automated metrics to measure the different properties of Curious Questions, and experiment different approaches on the Curiosity-driven Question Generation task, including model pre-training and reinforcement learning. Finally, we report a qualitative evaluation of the generated outputs.

pdf bib
Toward Stance-based Personas for Opinionated Dialogues
Thomas Scialom | Serra Sinem Tekiroğlu | Jacopo Staiano | Marco Guerini
Findings of the Association for Computational Linguistics: EMNLP 2020

In the context of chit-chat dialogues it has been shown that endowing systems with a persona profile is important to produce more coherent and meaningful conversations. Still, the representation of such personas has thus far been limited to a fact-based representation (e.g. “I have two cats.”). We argue that these representations remain superficial w.r.t. the complexity of human personality. In this work, we propose to make a step forward and investigate stance-based persona, trying to grasp more profound characteristics, such as opinions, values, and beliefs to drive language generation. To this end, we introduce a novel dataset allowing to explore different stance-based persona representations and their impact on claim generation, showing that they are able to grasp abstract and profound aspects of the author persona.

pdf bib
MLSUM: The Multilingual Summarization Corpus
Thomas Scialom | Paul-Alexis Dray | Sylvain Lamprier | Benjamin Piwowarski | Jacopo Staiano
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.

pdf bib
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom | Patrick Bordes | Paul-Alexis Dray | Jacopo Staiano | Patrick Gallinari
Proceedings of the 13th International Conference on Natural Language Generation

Pre-trained language models have recently contributed to significant advances in NLP tasks. Recently, multi-modal versions of BERT have been developed, using heavy pre-training relying on vast corpora of aligned textual and image data, primarily applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, by avoiding pre-training made on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, that enables to study the impact of each modality (as input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage on either mono- or multi- modal representations. The results reported under different configurations indicate an innate capacity for BERT-gen to adapt to multi-modal data and text generation, even with few data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state-of-the-art on two established VQG datasets.

2019

pdf bib
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
Thomas Scialom | Sylvain Lamprier | Benjamin Piwowarski | Jacopo Staiano
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL enables to consider complex, possibly non differentiable, metrics that globally assess the quality and relevance of the generated outputs. ROUGE, the most used summarization metric, is known to suffer from bias towards lexical similarity as well as from sub-optimal accounting for fluency and readability of the generated abstracts. We thus explore and propose alternative evaluation measures: the reported human-evaluation analysis shows that the proposed metrics, based on Question Answering, favorably compare to ROUGE – with the additional property of not requiring reference summaries. Training a RL-based model on these metrics leads to improvements (both in terms of human or automated metrics) over current approaches that use ROUGE as reward.

pdf bib
Self-Attention Architectures for Answer-Agnostic Neural Question Generation
Thomas Scialom | Benjamin Piwowarski | Jacopo Staiano
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Neural architectures based on self-attention, such as Transformers, recently attracted interest from the research community, and obtained significant improvements over the state of the art in several tasks. We explore how Transformers can be adapted to the task of Neural Question Generation without constraining the model to focus on a specific answer passage. We study the effect of several strategies to deal with out-of-vocabulary words such as copy mechanisms, placeholders, and contextual word embeddings. We report improvements obtained over the state-of-the-art on the SQuAD dataset according to automated metrics (BLEU, ROUGE), as well as qualitative human assessments of the system outputs.