Maxime Peyrard


2021

pdf bib
Better than Average: Paired Evaluation of NLP systems
Maxime Peyrard | Wei Zhao | Steffen Eger | Robert West
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instancelevel pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley–Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.

2020

pdf bib
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Steffen Eger | Yang Gao | Maxime Peyrard | Wei Zhao | Eduard Hovy
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

pdf bib
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Wei Zhao | Goran Glavaš | Maxime Peyrard | Yang Gao | Robert West | Steffen Eger
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish “translationese”, i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.

pdf bib
KLearn: Background Knowledge Inference from Summarization Data
Maxime Peyrard | Robert West
Findings of the Association for Computational Linguistics: EMNLP 2020

The goal of text summarization is to compress documents to the relevant information while excluding background information already known to the receiver. So far, summarization researchers have given considerably more attention to relevance than to background knowledge. In contrast, this work puts background knowledge in the foreground. Building on the realization that the choices made by human summarizers and annotators contain implicit information about their background knowledge, we develop and compare techniques for inferring background knowledge from summarization data. Based on this framework, we define summary scoring functions that explicitly model background knowledge, and show that these scoring functions fit human judgments significantly better than baselines. We illustrate some of the many potential applications of our framework. First, we provide insights into human information importance priors. Second, we demonstrate that averaging the background knowledge of multiple, potentially biased annotators or corpora greatly improves summaryscoring performance. Finally, we discuss potential applications of our framework beyond summarization.

2019

pdf bib
A Simple Theoretical Model of Importance for Summarization
Maxime Peyrard
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Research on summarization has mainly been driven by empirical approaches, crafting systems to perform well on standard datasets with the notion of information Importance remaining latent. We argue that establishing theoretical models of Importance will advance our understanding of the task and help to further improve summarization systems. To this end, we propose simple but rigorous definitions of several concepts that were previously used only intuitively in summarization: Redundancy, Relevance, and Informativeness. Importance arises as a single quantity naturally unifying these concepts. Additionally, we provide intuitions to interpret the proposed quantities and experiments to demonstrate the potential of the framework to inform and guide subsequent works.

pdf bib
Studying Summarization Evaluation Metrics in the Appropriate Scoring Range
Maxime Peyrard
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. It is problematic because metrics disagree yet we can’t decide which one to trust. This is a call for collecting human judgments for high-scoring summaries as this would resolve the debate over which metrics to trust. This would also be greatly beneficial to further improve summarization systems and metrics alike.

pdf bib
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao | Maxime Peyrard | Fei Liu | Yang Gao | Christian M. Meyer | Steffen Eger
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service.

2018

pdf bib
Objective Function Learning to Match Human Judgements for Optimization-Based Summarization
Maxime Peyrard | Iryna Gurevych
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Supervised summarization systems usually rely on supervision at the sentence or n-gram level provided by automatic metrics like ROUGE, which act as noisy proxies for human judgments. In this work, we learn a summary-level scoring function 𝜃 including human judgments as supervision and automatically generated data as regularization. We extract summaries with a genetic algorithm using 𝜃 as a fitness function. We observe strong and promising performances across datasets in both automatic and manual evaluation.

pdf bib
Live Blog Corpus for Summarization
Avinesh P.V.S. | Maxime Peyrard | Christian M. Meyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Supervised Learning of Automatic Pyramid for Optimization-Based Multi-Document Summarization
Maxime Peyrard | Judith Eckle-Kohler
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a new supervised framework that learns to estimate automatic Pyramid scores and uses them for optimization-based extractive multi-document summarization. For learning automatic Pyramid scores, we developed a method for automatic training data generation which is based on a genetic algorithm using automatic Pyramid as the fitness function. Our experimental evaluation shows that our new framework significantly outperforms strong baselines regarding automatic Pyramid, and that there is much room for improvement in comparison with the upper-bound for automatic Pyramid.

pdf bib
A Principled Framework for Evaluating Summarizers: Comparing Models of Summary Quality against Human Judgments
Maxime Peyrard | Judith Eckle-Kohler
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a new framework for evaluating extractive summarizers, which is based on a principled representation as optimization problem. We prove that every extractive summarizer can be decomposed into an objective function and an optimization technique. We perform a comparative analysis and evaluation of several objective functions embedded in well-known summarizers regarding their correlation with human judgments. Our comparison of these correlations across two datasets yields surprising insights into the role and performance of objective functions in the different summarizers.

pdf bib
LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test
Michael Bugert | Yevgeniy Puzikov | Andreas Rücklé | Judith Eckle-Kohler | Teresa Martin | Eugenio Martínez-Cámara | Daniil Sorokin | Maxime Peyrard | Iryna Gurevych
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

The Story Cloze test is a recent effort in providing a common test scenario for text understanding systems. As part of the LSDSem 2017 shared task, we present a system based on a deep learning architecture combined with a rich set of manually-crafted linguistic features. The system outperforms all known baselines for the task, suggesting that the chosen approach is promising. We additionally present two methods for generating further training data based on stories from the ROCStories corpus.

pdf bib
Learning to Score System Summaries for Better Content Selection Evaluation.
Maxime Peyrard | Teresa Botschen | Iryna Gurevych
Proceedings of the Workshop on New Frontiers in Summarization

The evaluation of summaries is a challenging but crucial task of the summarization field. In this work, we propose to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets like TAC-2008 and TAC-2009. Any existing automatic scoring metrics can be included as features, the model learns the combination exhibiting the best correlation with human judgments. The reliability of the new metric is tested in a further manual evaluation where we ask humans to evaluate summaries covering the whole scoring spectrum of the metric. We release the trained metric as an open-source tool.

2016

pdf bib
A General Optimization Framework for Multi-Document Summarization Using Genetic Algorithms and Swarm Intelligence
Maxime Peyrard | Judith Eckle-Kohler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Extracting summaries via integer linear programming and submodularity are popular and successful techniques in extractive multi-document summarization. However, many interesting optimization objectives are neither submodular nor factorizable into an integer linear program. We address this issue and present a general optimization framework where any function of input documents and a system summary can be plugged in. Our framework includes two kinds of summarizers – one based on genetic algorithms, the other using a swarm intelligence approach. In our experimental evaluation, we investigate the optimization of two information-theoretic summary evaluation metrics and find that our framework yields competitive results compared to several strong summarization baselines. Our comparative analysis of the genetic and swarm summarizers reveals interesting complementary properties.

pdf bib
The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach
Markus Zopf | Maxime Peyrard | Judith Eckle-Kohler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Research in multi-document summarization has focused on newswire corpora since the early beginnings. However, the newswire genre provides genre-specific features such as sentence position which are easy to exploit in summarization systems. Such easy to exploit genre-specific features are available in other genres as well. We therefore present the new hMDS corpus for multi-document summarization, which contains heterogeneous source documents from multiple text genres, as well as summaries with different lengths. For the construction of the corpus, we developed a novel construction approach which is suited to build large and heterogeneous summarization corpora with little effort. The method reverses the usual process of writing summaries for given source documents: it combines already available summaries with appropriate source documents. In a detailed analysis, we show that our new corpus is significantly different from the homogeneous corpora commonly used, and that it is heterogeneous along several dimensions. Our experimental evaluation using well-known state-of-the-art summarization systems shows that our corpus poses new challenges in the field of multi-document summarization. Last but not least, we make our corpus publicly available to the research community at the corpus web page https://github.com/AIPHES/hMDS.

pdf bib
Optimizing an Approximation of ROUGE - a Problem-Reduction Approach to Extractive Multi-Document Summarization
Maxime Peyrard | Judith Eckle-Kohler
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)