Matthias Gallé


2024

pdf bib
LLMCrit: Teaching Large Language Models to Use Criteria
Weizhe Yuan | Pengfei Liu | Matthias Gallé
Findings of the Association for Computational Linguistics: ACL 2024

Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, current research in this area tends to consider only a limited number of criteria, or only a limited number of quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of adding criteria and demonstrations and provide valuable guidance on how to teach LLMs to use criteria more effectively.

pdf bib
On Leakage of Code Generation Evaluation Datasets
Alexandre Matton | Tom Sherborne | Dennis Aumiller | Elena Tommasone | Milad Alizadeh | Jingyi He | Raymond Ma | Maxime Voisin | Ellen Gilsenan-McMahon | Matthias Gallé
Findings of the Association for Computational Linguistics: EMNLP 2024

In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models.We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection.To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp

pdf bib
Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian | Chris Cremer | Matthias Gallé | Marzieh Fadaee | Julia Kreutzer | Olivier Pietquin | Ahmet Üstün | Sara Hooker
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been installed by the seminal literature as the standard method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit how alignment from human preferences is formulated in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed “RL-free” methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics allows benefiting from online RL optimization at low cost.

2022

pdf bib
Speeding Up Entmax
Maxat Tezekbayev | Vassilina Nikoulina | Matthias Gallé | Zhenisbek Assylbekov
Findings of the Association for Computational Linguistics: NAACL 2022

Softmax is the de facto standard for normalizing logits in modern neural networks for language processing. However, by producing a dense probability distribution each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. 𝛼-entmax of Peters et al. (2019) solves this problem, but is unfortunately slower than softmax. In this paper, we propose an alternative to 𝛼-entmax, which keeps its virtuous characteristics, but is as fast as optimized softmax and achieves on par or better performance in machine translation task.

pdf bib
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
Angela Fan | Suzana Ilic | Thomas Wolf | Matthias Gallé
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

2021

pdf bib
Self-Supervised and Controlled Multi-Document Opinion Summarization
Hady Elsahar | Maximin Coavoux | Jos Rozen | Matthias Gallé
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We address the problem of unsupervised abstractive summarization of collections of user generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on standard log-likelihood loss and mainstream models. We address the problem of hallucinations through the use of control codes, to steer the generation towards more coherent and relevant summaries.

pdf bib
Breaking Writer’s Block: Low-cost Fine-tuning of Natural Language Generation Models
Alexandre Duval | Thomas Lamson | Gaël de Léséleuc de Kérouara | Matthias Gallé
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

It is standard procedure these days to solve Information Extraction task by fine-tuning large pre-trained language models. This is not the case for generation task, which relies on a variety of techniques for controlled language generation. In this paper, we describe a system that fine-tunes a natural language generation model for the problem of solving writer’s block. The fine-tuning changes the conditioning to also include the right context in addition to the left context, as well as an optional list of entities, the size, the genre and a summary of the paragraph that the human author wishes to generate. Our proposed fine-tuning obtains excellent results, even with a small number of epochs and a total cost of USD 150. The system can be accessed as a web-service and all the code is released. A video showcasing the interface and the model is also available.

pdf bib
Findings of the WMT Shared Task on Machine Translation Using Terminologies
Md Mahfuz Ibn Alam | Ivana Kvapilíková | Antonios Anastasopoulos | Laurent Besacier | Georgiana Dinu | Marcello Federico | Matthias Gallé | Kweonwoo Jung | Philipp Koehn | Vassilina Nikoulina
Proceedings of the Sixth Conference on Machine Translation

Language domains that require very careful use of terminology are abundant and reflect a significant part of the translation industry. In this work we introduce a benchmark for evaluating the quality and consistency of terminology translation, focusing on the medical (and COVID-19 specifically) domain for five language pairs: English to French, Chinese, Russian, and Korean, as well as Czech to German. We report the descriptions and results of the participating systems, commenting on the need for further research efforts towards both more adequate handling of terminologies as well as towards a proper formulation and evaluation of the task.

pdf bib
Multilingual Unsupervised Neural Machine Translation with Denoising Adapters
Ahmet Üstün | Alexandre Berard | Laurent Besacier | Matthias Gallé
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data by using auxiliary parallel language pairs. For this problem the standard procedure so far to leverage the monolingual data is _back-translation_, which is computationally costly and hard to tune. In this paper we propose instead to use _denoising adapters_, adapter layers with a denoising objective, on top of pre-trained mBART-50. In addition to the modularity and flexibility of such an approach we show that the resulting translations are on-par with back-translating as measured by BLEU, and furthermore it allows adding unseen languages incrementally.

2020

pdf bib
A Multilingual Neural Machine Translation Model for Biomedical Data
Alexandre Bérard | Zae Myung Kim | Vassilina Nikoulina | Eunjeong Lucy Park | Matthias Gallé
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms the existing publicly released models. We believe that this release will help the large-scale multilingual analysis of the digital content of the COVID-19 crisis and of its effects on society, economy, and healthcare policies. We also release a test set of biomedical text for Korean-English. It consists of 758 sentences from official guidelines and recent papers, all about COVID-19.

pdf bib
Monolingual Adapters for Zero-Shot Neural Machine Translation
Jerin Philip | Alexandre Berard | Matthias Gallé | Laurent Besacier
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a novel adapter layer formalism for adapting multilingual models. They are more parameter-efficient than existing adapter layers while obtaining as good or better performance. The layers are specific to one language (as opposed to bilingual adapters) allowing to compose them and generalize to unseen language-pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language multilingual Transformer baseline trained on TED talks.

2019

pdf bib
Investigating the Effectiveness of BPE: The Power of Shorter Sequences
Matthias Gallé
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary size show that - given a fixed vocabulary size budget - the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).

pdf bib
To Annotate or Not? Predicting Performance Drop under Domain Shift
Hady Elsahar | Matthias Gallé
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Performance drop due to domain-shift is an endemic problem for NLP models in production. This problem creates an urge to continuously annotate evaluation datasets to measure the expected drop in the model performance which can be prohibitively expensive and slow. In this paper, we study the problem of predicting the performance drop of modern NLP models under domain-shift, in the absence of any target domain labels. We investigate three families of methods (-divergence, reverse classification accuracy and confidence measures), show how they can be used to predict the performance drop and study their robustness to adversarial domain-shifts. Our results on sentiment classification and sequence labelling show that our method is able to predict performance drops with an error rate as low as 2.15% and 0.89% for sentiment analysis and POS tagging respectively.

pdf bib
Joint Semantic and Distributional Word Representations with Multi-Graph Embeddings
Pierre Daix-Moreux | Matthias Gallé
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

Word embeddings continue to be of great use for NLP researchers and practitioners due to their training speed and easiness of use and distribution. Prior work has shown that the representation of those words can be improved by the use of semantic knowledge-bases. In this paper we propose a novel way of combining those knowledge-bases while the lexical information of co-occurrences of words remains. It is conceptually clear, as it consists in mapping both distributional and semantic information into a multi-graph and modifying existing node embeddings techniques to compute word representations. Our experiments show improved results compared to vanilla word embeddings, retrofitting and concatenation techniques using the same information, on a variety of data-sets of word similarities.

pdf bib
Unsupervised Aspect-Based Multi-Document Abstractive Summarization
Maximin Coavoux | Hady Elsahar | Matthias Gallé
Proceedings of the 2nd Workshop on New Frontiers in Summarization

User-generated reviews of products or services provide valuable information to customers. However, it is often impossible to read each of the potentially thousands of reviews: it would therefore save valuable time to provide short summaries of their contents. We address opinion summarization, a multi-document summarization task, with an unsupervised abstractive summarization neural system. Our system is based on (i) a language model that is meant to encode reviews to a vector space, and to generate fluent sentences from the same vector space (ii) a clustering step that groups together reviews about the same aspects and allows the system to generate summary sentences focused on these aspects. Our experiments on the Oposum dataset empirically show the importance of the clustering step.

pdf bib
Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges
Jason Eisner | Matthias Gallé | Jeffrey Heinz | Ariadna Quattoni | Guillaume Rabusseau
Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges