Patrick Gallinari - ACL Anthology

Patrick Gallinari

2025

MEXMA: Token-level objectives improve sentence representations
João Maria Janeiro | Benjamin Piwowarski | Patrick Gallinari | Loic Barrault
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-lingual sentence encoders (CLSE) create fixed-size sentence representations with aligned translations. Current pre-trained CLSE approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and *all tokens directly update the encoder*. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bitext mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.

Mixture of Languages: Improved Multilingual Encoders Through Language Grouping
João Maria Janeiro | Belen Alastruey | Francisco Massa | Maha Elbayad | Benjamin Piwowarski | Patrick Gallinari | Loic Barrault
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We propose Mixture of Languages (MoL), a new strategy to pretrain largely multilingual encoders. Recent work in this field has relied on training transformer encoders on a large amount of multilingual data, with all parameters shared across all languages, without studying how to optimally balance language transfer and interference to achieve better performance. To address this, MoL proposes to group languages based on their similarity, and add parallel, sparsely activated layers that process each group independently. This architecture allows MoL to boost language transfer while minimizing interference, without increasing the active parameter count. We show that MoL largely outperforms a dense counterpart trained with the same configuration, as well as MoE models and public multilingual encoders such as XLM-R or mBERT on downstream tasks.

Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts
Zineddine Tighidet | Andrea Mogini | Hedi Ben younes | Jiali Mei | Patrick Gallinari | Benjamin Piwowarski
Findings of the Association for Computational Linguistics: EMNLP 2025

The behavior of Large Language Models (LLMs) when facing contextual information that conflicts with their internal parametric knowledge is inconsistent, with no generally accepted explanation for the expected outcome distribution. Recent work has identified in autoregressive transformer models a class of neurons – called entropy neurons – that produce a significant effect on the model output entropy while having an overall moderate impact on the ranking of the predicted tokens. In this paper, we investigate the preliminary claim that these neurons are involved in inhibiting context copying behavior in transformers by looking at their role in resolving conflicts between contextual and parametric information. We show that entropy neurons are responsible for suppressing context copying across a range of LLMs, and that ablating them leads to a significant change in the generation process. These results enhance our understanding of the internal dynamics of LLMs when handling conflicting information.

SCOPE : un cadre d’entrainement auto-supervisé pour améliorer la fidélité dans la génération conditionnelle de texte
Song Duong | Florian Le Bronnec | Alexandre Allauzen | Vincent Guigue | Alberto Lumbreras | Laure Soulier | Patrick Gallinari
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Les modèles de langage (LLM) produisent souvent des hallucinations lors de la génération conditionnelle de texte, introduisant des informations non fidèles ou non ancrées dans le contexte. Ce phénomène est particulièrement problématique en résumé automatique et en génération texte-à-partirde-données, où les sorties doivent refléter précisément l’entrée. Nous proposons SCOPE, une méthode auto-supervisée innovante générant automatiquement des exemples non fidèles plausibles pour affiner les modèles par apprentissage par préférences. SCOPE pousse ainsi les modèles à préférer les sorties fidèles. Nous évaluons notre approche sur divers jeux de données de génération texte-à-partirde-données et de résumé. Simple à implémenter, notre méthode nettement les alternatives existantes selon des métriques automatiques et des évaluations humaines ainsi qu’avec GPT-4.

Sondage des Modèles de Langue sur leur Source de Connaissance
Zineddine Tighidet | Andrea Mogini | Jiali Mei | Patrick Gallinari | Benjamin Piwowarski
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Les grands modèles de langue (GML) sont souvent confrontés à des conflits entre leurs connaissance interne (connaissance paramétrique, CP) et la connaissance externe fournie pendant l’inférence (connaissance contextuelle, CC). Comprendre comment les GML priorisent une source de connaissance par rapport à l’autre reste un défi. Dans cet article, nous proposons un nouveau cadre de sondage pour explorer les mécanismes régissant la sélection entre CP et CC dans les GML. En utilisant des prompts contrôlées conçues pour contredire la CP du modèle, nous démontrons que des activations spécifiques du modèle sont indicatives de la source de connaissance employée. Nous évaluons ce cadre sur divers GML de différentes tailles et démontrons que les activations des couches intermédiaires, en particulier celles liées aux relations dans l’entrée, sont cruciales pour prédire la sélection de la source de connaissances, ouvrant la voie à des modèles plus fiables capables de gérer efficacement les conflits de connaissances.

2024

Probing Language Models on Their Knowledge Source
Zineddine Tighidet | Jiali Mei | Benjamin Piwowarski | Patrick Gallinari
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Large Language Models (LLMs) often encounter conflicts between their learned, internal (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs models prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model’s PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.

LOCOST: State-Space Models for Long Document Abstractive Summarization
Florian Le Bronnec | Song Duong | Mathieu Ravaut | Alexandre Allauzen | Nancy Chen | Vincent Guigue | Alberto Lumbreras | Laure Soulier | Patrick Gallinari
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of 𝒪(L log L), this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.

LOCOST: Modèles Espace-État pour le Résumé Abstractif de Documents Longs
Florian Le Bronnec | Song Duong | Alexandre Allauzen | Vincent Guigue | Alberto Lumbreras | Laure Soulier | Patrick Gallinari
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

Les modèles espace-état constituent une alternative peu coûteuse en termes de complexité de calcul aux transformeurs pour le codage de longues séquences et la capture de longues dépendances. Nous proposons LOCOST: une architecture encodeur-décodeur basée sur des modèles espace-état pour la génération de textes conditionnels avec de longues entrées contextuelles. Avec une complexité de calcul de O(L log L), cette architecture peut traiter des séquences beaucoup plus longues que les modèles de référence qui sont basés sur des modèles d’attention parcimonieux. Nous évaluons notre modèle sur une série de tâches de résumé abstractif de longs documents. Le modèle atteint un niveau de performance qui est 93-96 comparable aux transformeurs parcimonieux les plus performants de la même taille tout en économisant jusqu’à 50 de mémoire pendant l’apprentissage et jusqu’à 87 pendant l’inférence. En outre, LOCOST traite efficacement les entrées dépassant 600K tokens au moment de l’inférence, établissant de nouveaux résultats de référence sur le résumé de livre complet et ouvrant de nouvelles perspectives pour le traitement des entrées longues.

2023

Enhancing factualness and controllability of Data-to-Text Generation via data Views and constraints
Craig Thomson | Clement Rebuffel | Ehud Reiter | Laure Soulier | Somayajulu Sripada | Patrick Gallinari
Proceedings of the 16th International Natural Language Generation Conference

Neural data-to-text systems lack the control and factual accuracy required to generate useful and insightful summaries of multidimensional data. We propose a solution in the form of data views, where each view describes an entity and its attributes along specific dimensions. A sequence of views can then be used as a high-level schema for document planning, with the neural model handling the complexities of micro-planning and surface realization. We show that our view-based system retains factual accuracy while offering high-level control of output that can be tailored based on user preference or other norms within the domain.

Evaluating the Generalization Property of Prefix-based Methods for Data-to-text Generation
Clarine Vongpaseut | Alberto Lumbreras | Mike Gartrell | Patrick Gallinari
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts

Fine-tuning is the prevalent paradigm to adapt pre-trained language models to downstream tasks. Lightweight fine-tuning methods, such as prefix-tuning, only tune a small set of parameters which alleviates cost. Such methods were shown to achieve results similar to fine-tuning; however, performance can decrease when the inputs get farther from the training domain. Moreover, latest works questioned the efficiency of recent lightweight fine-tuning techniques depending on the task and the size of the model. In this paper, we propose to evaluate the generalization property of prefix-based methods depending on the size of the pre-trained language model in the multi-domain setting on data-to-text generation. We found that their performance depends heavily on the size of the model.

2021

QuestEval: Summarization Asks for Fact-based Evaluation
Thomas Scialom | Paul-Alexis Dray | Sylvain Lamprier | Benjamin Piwowarski | Jacopo Staiano | Alex Wang | Patrick Gallinari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Summarization evaluation remains an open research problem: current metrics such as ROUGE are known to be limited and to correlate poorly with human judgments. To alleviate this issue, recent work has proposed evaluation metrics which rely on question answering models to assess whether a summary contains all the relevant information in its source document. Though promising, the proposed approaches have so far failed to correlate better than ROUGE with human judgments. In this paper, we extend previous approaches and propose a unified framework, named QuestEval. In contrast to established metrics such as ROUGE or BERTScore, QuestEval does not require any ground-truth reference. Nonetheless, QuestEval substantially improves the correlation with human judgments over four evaluation dimensions (consistency, coherence, fluency, and relevance), as shown in extensive experiments.

Data-QuestEval: A Referenceless Metric for Data-to-Text Semantic Evaluation
Clement Rebuffel | Thomas Scialom | Laure Soulier | Benjamin Piwowarski | Sylvain Lamprier | Jacopo Staiano | Geoffrey Scoutheeten | Patrick Gallinari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

QuestEval is a reference-less metric used in text-to-text tasks, that compares the generated summaries directly to the source text, by automatically asking and answering questions. Its adaptation to Data-to-Text tasks is not straightforward, as it requires multimodal Question Generation and Answering systems on the considered tasks, which are seldom available. To this purpose, we propose a method to build synthetic multimodal corpora enabling to train multimodal components for a data-QuestEval metric. The resulting metric is reference-less and multimodal; it obtains state-of-the-art correlations with human judgment on the WebNLG and WikiBio benchmarks. We make data-QuestEval’s code and models available for reproducibility purpose, as part of the QuestEval project.

Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction
Bruno Taillé | Vincent Guigue | Geoffrey Scoutheeten | Patrick Gallinari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

State-of-the-art NLP models can adopt shallow heuristics that limit their generalization capability (McCoy et al., 2019). Such heuristics include lexical overlap with the training set in Named-Entity Recognition (Taille et al., 2020) and Event or Type heuristics in Relation Extraction (Rosenman et al., 2020). In the more realistic end-to-end RE setting, we can expect yet another heuristic: the mere retention of training relation triples. In this paper we propose two experiments confirming that retention of known facts is a key factor of performance on standard benchmarks. Furthermore, one experiment suggests that a pipeline model able to use intermediate type representations is less prone to over-rely on retention.

2020

Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction!
Bruno Taillé | Vincent Guigue | Geoffrey Scoutheeten | Patrick Gallinari
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite efforts to distinguish three different evaluation setups (Bekoulis et al., 2018), numerous end-to-end Relation Extraction (RE) articles present unreliable performance comparison to previous work. In this paper, we first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation. We then propose a small empirical study to quantify the most common mistake’s impact and evaluate it leads to overestimating the final RE performance by around 5% on ACE05. We also seize this opportunity to study the unexplored ablations of two recent developments: the use of language model pretraining (specifically BERT) and span-level NER. This meta-analysis emphasizes the need for rigor in the report of both the evaluation setting and the dataset statistics. We finally call for unifying the evaluation setting in end-to-end RE.

PARENTing via Model-Agnostic Reinforcement Learning to Correct Pathological Behaviors in Data-to-Text Generation
Clement Rebuffel | Laure Soulier | Geoffrey Scoutheeten | Patrick Gallinari
Proceedings of the 13th International Conference on Natural Language Generation

In language generation models conditioned by structured data, the classical training via maximum likelihood almost always leads models to pick up on dataset divergence (i.e., hallucinations or omissions), and to incorporate them erroneously in their own generations at inference. In this work, we build on top of previous Reinforcement Learning based approaches and show that a model-agnostic framework relying on the recently introduced PARENT metric is efficient at reducing both hallucinations and omissions. Evaluations on the widely used WikiBIO and WebNLG benchmarks demonstrate the effectiveness of this framework compared to state-of-the-art models.

What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom | Patrick Bordes | Paul-Alexis Dray | Jacopo Staiano | Patrick Gallinari
Proceedings of the 13th International Conference on Natural Language Generation

Pre-trained language models have recently contributed to significant advances in NLP tasks. Recently, multi-modal versions of BERT have been developed, using heavy pre-training relying on vast corpora of aligned textual and image data, primarily applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, by avoiding pre-training made on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, that enables to study the impact of each modality (as input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage on either mono- or multi- modal representations. The results reported under different configurations indicate an innate capacity for BERT-gen to adapt to multi-modal data and text generation, even with few data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state-of-the-art on two established VQG datasets.

2019

Incorporating Visual Semantics into Sentence Representations within a Grounded Space
Patrick Bordes | Eloi Zablocki | Laure Soulier | Benjamin Piwowarski | Patrick Gallinari
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations — the focus of this paper — as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.

2018

DEFT 2018: Attention sélective pour classification de microblogs (DEFT 2018 : Selective Attention for Microblogging Classification )
Charles-Emmanuel Dias | Clara de Forsan de Gainon Gabriac | Patrick Gallinari | Vincent Guigue
Actes de la Conférence TALN. Volume 2 - Démonstrations, articles des Rencontres Jeunes Chercheurs, ateliers DeFT

Dans le cadre de l’atelier DEFT 2018 nous nous sommes intéressés à la classification de microblogs (ici, des tweets) rédigés en français. Ici, nous proposons une méthode se basant sur un réseau hiérarchique de neurones récurrent avec attention. La spécificité de notre architecture est de prendre en compte –via un mechanisme d’attention et de portes– les hashtags et les mentions directes (e.g., @user), spécifiques aux microblogs. Notre modèle a obtenu de très bon résultats sur la première tâche et des résultats compétitifs sur la seconde.

2006

A Machine Learning based Approach to Evaluating Retrieval Systems
Huyen-Trang Vu | Patrick Gallinari
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference