Leonardo F. R. Ribeiro - ACL Anthology

Leonardo F. R. Ribeiro

Also published as: Leonardo F . R. Ribeiro

2026

RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
Aashiq Muhamed | Leonardo F. R. Ribeiro | Markus Dreyer | Virginia Smith | Mona T. Diab
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting dangerous over-confidence or over-caution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks—RefusalBench-NQ (single-document) and RefusalBench-GaRAGe (multi-document), and our complete generation framework to enable continued, dynamic evaluation of this critical capability.

What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
Shaomu Tan | Dawei Zhu | Ke Tran | Michael Denkowski | Sony Trenous | Leonardo F. R. Ribeiro | Bill Byrne | Felix Hieber
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Iterative refinement is a simple inference-time strategy for machine translation: given an initial translation, an LLM revises it without additional training. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering six LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, a) we find a robust recipe: document-level MT followed by segment-level refinement yields the strongest and most stable improvements. In our setting, doc-level refinement often makes fewer edits and leads to smaller or less reliable gains. Surprisingly, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. b) Fine-grained MQM analyses and professional-translator evaluation show that gains come primarily from fluency, with limited improvements in adequacy. c) Probing translator-refiner strength interactions suggests refinement behaves less like targeted post-editing and more like projecting outputs toward the refiner’s learned distribution while remaining anchored to the initial translation.

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Yukun Huang | Leonardo F. R. Ribeiro | Momchil Hardalov | Bhuwan Dhingra | Markus Dreyer | Venkatesh Saligrama
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers usually target general-domain atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs.Yet building such a benchmark for DRR fact-checkers is itself difficult because it requires expert judgments over cognitively demanding, domain-specific claims.In a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on hidden known-answer claims. We therefore propose evolving benchmarking via **Audit-then-Score** (**AtS**), in which labels and rationales remain revisable: when a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before scoring. After three additional **AtS** rounds, expert accuracy rises to 90.9%, showing that experts are better auditors than one-shot labelers.We instantiate **AtS** as **DeepFactBench**, a versioned DRR factuality benchmark with auditable rationales, and introduce **DeepFactEval**, a claim-level verifier.On the frozen **DeepFactBench** release, **DeepFactEval** achieves 83.4% accuracy, outperforming the best prior deep-research and traditional fact-checkers by 14.3 and 24.9 points, respectively, and transferring well to external factuality datasets.

Benchmarking Deflection and Hallucination in Large Vision-Language Models
Nicholas Moratelli | Christopher Davis | Leonardo F. R. Ribeiro | Bill Byrne | Gonzalo Iglesias
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Vision–Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., "Sorry, I cannot answer...") when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

2025

Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning
Hyundong Justin Cho | Karishma Sharma | Nicolaas Paul Jedema | Leonardo F. R. Ribeiro | Jonathan May | Alessandro Moschitti
Findings of the Association for Computational Linguistics: NAACL 2025

Language models are aligned to the collective voice of many, resulting in generic outputs that do not align with specific users’ styles. In this work, we present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method that personalizes language models for text generation tasks with fewer than 10 examples per user. TICL iteratively expands an in-context learning prompt via a trial-error-explain process, adding model-generated negative samples and explanations that provide fine-grained guidance towards a specific user’s style. TICL achieves favorable win rates on pairwise comparisons with LLM-as-a-judge up to 91.5% against the previous state-of-the-art and outperforms competitive tuning-free baselines for personalized alignment tasks of writing emails, essays and news articles. Both lexical and qualitative analyses show that the negative samples and explanations enable language models to learn stylistic context more effectively and overcome the bias towards structural and formal phrases observed in their zero-shot outputs. By front-loading inference compute to create a user-specific in-context learning prompt that does not require extra generation steps at test time, presents a novel yet simple approach for personalized alignment.

XRAG: Cross-lingual Retrieval-Augmented Generation
Wei Liu | Sony Trenous | Leonardo F. R. Ribeiro | Bill Byrne | Felix Hieber
Findings of the Association for Computational Linguistics: EMNLP 2025

We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external know-ledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
Ionut Teodor Sorodoc | Leonardo F. R. Ribeiro | Rexhina Blloshmi | Christopher Davis | Adrià de Gispert
Findings of the Association for Computational Linguistics: ACL 2025

We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM’s ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F₁ in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.

NeoQA: Evidence-based Question Answering with Generated News Events
Max Glockner | Xiang Jiang | Leonardo F. R. Ribeiro | Iryna Gurevych | Markus Dreyer
Findings of the Association for Computational Linguistics: ACL 2025

Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.

RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation
Andrei C. Coman | Ionut-Teodor Sorodoc | Leonardo F. R. Ribeiro | James Henderson | Bill Byrne | Adrià de Gispert
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Existing Reward Models (RMs), typically trained on general preference data, struggle in Retrieval Augmented Generation (RAG) settings, which require judging responses for faithfulness to retrieved context, relevance to the user query, appropriate refusals when context is insufficient, completeness and conciseness of information. To address the lack of publicly available RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a methodology that repurposes question-answering (QA) datasets into preference pairs that prioritise groundedness over stylistic features, enabling the training of contextual RMs better suited to judging RAG responses. Using RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on much larger (up to 2.4M samples) general corpora, with an absolute improvement of +15.5%.

2024

At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs areconcise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages?ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategiesto approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when rankingshort summaries, but may not help as much when ranking systems or longer summaries.

Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA
Nirmal Roy | Leonardo F. R. Ribeiro | Rexhina Blloshmi | Kevin Small
Findings of the Association for Computational Linguistics: EMNLP 2024

Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users’ contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG with improvements of ~13% measured by human annotation.

FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking
Zhuoer Wang | Leonardo F. R. Ribeiro | Alexandros Papangelis | Rohan Mukherjee | Tzu-Yen Wang | Xinyan Zhao | Arijit Biswas | James Caverlee | Angeliki Metallinou
Findings of the Association for Computational Linguistics: EMNLP 2024

API call generation is the cornerstone of large language models’ tool-using ability that provides access to the larger world. However, existing supervised and in-context learning approaches suffer from high training costs, poor data efficiency, and generated API calls that can be unfaithful to the API documentation and the user’s request. To address these limitations, we propose an output-side optimization approach called FANTASE. Two of the unique contributions of FANTASE are its State-Tracked Constrained Decoding (SCD) and Reranking components. SCD dynamically incorporates appropriate API constraints in the form of Token Search Trie for efficient and guaranteed generation faithfulness with respect to the API documentation. The Reranking component efficiently brings in the supervised signal by leveraging a lightweight model as the discriminator to rerank the beam-searched candidate generations of the large language model. We demonstrate the superior performance of FANTASE in API call generation accuracy, inference efficiency, and context efficiency with DSTC8 and API Bank datasets.

Measuring Retrieval Complexity in Question Answering Systems
Matteo Gabburo | Nicolaas Paul Jedema | Siddhant Garg | Leonardo F. R. Ribeiro | Alessandro Moschitti
Findings of the Association for Computational Linguistics: ACL 2024

In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system.Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty.Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets.

Speechworthy Instruction-tuned Language Models
Hyundong Justin Cho | Nicolaas Paul Jedema | Leonardo F. R. Ribeiro | Karishma Sharma | Pedro Szekely | Alessandro Moschitti | Ruben Janssen | Jonathan May
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Current instruction-tuned language models are exclusively trained with textual preference data and thus may not be aligned to the unique requirements of other modalities, such as speech. To better align language models with the speech domain, we explore i) prompting strategies based on radio-industry best practices and ii) preference learning using a novel speech-based preference data of 20K samples collected by annotators who listen to response pairs. Both human and automatic evaluation show that both prompting and preference learning increase the speech-suitability of popular instruction tuned LLMs. More interestingly, we show that these methods are additive; combining them achieves the best win rates in head-to-head comparison, resulting in responses that are preferred or tied to the base model in 76.2% of comparisons on average. Lastly, we share lexical, syntactical, and qualitative analyses that elicit how our studied methods differ with baselines in generating more speech-suitable responses.

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
Luis Chiruzzo | Hung-yi Lee | Leonardo F. R. Ribeiro
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset
Vaidehi Patil | Leonardo F. R. Ribeiro | Mengwen Liu | Mohit Bansal | Markus Dreyer
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal Large Language Models (MLLMs) excel at synthesizing key information from diverse sources. However, generating accurate and faithful multimodal summaries is challenging, primarily due to the lack of appropriate multimodal datasets for fine-tuning that meaningfully integrate textual and visual modalities. To address this gap, we present a new dataset designed specifically for image-text multimodal summarization, harnessing the capabilities of state-of-the-art MLLMs. We generate summaries from Wikipedia sections and corresponding images and evaluate them across text-based, visual and multimodal dimensions, employing reference-free metrics. To refine the dataset, we: (1) Filter the MLLM-generated summaries by training a critic model on human annotations and using its predictions to remove low-quality summaries; (2) Fine-tune the MLLM with the filtered high-quality summaries; (3) Use the fine-tuned model in turn to regenerate the summaries. This self-refinement process significantly improves summary quality, as measured by human judgements and automatic multimodal metrics, resulting in a valuable dataset for multimodal summarization research. The dataset is publicly available at https://github.com/amazon-science/refinesumm.

2023

Generating Summaries with Controllable Readability Levels
Leonardo F. R. Ribeiro | Mohit Bansal | Markus Dreyer
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Readability refers to how easily a reader can understand a written text. Several factors affect the readability level, such as the complexity of the text, its subject matter, and the reader’s background knowledge. Generating summaries based on different readability levels is critical for enabling knowledge consumption by diverse audiences. However, current text generation approaches lack refined control, resulting in texts that are not customized to readers’ proficiency levels. In this work, we bridge this gap and study techniques to generate summaries at specified readability levels. Unlike previous methods that focus on a specific readability level (e.g., lay summarization), we generate summaries with fine-grained control over their readability. We develop three text generation techniques for controlling readability: (1) instruction-based readability control, (2) reinforcement learning to minimize the gap between requested and observed readability and (3) a decoding approach that uses lookahead to estimate the readability of upcoming decoding steps. We show that our generation methods significantly improve readability control on news summarization (CNN/DM dataset), as measured by various readability metrics and human judgement, establishing strong baselines for controllable readability in summarization.

2022

FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations
Leonardo F. R. Ribeiro | Mengwen Liu | Iryna Gurevych | Markus Dreyer | Mohit Bansal
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and usage in real-world applications. Recent works have shown promising improvements in factuality error identification using text or dependency arc entailments; however, they do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MR), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content in both document and summary in a canonical form, and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, along with text representations using an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance on identifying content verifiability errors and better captures subsentence-level factual inconsistencies.

Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking
Tim Baumgärtner | Leonardo F. R. Ribeiro | Nils Reimers | Iryna Gurevych
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Pairing a lexical retriever with a neural re-ranking model has set state-of-the-art performance on large-scale information retrieval datasets. This pipeline covers scenarios like question answering or navigational queries, however, for information-seeking scenarios, users often provide information on whether a document is relevant to their query in form of clicks or explicit feedback. Therefore, in this work, we explore how relevance feedback can be directly integrated into neural re-ranking models by adopting few-shot and parameter-efficient learning techniques. Specifically, we introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant. Further, we explore Cross-Encoder models that we pre-train using meta-learning and subsequently fine-tune for each query, training only on the feedback documents. To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario. Extensive experiments demonstrate that integrating relevance feedback directly in neural re-ranking models improves their performance, and fusing lexical ranking with our best performing neural re-ranker outperforms all other methods by 5.2% nDCG@20.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina Mcmillan-major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez Beltrachini | Leonardo F . R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondrej Dusek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

UKP-SQUARE: An Online Platform for Question Answering Research
Tim Baumgärtner | Kexin Wang | Rachneet Sachdeva | Gregor Geigle | Max Eichler | Clifton Poth | Hannah Sterz | Haritz Puerto | Leonardo F. R. Ribeiro | Jonas Pfeiffer | Nils Reimers | Gözde Şahin | Iryna Gurevych
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Recent advances in NLP and information retrieval have given rise to a diverse set of question answering tasks that are of different formats (e.g., extractive, abstractive), require different model architectures (e.g., generative, discriminative), and setups (e.g., with or without retrieval). Despite having a large number of powerful, specialized QA pipelines (which we refer to as Skills) that consider a single domain, model or setup, there exists no framework where users can easily explore and compare such pipelines and can extend them according to their needs. To address this issue, we present UKP-SQuARE, an extensible online QA platform for researchers which allows users to query and analyze a large collection of modern Skills via a user-friendly web interface and integrated behavioural tests. In addition, QA researchers can develop, manage, and share their custom Skills using our microservices that support a wide range of models (Transformers, Adapters, ONNX), datastores and retrieval techniques (e.g., sparse and dense). UKP-SQuARE is available on https://square.ukp-lab.de

UKP-SQuARE v2: Explainability and Adversarial Attacks for Trustworthy QA
Rachneet Sachdeva | Haritz Puerto | Tim Baumgärtner | Sewin Tariverdian | Hao Zhang | Kexin Wang | Hossain Shaikh Saadi | Leonardo F. R. Ribeiro | Iryna Gurevych
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

Question Answering (QA) systems are increasingly deployed in applications where they support real-world decisions. However, state-of-the-art models rely on deep neural networks, which are difficult to interpret by humans. Inherently interpretable models or post hoc explainability methods can help users to comprehend how a model arrives at its prediction and, if successful, increase their trust in the system. Furthermore, researchers can leverage these insights to develop new methods that are more accurate and less biased. In this paper, we introduce SQuARE v2, the new version of SQuARE, to provide an explainability infrastructure for comparing models based on methods such as saliency maps and graph-based explanations. While saliency maps are useful to inspect the importance of each input token for the model’s prediction, graph-based explanations from external Knowledge Graphs enable the users to verify the reasoning behind the model prediction. In addition, we provide multiple adversarial attacks to compare the robustness of QA models. With these explainability methods and adversarial attacks, we aim to ease the research on trustworthy QA models. SQuARE is available on https://square.ukp-lab.de.

2021

Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs
Martin Schmitt | Leonardo F. R. Ribeiro | Philipp Dufter | Iryna Gurevych | Hinrich Schütze
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation. With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns. We represent the relation between two nodes as the length of the shortest path between them. Graformer learns to weight these node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph. We evaluate Graformer on two popular graph-to-text generation benchmarks, AGENDA and WebNLG, where it achieves strong performance while using many fewer parameters than other approaches.

Investigating Pretrained Language Models for Graph-to-Text Generation
Leonardo F. R. Ribeiro | Martin Schmitt | Hinrich Schütze | Iryna Gurevych
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recent pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. We show that approaches based on PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further. We report new state-of-the-art BLEU scores of 49.72 on AMR-LDC2017T10, 59.70 on WebNLG, and 25.66 on AGENDA datasets - a relative improvement of 31.8%, 4.5%, and 42.4%, respectively, with our models generating significantly more fluent texts than human references. In an extensive analysis, we identify possible reasons for the PLMs’ success on graph-to-text tasks. Our findings suggest that the PLMs benefit from similar facts seen during pretraining or fine-tuning, such that they perform well even when the input graph is reduced to a simple bag of node and edge labels.

A Neural Graph-based Local Coherence Model
Mohsen Mesgar | Leonardo F. R. Ribeiro | Iryna Gurevych
Findings of the Association for Computational Linguistics: EMNLP 2021

Entity grids and entity graphs are two frameworks for modeling local coherence. These frameworks represent entity relations between sentences and then extract features from such representations to encode coherence. The benefits of convolutional neural models for extracting informative features from entity grids have been recently studied. In this work, we study the benefits of Relational Graph Convolutional Networks (RGCN) to encode entity graphs for measuring local coherence. We evaluate our neural graph-based model for two benchmark coherence evaluation tasks: sentence ordering (SO) and summary coherence rating (SCR). The results show that our neural graph-based model consistently outperforms the neural grid-based model for both tasks. Our model performs competitively with a strong baseline coherence model, while our model uses 50% fewer parameters. Our work defines a new, efficient, and effective baseline for local coherence modeling.

Smelting Gold and Silver for Improved Multilingual AMR-to-Text Generation
Leonardo F. R. Ribeiro | Jonas Pfeiffer | Yue Zhang | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent work on multilingual AMR-to-text generation has exclusively focused on data augmentation strategies that utilize silver AMR. However, this assumes a high quality of generated AMRs, potentially limiting the transferability to the target task. In this paper, we investigate different techniques for automatically generating AMR annotations, where we aim to study which source of information yields better multilingual results. Our models trained on gold AMR with silver (machine translated) sentences outperform approaches which leverage generated silver AMR. We find that combining both complementary sources of information further improves multilingual AMR-to-text generation. Our models surpass the previous state of the art for German, Italian, Spanish, and Chinese by a large margin.

Structural Adapters in Pretrained Language Models for AMR-to-Text Generation
Leonardo F. R. Ribeiro | Yue Zhang | Iryna Gurevych
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pretrained language models (PLM) have recently advanced graph-to-text generation, where the input graph is linearized into a sequence and fed into the PLM to obtain its representation. However, efficiently encoding the graph structure in PLMs is challenging because such models were pretrained on natural language, and modeling structured data may lead to catastrophic forgetting of distributional knowledge. In this paper, we propose StructAdapt, an adapter method to encode graph structure into PLMs. Contrary to prior work, StructAdapt effectively models interactions among the nodes based on the graph connectivity, only training graph structure-aware adapter parameters. In this way, we incorporate task-specific knowledge while maintaining the topological structure of the graph. We empirically show the benefits of explicitly encoding graph structure into PLMs using StructAdapt, outperforming the state of the art on two AMR-to-text datasets, training only 5.1% of the PLM parameters.

2020

Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs
Leonardo F. R. Ribeiro | Yue Zhang | Claire Gardent | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 8

Recent graph-to-text models generate text from graph-based data using either global or local aggregation to learn node representations. Global node encoding allows explicit communication between two distant nodes, thereby neglecting graph topology as all nodes are directly connected. In contrast, local node encoding considers the relations between neighbor nodes capturing the graph structure, but it can fail to capture long-range relations. In this work, we gather both encoding strategies, proposing novel neural models that encode an input graph combining both global and local node contexts, in order to learn better contextualized node embeddings. In our experiments, we demonstrate that our approaches lead to significant improvements on two graph-to-text datasets achieving BLEU scores of 18.01 on the AGENDA dataset, and 63.69 on the WebNLG dataset for seen categories, outperforming state-of-the-art models by 3.7 and 3.1 points, respectively.1

Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers
Anne Lauscher | Olga Majewska | Leonardo F. R. Ribeiro | Iryna Gurevych | Nikolai Rozanov | Goran Glavaš
Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph.

2019

Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference
Tobias Falke | Leonardo F. R. Ribeiro | Prasetya Ajie Utama | Ido Dagan | Iryna Gurevych
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While recent progress on abstractive summarization has led to remarkably fluent summaries, factual errors in generated summaries still severely limit their use in practice. In this paper, we evaluate summaries produced by state-of-the-art models via crowdsourcing and show that such errors occur frequently, in particular with more abstractive models. We study whether textual entailment predictions can be used to detect such errors and if they can be reduced by reranking alternative predicted summaries. That leads to an interesting downstream application for entailment models. In our experiments, we find that out-of-the-box entailment models trained on NLI datasets do not yet offer the desired performance for the downstream task and we therefore release our annotations as additional test data for future extrinsic evaluations of NLI.

Enhancing AMR-to-Text Generation with Dual Graph Representations
Leonardo F. R. Ribeiro | Claire Gardent | Iryna Gurevych
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating text from graph-based data, such as Abstract Meaning Representation (AMR), is a challenging task due to the inherent difficulty in how to properly encode the structure of a graph with labeled edges. To address this difficulty, we propose a novel graph-to-sequence model that encodes different but complementary perspectives of the structural information contained in the AMR graph. The model learns parallel top-down and bottom-up representations of nodes capturing contrasting views of the graph. We also investigate the use of different node message passing strategies, employing different state-of-the-art graph encoders to compute node representations based on incoming and outgoing perspectives. In our experiments, we demonstrate that the dual graph representation leads to improvements in AMR-to-text generation, achieving state-of-the-art results on two AMR datasets

Co-authors

Alessandro Moschitti 3

Rexhina Blloshmi 2

Hyundong Justin Cho 2

Christopher Davis 2

Daniel Deutsch 2

Claire Gardent 2

Sebastian Gehrmann 2

Saad Mahamood 2

Alexandros Papangelis 2

Jonas Pfeiffer 2

Haritz Puerto 2

Rachneet Sachdeva 2

Martin Schmitt 2

Hinrich Schütze 2

Karishma Sharma 2

Ionut Sorodoc 2

Adrià de Gispert 2

Tosin Adewumi 1

Pawan Sasanka Ammanamanchi 1

Chandra Bhagavatula 1

Abhik Bhattacharjee 1

Arijit Biswas 1

Samuel Cahyawijaya 1

Ronald Cardenas 1

James Caverlee 1

Khyathi Chandu 1

Khyathi Raghavi Chandu 1

Luis Chiruzzo 1

Elizabeth Clark 1

Miruna Clinciu 1

Andrei C. Coman 1

Mathias Creutz 1

Michael Denkowski 1

Bhuwan Dhingra 1

Kaustubh Dhole 1

Philipp Dufter 1

Ondřej Dušek 1

Moussa Kamal Eddine 1

Matteo Gabburo 1

Cristina Garbacea 1

Siddhant Garg 1

Gregor Geigle 1

Dimitra Gkatzia 1

Goran Glavaš 1

Momchil Hardalov 1

Hiroaki Hayashi 1

James Henderson 1

Gonzalo Iglesias 1

Ruben Janssen 1

Yacine Jernite 1

Shailza Jolly 1

Juraj Juraska 1

Mihir Sanjay Kale 1

Jenna Kanerva 1

Faisal Ladhak 1

Anne Lauscher 1

Paul Pu Liang 1

Abinaya Mahendiran 1

Olga Majewska 1

Joshua Maynez 1

Angelina McMillan-Major 1

Mohsen Mesgar 1

Angeliki Metallinou 1

Sebastien Montella 1

Nicholas Moratelli 1

Aashiq Muhamed 1

Rohan Mukherjee 1

Marcel Nawrath 1

Vitaly Nikolaev 1

Jekaterina Novikova 1

Agnieszka Nowak 1

Vaidehi Patil 1

Laura Perez-Beltrachini 1

Ratish Puduppully 1

Mahim Pushkarna 1

Dragomir Radev 1

Nikolai Rozanov 1

Hossain Shaikh Saadi 1

Venkatesh Saligrama 1

Rifat Shahriyar 1

Virginia Smith 1

Hendrik Strobelt 1

Nishant Subramani 1

Pedro Szekely 1

Sewin Tariverdian 1

Craig Thomson 1

Lewis Tunstall 1

Ashish Upadhyay 1

Prasetya Ajie Utama 1

Danilo C. Walenta 1

Michael White 1

Genta Indra Winata 1

Deyi Xiong (德意熊) 1

Bingsheng Yao 1

Gözde Gül Şahin 1

Sanja Štajner 1

Venues