Pasquale Minervini - ACL Anthology

Pasquale Minervini

2026

Analyzing LLM Instruction Optimization for Tabular Fact Verification
Xiaotang Du | Giwon Hong | Wai-Chung Kwan | Rohit Saxena | Ivan Titov | Pasquale Minervini | Emily Allaway
Findings of the Association for Computational Linguistics: EACL 2026

Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework—COPRO, MiPROv2, and SIMBA—across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.

2025

Mixtures of In-Context Learners
Giwon Hong | Emile Van Krieken | Edoardo Ponti | Nikolay Malkin | Pasquale Minervini
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it is very sensitive to the choice of in-context demonstrations, and processing many demonstrations can be computationally demanding. We propose Mixtures of In-Context Learners (MoICL), a novel approach that uses subsets of demonstrations to train a set of experts via ICL and learns a weighting function to merge their output distributions via gradient-based optimisation. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (e.g., up to +13% compared to ICL and LENS). Moreover, we improve the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%) and perturbed demonstrations (up to +38%).

SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Gayane Ghazaryan | Erik Arakelyan | Isabelle Augenstein | Pasquale Minervini
Proceedings of the 31st International Conference on Computational Linguistics

Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resoucre languages. We utilize parallel content mining to obtain human-curated paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English human-curated paragraphs form the final QA dataset. The method allows to maintain content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with 1.2K samples for the Armenian language. The human evaluation shows that 98% of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out ~70% of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.

GRADA: Graph-based Reranking against Adversarial Documents Attack
Jingjie Zheng | Aryo Pradipta Gema | Giwon Hong | Xuanli He | Pasquale Minervini | Youcheng Sun | Qiongkai Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Retrieval Augmented Generation (RAG) frameworks can improve the factual accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models’ static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. Thus, we propose a simple yet effective **G**raph-based **R**eranking against **A**dversarial **D**ocument **A**ttacks (GRADA) framework aiming at preserving retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on six LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b-Instruct, Llama3.1-70b-Instruct, Qwen2.5-7b-Instruct and Qwen2.5-14b-Instruct. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.

FLARE: Faithful Logic-Aided Reasoning and Exploration
Erik Arakelyan | Pasquale Minervini | Patrick Lewis | Pat Verga | Isabelle Augenstein
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Modern Question Answering (QA) and Reasoning approaches with Large Language Models (LLMs) commonly use Chain-of-Thought (CoT) prompting but struggle with generating outputs faithful to their intermediate reasoning chains. While neuro-symbolic methods like Faithful CoT (F-CoT) offer higher faithfulness through external solvers, they require code-specialized models and struggle with ambiguous tasks.We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), which uses LLMs to plan solutions, formalize queries into logic programs, and simulate code execution through multi-hop search without external solvers. Our method achieves SOTA results on 𝟕 out of 𝟗 diverse reasoning benchmarks and 3 out of 3 logic inference benchmarks while enabling measurement of reasoning faithfulness. We demonstrate that model faithfulness correlates with performance and that successful reasoning traces show an 18.1% increase in unique emergent facts, 8.6% higher overlap between code-defined and execution-trace relations, and 3.6% reduction in unused relations.

Self-Training Large Language Models for Tool-Use Without Demonstrations
Ne Luo | Aryo Pradipta Gema | Xuanli He | Emile Van Krieken | Pietro Lesci | Pasquale Minervini
Findings of the Association for Computational Linguistics: NAACL 2025

Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed using existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task: 3.7% on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.

TUBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
Xuanli He | Jun Wang | Qiongkai Xu | Pasquale Minervini | Pontus Stenetorp | Benjamin I. P. Rubinstein | Trevor Cohn
Findings of the Association for Computational Linguistics: ACL 2025

The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined — such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instructiontuning data for one or two languages can affect the outputs for languages whose instructiontuning data were not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like BLOOM and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English/Chinese data, such as Llama2, Llama3, Qwen2.5, and Gemma. Moreover, our experiments demonstrate 1) High Transferability: the backdoor mechanism operates successfully in cross lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) Robustness: the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.

DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations
Aryo Pradipta Gema | Chen Jin | Ahmed Abdulaal | Tom Diethe | Philip Alexander Teare | Beatrice Alex | Pasquale Minervini | Amrutha Saseendran
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context or incorrectly recalling internal knowledge. Recent studies have identified specific attention heads within the Transformer architecture, known as retrieval heads, responsible for extracting relevant contextual information. We hypothesise that masking these retrieval heads can induce hallucinations and that contrasting the outputs of the base LLM and the masked LLM can reduce hallucinations. To this end, we propose Decoding by Contrasting Retrieval Heads (DeCoRe), a novel training-free decoding strategy that amplifies information found in the context and model parameters. DeCoRe mitigates potentially hallucinated responses by dynamically contrasting the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. Our extensive experiments confirm that DeCoRe improves performance on tasks requiring high contextual faithfulness, such as summarisation (XSum by 18.6%), instruction following (MemoTrap by 10.9%), and open-book question answering (NQ-Open by 2.4% and NQ-Swap by 5.5%).

PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Rohit Saxena | Pasquale Minervini | Frank Keller
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.

Enhancing Long Document Long Form Summarisation with Self-Planning
Xiaotang Du | Rohit Saxena | Laura Perez-Beltrachini | Pasquale Minervini | Ivan Titov
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both an end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach has improved ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. Then, we create MMLU-Redux, which is a subset of 5,700 manually re-annotated questions across all 57 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation.

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao | Alessio Devoto | Giwon Hong | Xiaotang Du | Aryo Pradipta Gema | Hongru Wang | Xuanli He | Kam-Fai Wong | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context—this phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).

2024

SparseFit: Few-shot Prompting with Sparse Fine-tuning for Jointly Generating Predictions and Natural Language Explanations
Jesus Solano | Mardhiyah Sanni | Oana-Maria Camburu | Pasquale Minervini
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Models that generate natural language explanations (NLEs) for their predictions have recently gained increasing interest. However, this approach usually demands large datasets of human-written NLEs for the ground-truth answers at training time, which can be expensive and potentially infeasible for some applications. When only a few NLEs are available (a few-shot setup), fine-tuning pre-trained language models (PLMs) in conjunction with prompt-based learning has recently shown promising results. However, PLMs typically have billions of parameters, making full fine-tuning expensive. We propose SparseFit, a sparse few-shot fine-tuning strategy that leverages discrete prompts to jointly generate predictions and NLEs. We experiment with SparseFit on three sizes of the T5 language model and four datasets and compare it against existing state-of-the-art Parameter-Efficient Fine-Tuning (PEFT) techniques. We find that fine-tuning only 6.8% of the model parameters leads to competitive results for both the task performance and the quality of the generated NLEs compared to full fine-tuning of the model and produces better results on average than other PEFT methods in terms of predictive accuracy and NLE quality.

Analysing The Impact of Sequence Composition on Language Model Pre-Training
Yu Zhao | Yuanbin Qu | Konrad Staniszewski | Szymon Tworkowski | Wei Liu | Piotr Miłoś | Yuxiang Wu | Pasquale Minervini
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored.In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, Bm25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.

Using Natural Language Explanations to Improve Robustness of In-context Learning
Xuanli He | Yuxiang Wu | Oana-Maria Camburu | Pasquale Minervini | Pontus Stenetorp
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent studies demonstrated that large language models (LLMs) can excel in many tasks via in-context learning (ICL). However, recentworks show that ICL-prompted models tend to produce inaccurate results when presented with adversarial inputs. In this work, we investigate whether augmenting ICL with natural language explanations (NLEs) improves the robustness of LLMs on adversarial datasets covering natural language inference and paraphrasing identification. We prompt LLMs with a small set of human-generated NLEs to produce further NLEs, yielding more accurate results than both a zero-shot-ICL setting and using only human-generated NLEs. Our results on five popular LLMs (GPT3.5-turbo, Llama2, Vicuna, Zephyr, and Mistral) show that our approach yields over 6% improvement over baseline approaches for eight adversarial datasets: HANS, ISCS, NaN, ST, PICD, PISP, ANLI, and PAWS. Furthermore, previous studies have demonstrated that prompt selection strategies significantly enhance ICL on in-distribution test sets. However, our findings reveal that these strategies do not match the efficacy of our approach for robustness evaluations, resulting in an accuracy drop of 8% compared to the proposed approach.

Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain
Aryo Gema | Pasquale Minervini | Luke Daines | Tom Hope | Beatrice Alex
Proceedings of the 6th Clinical Natural Language Processing Workshop

Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. Parameter-Efficient Fine-Tuning (PEFT) techniques for fine-tuning language models significantly reduce computational requirements by selectively fine-tuning small subsets of parameters. In this study, we propose a two-step PEFT framework and evaluate it in the clinical domain. Our approach combines a specialised PEFT adapter layer designed for clinical domain adaptation with another adapter specialised for downstream tasks. We evaluate the framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our framework achieves a better AUROC score averaged across all clinical downstream tasks compared to clinical language models. In particular, we observe large improvements of 4-5% AUROC in large-scale multilabel classification tasks, such as diagnoses and procedures classification. To our knowledge, this study is the first to provide an extensive empirical analysis of the interplay between PEFT techniques and domain adaptation in an important real-world domain of clinical applications.

Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
Burcu Sayin | Pasquale Minervini | Jacopo Staiano | Andrea Passerini
Proceedings of the 6th Clinical Natural Language Processing Workshop

We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.

Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints
Aryo Gema | Chaeeun Lee | Pasquale Minervini | Luke Daines | T. Simpson | Beatrice Alex
Proceedings of the 6th Clinical Natural Language Processing Workshop

The MEDIQA-CORR 2024 shared task aims to assess the ability of Large Language Models (LLMs) to identify and correct medical errors in clinical notes. In this study, we evaluate the capability of general LLMs, specifically GPT-3.5 and GPT-4, to identify and correct medical errors with multiple prompting strategies. Recognising the limitation of LLMs in generating accurate corrections only via prompting strategies, we propose incorporating error-span predictions from a smaller, fine-tuned model in two ways: 1) by presenting it as a hint in the prompt and 2) by framing it as multiple-choice questions from which the LLM can choose the best correction. We found that our proposed prompting strategies significantly improve the LLM’s ability to generate corrections. Our best-performing solution with 8-shot + CoT + hints ranked sixth in the shared task leaderboard. Additionally, our comprehensive analyses show the impact of the location of the error sentence, the prompted role, and the position of the multiple-choice option on the accuracy of the LLM. This prompts further questions about the readiness of LLM to be implemented in real-world clinical settings.

Atomic Inference for NLI with Generated Facts as Atoms
Joe Stacey | Pasquale Minervini | Haim Dubossarsky | Oana-Maria Camburu | Marek Rei
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

With recent advances, neural models can achieve human-level performance on various natural language tasks. However, there are no guarantees that any explanations from these models are faithful, i.e. that they reflect the inner workings of the model. Atomic inference overcomes this issue, providing interpretable and faithful model decisions. This approach involves making predictions for different components (or atoms) of an instance, before using interpretable and deterministic rules to derive the overall prediction based on the individual atom-level predictions. We investigate the effectiveness of using LLM-generated facts as atoms, decomposing Natural Language Inference premises into lists of facts. While directly using generated facts in atomic inference systems can result in worse performance, with 1) a multi-stage fact generation process, and 2) a training regime that incorporates the facts, our fact-based method outperforms other approaches.

Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
Xin Zhou | Ping Nie | Yiwen Guo | Haojie Wei | Zhanqiu Zhang | Pasquale Minervini | Ruotian Ma | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Retrieval-Augmented Generation (RAG) significantly improved the ability of Large Language Models (LLMs) to solve knowledge-intensive tasks. While existing research seeks to enhance RAG performance by retrieving higher-quality documents or designing RAG-specific LLMs, the internal mechanisms within LLMs that contribute to RAG’s effectiveness remain underexplored. In this paper, we aim to investigate these internal mechanisms within the popular Mixture-of-Expert (MoE)-based LLMs and demonstrate how to improve RAG by examining expert activations in these LLMs. Our controlled experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors. The activation of these core experts can signify the model’s inclination towards external/internal knowledge and adjust its behavior. For instance, we identify core experts that can (1) indicate the sufficiency of the model’s internal knowledge, (2) assess the quality of retrieved documents, and (3) enhance the model’s ability to utilize context. Based on these findings, we propose several strategies to enhance RAG’s efficiency and effectiveness through expert activation. Experimental results across various datasets and MoE LLMs show the effectiveness of our method.

A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression
Alessio Devoto | Yu Zhao | Simone Scardapane | Pasquale Minervini
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L₂ norm and the attention scores over cached KV pairs, where a low L₂ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the L₂ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.

Probing the Emergence of Cross-lingual Alignment during LLM Training
Hetong Wang | Pasquale Minervini | Edoardo Ponti
Findings of the Association for Computational Linguistics: ACL 2024

Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We speculate that this is predicated on their ability to align languages without explicit supervision from parallel sentences. While representations of translationally equivalent sentences in different languages are known to be similar after convergence, however, it remains unclear how such cross-lingual alignment emerges during pre-training of LLMs. Our study leverages intrinsic probing techniques, which identify which subsets of neurons encode linguistic features, to correlate the degree of cross-lingual neuron overlap with the zero-shot cross-lingual transfer performance for a given model. In particular, we rely on checkpoints of BLOOM, a multilingual autoregressive LLM, across different training steps and model scales. We observe a high correlation between neuron overlap and downstream performance, which supports our hypothesis on the conditions leading to effective cross-lingual transfer. Interestingly, we also detect a degradation of both implicit alignment and multilingual abilities in certain phases of the pre-training process, providing new insights into the multilingual pretraining dynamics.

Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)
Russa Biswas | Lucie-Aimée Kaffee | Oshin Agarwal | Pasquale Minervini | Sameer Singh | Gerard de Melo
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4
Aryo Gema | Giwon Hong | Pasquale Minervini | Luke Daines | Beatrice Alex
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

The NLI4CT task assesses Natural Language Inference systems in predicting whether hypotheses entail or contradict evidence from Clinical Trial Reports. In this study, we evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT). We propose a PEFT method to improve the consistency of LLMs by merging adapters that were fine-tuned separately using triplet and language modelling objectives. We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. However, our novel methods did not produce more accurate results than GPT-4 in terms of faithfulness and consistency. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328. Finally, our contamination analysis with GPT-4 indicates that there was no test data leakage. Our code is available at https://github.com/EdinburghClinicalNLP/semeval_nli4ct.

FairBelief - Assessing Harmful Beliefs in Language Models
Mattia Setzu | Marta Marchiori Manerba | Pasquale Minervini | Debora Nozza
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

Language Models (LMs) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing.This paper proposes FairBelief, an analytical approach to capture and assess beliefs, i.e., propositions that an LM may embed with different degrees of confidence and that covertly influence its predictions. With FairBelief, we leverage prompting to study the behavior of several state-of-the-art LMs across different previously neglected axes, such as model scale and likelihood, assessing predictions on a fairness dataset specifically designed to quantify LMs’ outputs’ hurtfulness.Finally, we conclude with an in-depth qualitative assessment of the beliefs emitted by the models.We apply FairBelief to English LMs, revealing that, although these architectures enable high performances on diverse natural language processing tasks, they show hurtful beliefs about specific genders. Interestingly, training procedure and dataset, model scale, and architecture induce beliefs of different degrees of hurtfulness.

2023

REFER: An End-to-end Rationale Extraction Framework for Explanation Regularization
Mohammad Reza Ghasemi Madani | Pasquale Minervini
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Human-annotated textual explanations are becoming increasingly important in Explainable Natural Language Processing. Rationale extraction aims to provide faithful (i.e. reflective of the behavior of the model) and plausible (i.e. convincing to humans) explanations by highlighting the inputs that had the largest impact on the prediction without compromising the performance of the task model. In recent works, the focus of training rationale extractors was primarily on optimizing for plausibility using human highlights, while the task model was trained on jointly optimizing for task predictive accuracy and faithfulness. We propose REFER, a framework that employs a differentiable rationale extractor that allows to back-propagate through the rationale extraction process. We analyze the impact of using human highlights during training by jointly training the task model and the rationale extractor. In our experiments, REFER yields significantly better results in terms of faithfulness, plausibility, and downstream task accuracy on both in-distribution and out-of-distribution data. On both e-SNLI and CoS-E, our best setting produces better results in terms of composite normalized relative gain than the previous baselines by 11% and 3%, respectively.

XQA-DST: Multi-Domain and Multi-Lingual Dialogue State Tracking
Han Zhou | Ignacio Iacobacci | Pasquale Minervini
Findings of the Association for Computational Linguistics: EACL 2023

Dialogue State Tracking (DST), a crucial component of task-oriented dialogue (ToD) systems, keeps track of all important information pertaining to dialogue history: filling slots with the most probable values throughout the conversation. Existing methods generally rely on a predefined set of values and struggle to generalise to previously unseen slots in new domains. To overcome these challenges, we propose a domain-agnostic extractive question answering (QA) approach with shared weights across domains. To disentangle the complex domain information in ToDs, we train our DST with a novel domain filtering strategy by excluding out-of-domain question samples. With an independent classifier that predicts the presence of multiple domains given the context, our model tackles DST by extracting spans in active domains. Empirical results demonstrate that our model can efficiently leverage domain-agnostic QA datasets by two-stage fine-tuning while being both domain-scalable and open vocabulary in DST. It shows strong transferability by achieving zero-shot domain-adaptation results on MultiWOZ 2.1 with an average JGA of 36.7%. It further achieves cross-lingual transfer with state-of-the-art zero-shot results, 66.2% JGA from English to German and 75.7% JGA from English to Italian on WOZ 2.0.

AsyLex: A Dataset for Legal Language Processing of Refugee Claims
Claire Barale | Mark Klaisoongnoen | Pasquale Minervini | Michael Rovatsos | Nehal Bhuta
Proceedings of the Natural Legal Language Processing Workshop 2023

Advancements in natural language processing (NLP) and language models have demonstrated immense potential in the legal domain, enabling automated analysis and comprehension of legal texts. However, developing robust models in Legal NLP is significantly challenged by the scarcity of resources. This paper presents AsyLex, the first dataset specifically designed for Refugee Law applications to address this gap. The dataset introduces 59,112 documents on refugee status determination in Canada from 1996 to 2022, providing researchers and practitioners with essential material for training and evaluating NLP models for legal research and case review. Case review is defined as entity extraction and outcome prediction tasks. The dataset includes 19,115 gold-standard human-labeled annotations for 20 legally relevant entity types curated with the help of legal experts and 1,682 gold-standard labeled documents for the case outcome. Furthermore, we supply the corresponding trained entity extraction models and the resulting labeled entities generated through the inference process on AsyLex. Four supplementary features are obtained through rule-based extraction. We demonstrate the usefulness of our dataset on the legal judgment prediction task to predict the binary outcome and test a set of baselines using the text of the documents and our annotations. We observe that models pretrained on similar legal documents reach better scores, suggesting that acquiring more datasets for specialized domains such as law is crucial.

2022

MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction
Saadullah Amin | Pasquale Minervini | David Chang | Pontus Stenetorp | Guenter Neumann
Proceedings of the 29th International Conference on Computational Linguistics

Relation extraction in the biomedical domain is challenging due to the lack of labeled data and high annotation costs, needing domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw texts. Such a pipeline is prone to noise and has added challenges to scale for covering a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark MedDistant19 for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning the MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Lacking thorough evaluation with domain-specific language models, we also conduct experiments validating general domain relation extraction findings to biomedical relation extraction.

Logical Reasoning with Span-Level Predictions for Interpretable and Robust NLI Models
Joe Stacey | Pasquale Minervini | Haim Dubossarsky | Marek Rei
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Current Natural Language Inference (NLI) models achieve impressive results, sometimes outperforming humans when evaluating on in-distribution test sets. However, as these models are known to learn from annotation artefacts and dataset biases, it is unclear to what extent the models are learning the task of NLI instead of learning from shallow heuristics in their training data.We address this issue by introducing a logical reasoning framework for NLI, creating highly transparent model decisions that are based on logical rules. Unlike prior work, we show that improved interpretability can be achieved without decreasing the predictive accuracy. We almost fully retain performance on SNLI, while also identifying the exact hypothesis spans that are responsible for each model prediction.Using the e-SNLI human explanations, we verify that our model makes sensible decisions at a span level, despite not using any span labels during training. We can further improve model performance and the span-level decisions by using the e-SNLI explanations during training. Finally, our model is more robust in a reduced data setting. When training with only 1,000 examples, out-of-distribution performance improves on the MNLI matched and mismatched validation sets by 13% and 16% relative to the baseline. Training with fewer observations yields further improvements, both in-distribution and out-of-distribution.

An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
Yuxiang Wu | Yu Zhao | Baotian Hu | Pasquale Minervini | Pontus Stenetorp | Sebastian Riedel
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Access to external knowledge is essential for many natural language processing tasks, such as question answering and dialogue. Existing methods often rely on a parametric model that stores knowledge in its parameters, or use a retrieval-augmented model that has access to an external knowledge source. Parametric and retrieval-augmented models have complementary strengths in terms of computational efficiency and predictive accuracy. To combine the strength of both approaches, we propose the Efficient Memory-Augmented Transformer (EMAT) – it encodes external knowledge into a key-value memory and exploits the fast maximum inner product search for memory querying. We also introduce pre-training tasks that allow EMAT to encode informative key-value representations, and to learn an implicit strategy to integrate multiple memory slots into the transformer. Experiments on various knowledge-intensive tasks such as question answering and dialogue datasets show that, simply augmenting parametric models (T5-base) using our method produces more accurate results (e.g., 25.8 → 44.3 EM on NQ) while retaining a high throughput (e.g., 1000 queries/s on NQ). Compared to retrieval-augmented models, EMAT runs substantially faster across the board and produces more accurate results on WoW and ELI5.

2021

Training Adaptive Computation for Open-Domain Question Answering with Computational Constraints
Yuxiang Wu | Pasquale Minervini | Pontus Stenetorp | Sebastian Riedel
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Adaptive Computation (AC) has been shown to be effective in improving the efficiency of Open-Domain Question Answering (ODQA) systems. However, the current AC approaches require tuning of all model parameters, and training state-of-the-art ODQA models requires significant computational resources that may not be available for most researchers. We propose Adaptive Passage Encoder, an AC method that can be applied to an existing ODQA model and can be trained efficiently on a single GPU. It keeps the parameters of the base ODQA model fixed, but it overrides the default layer-by-layer computation of the encoder with an AC policy that is trained to optimise the computational efficiency of the model. Our experimental results show that our method improves upon a state-of-the-art model on two datasets, and is also more accurate than previous AC methods due to the stronger base ODQA model. All source code and datasets are available at https://github.com/uclnlp/APE.

Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models
Daniel de Vassimon Manela | David Errington | Thomas Fisher | Boris van Breugel | Pasquale Minervini
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper proposes two intuitive metrics, skew and stereotype, that quantify and analyse the gender bias present in contextual language models when tackling the WinoBias pronoun resolution task. We find evidence that gender stereotype correlates approximately negatively with gender skew in out-of-the-box models, suggesting that there is a trade-off between these two forms of bias. We investigate two methods to mitigate bias. The first approach is an online method which is effective at removing skew at the expense of stereotype. The second, inspired by previous work on ELMo, involves the fine-tuning of BERT using an augmented gender-balanced dataset. We show that this reduces both skew and stereotype relative to its unaugmented fine-tuned counterpart. However, we find that existing gender bias benchmarks do not fully probe professional bias as pronoun resolution may be obfuscated by cross-correlations from other manifestations of gender prejudice.

PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Patrick Lewis | Yuxiang Wu | Linqing Liu | Pasquale Minervini | Heinrich Küttler | Aleksandra Piktus | Pontus Stenetorp | Sebastian Riedel
Transactions of the Association for Computational Linguistics, Volume 9

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

2020

Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
Oana-Maria Camburu | Brendan Shillingford | Pasquale Minervini | Thomas Lukasiewicz | Phil Blunsom
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

To increase trust in artificial intelligence systems, a promising research direction consists of designing neural models capable of generating natural language explanations for their predictions. In this work, we show that such models are nonetheless prone to generating mutually inconsistent explanations, such as ”Because there is a dog in the image.” and ”Because there is no dog in the [same] image.”, exposing flaws in either the decision-making process of the model or in the generation of the explanations. We introduce a simple yet effective adversarial framework for sanity checking models against the generation of inconsistent natural language explanations. Moreover, as part of the framework, we address the problem of adversarial attacks with full target sequences, a scenario that was not previously addressed in sequence-to-sequence attacks. Finally, we apply our framework on a state-of-the-art neural natural language inference model that provides natural language explanations for its predictions. Our framework shows that this model is capable of generating a significant number of inconsistent explanations.

Don’t Read Too Much Into It: Adaptive Computation for Open-Domain Question Answering
Yuxiang Wu | Sebastian Riedel | Pasquale Minervini | Pontus Stenetorp
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Most approaches to Open-Domain Question Answering consist of a light-weight retriever that selects a set of candidate passages, and a computationally expensive reader that examines the passages to identify the correct answer. Previous works have shown that as the number of retrieved passages increases, so does the performance of the reader. However, they assume all retrieved passages are of equal importance and allocate the same amount of computation to them, leading to a substantial increase in computational cost. To reduce this cost, we propose the use of adaptive computation to control the computational budget allocated for the passages to be read. We first introduce a technique operating on individual passages in isolation which relies on anytime prediction and a per-layer estimation of early exit probability. We then introduce SKYLINEBUILDER, an approach for dynamically deciding on which passage to allocate computation at each step, based on a resource allocation policy trained via reinforcement learning. Our results on SQuAD-Open show that adaptive computation with global prioritisation improves over several strong static and adaptive methods, leading to a 4.3x reduction in computation while retaining 95% performance of the full model.

Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training
Joe Stacey | Pasquale Minervini | Haim Dubossarsky | Sebastian Riedel | Tim Rocktäschel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural Language Inference (NLI) datasets contain annotation artefacts resulting in spurious correlations between the natural language utterances and their respective entailment classes. These artefacts are exploited by neural networks even when only considering the hypothesis and ignoring the premise, leading to unwanted biases. Belinkov et al. (2019b) proposed tackling this problem via adversarial training, but this can lead to learned sentence representations that still suffer from the same biases. We show that the bias can be reduced in the sentence representations by using an ensemble of adversaries, encouraging the model to jointly decrease the accuracy of these different adversaries while fitting the data. This approach produces more robust NLI models, outperforming previous de-biasing efforts when generalised to 12 other NLI datasets (Belinkov et al., 2019a; Mahabadi et al., 2020). In addition, we find that the optimal number of adversarial classifiers depends on the dimensionality of the sentence representations, with larger sentence representations being more difficult to de-bias while benefiting from using a greater number of adversaries.

Undersensitivity in Neural Reading Comprehension
Johannes Welbl | Pasquale Minervini | Max Bartolo | Pontus Stenetorp | Sebastian Riedel
Findings of the Association for Computational Linguistics: EMNLP 2020

Current reading comprehension methods generalise well to in-distribution test sets, yet perform poorly on adversarially selected data. Prior work on adversarial inputs typically studies model oversensitivity: semantically invariant text perturbations that cause a model’s prediction to change. Here we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model’s prediction does not, even though it should. We formulate an adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. We demonstrate that models trained on both SQuAD2.0 and NewsQA are vulnerable to this attack, and then investigate data augmentation and adversarial training as defences. Both substantially decrease adversarial vulnerability, which generalises to held-out data and held-out attack spaces. Addressing undersensitivity furthermore improves model robustness on the previously introduced ADDSENT and ADDONESENT datasets, and models generalise better when facing train / evaluation distribution mismatch: they are less prone to overly rely on shallow predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1.

Don’t Read Too Much Into It: Adaptive Computation for Open-Domain Question Answering
Yuxiang Wu | Pasquale Minervini | Pontus Stenetorp | Sebastian Riedel
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

Most approaches to Open-Domain Question Answering consist of a light-weight retriever that selects a set of candidate passages, and a computationally expensive reader that examines the passages to identify the correct answer. Previous works have shown that as the number of retrieved passages increases, so does the performance of the reader. However, they assume all retrieved passages are of equal importance and allocate the same amount of computation to them, leading to a substantial increase in computational cost. To reduce this cost, we propose the use of adaptive computation to control the computational budget allocated for the passages to be read. We first introduce a technique operating on individual passages in isolation which relies on anytime prediction and a per-layer estimation of an early exit probability. We then introduce SKYLINEBUILDER, an approach for dynamically deciding on which passage to allocate computation at each step, based on a resource allocation policy trained via reinforcement learning. Our results on SQuAD-Open show that adaptive computation with global prioritisation improves over several strong static and adaptive methods, leading to a 4.3x reduction in computation while retaining 95% performance of the full model.

2019

NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language
Leon Weber | Pasquale Minervini | Jannes Münchmeyer | Ulf Leser | Tim Rocktäschel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its large linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and their composition from data, but lead to models that are difficult to interpret. In this paper, we describe a model combining neural networks with logic programming in a novel manner for solving multi-hop reasoning tasks over natural language. Specifically, we propose to use an Prolog prover which we extend to utilize a similarity function over pretrained sentence encoders. We fine-tune the representations for the similarity function via backpropagation. This leads to a system that can apply rule-based reasoning to natural language, and induce domain-specific natural language rules from training data. We evaluate the proposed system on two different question answering tasks, showing that it outperforms two baselines – BiDAF (Seo et al., 2016a) and FastQA( Weissenborn et al., 2017) on a subset of the WikiHop corpus and achieves competitive results on the MedHop data set (Welbl et al., 2017).

2018

Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge
Pasquale Minervini | Sebastian Riedel
Proceedings of the 22nd Conference on Computational Natural Language Learning

Adversarial examples are inputs to machine learning models designed to cause the model to make a mistake. They are useful for understanding the shortcomings of machine learning models, interpreting their results, and for regularisation. In NLP, however, most example generation strategies produce input text by using known, pre-specified semantic transformations, requiring significant manual effort and in-depth understanding of the problem and domain. In this paper, we investigate the problem of automatically generating adversarial examples that violate a set of given First-Order Logic constraints in Natural Language Inference (NLI). We reduce the problem of identifying such adversarial examples to a combinatorial optimisation problem, by maximising a quantity measuring the degree of violation of such constraints and by using a language model for generating linguistically-plausible examples. Furthermore, we propose a method for adversarially regularising neural NLI models for incorporating background knowledge. Our results show that, while the proposed method does not always improve results on the SNLI and MultiNLI datasets, it significantly and consistently increases the predictive accuracy on adversarially-crafted datasets – up to a 79.6% relative improvement – while drastically reducing the number of background knowledge violations. Furthermore, we show that adversarial examples transfer among model architectures, and that the proposed adversarial training procedure improves the robustness of NLI models to adversarial examples.

Jack the Reader – A Machine Reading Framework
Dirk Weissenborn | Pasquale Minervini | Isabelle Augenstein | Johannes Welbl | Tim Rocktäschel | Matko Bošnjak | Jeff Mitchell | Thomas Demeester | Tim Dettmers | Pontus Stenetorp | Sebastian Riedel
Proceedings of ACL 2018, System Demonstrations

Many Machine Reading and Natural Language Understanding tasks require reading supporting text in order to answer questions. For example, in Question Answering, the supporting text can be newswire or Wikipedia articles; in Natural Language Inference, premises can be seen as the supporting text and hypotheses as questions. Providing a set of useful primitives operating in a single framework of related tasks would allow for expressive modelling, and easier model comparison and replication. To that end, we present Jack the Reader (JACK), a framework for Machine Reading that allows for quick model prototyping by component reuse, evaluation of new models on existing datasets as well as integrating new datasets and applying them on a growing set of implemented baseline models. JACK is currently supporting (but not limited to) three tasks: Question Answering, Natural Language Inference, and Link Prediction. It is developed with the aim of increasing research efficiency and code reuse.

Extrapolation in NLP
Jeff Mitchell | Pontus Stenetorp | Pasquale Minervini | Sebastian Riedel
Proceedings of the Workshop on Generalization in the Age of Deep Learning

We argue that extrapolation to unseen data will often be easier for models that capture global structures, rather than just maximise their local fit to the training data. We show that this is true for two popular models: the Decomposable Attention Model and word2vec.

2009

Apertium goes SOA: an efficient and scalable service based on the Apertium rule-based machine translation platform
Pasquale Minervini
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

Service Oriented Architecture (SOA) is a paradigm for organising and using distributed services that may be under the control of different ownership domains and implemented using various technology stacks. In some contexts, an organisation using an IT infrastructure implementing the SOA paradigm can take a great benefit from the integration, in its business processes, of efficient machine translation (MT) services to overcome language barriers. This paper describes the architecture and the design patterns used to develop an MT service that is efficient, scalable and easy to integrate in new and existing business processes. The service is based on Apertium, a free/opensource rule-based machine translation platform.

Co-authors

Aryo Pradipta Gema 5

Beatrice Alex 4

Oana-Maria Camburu 4

Isabelle Augenstein 3

Alessio Devoto 3

Haim Dubossarsky 3

Tim Rocktäschel 3

Emile Van Krieken 3

Erik Arakelyan 2

Claire Barale 2

Mohammad Reza Ghasemi Madani 2

Patrick Lewis 2

Jeff Mitchell 2

Edoardo Maria Ponti 2

Johannes Welbl 2

Ahmed Abdulaal 1

Oshin Agarwal 1

Emily Allaway 1

Saadullah Amin 1

Matko Bosnjak 1

Gerard De Melo 1

Thomas Demeester 1

David Errington 1

Thomas Fisher 1

Gayane Ghazaryan 1

Joshua Harris 1

Xuan-Jing Huang (黄萱菁) 1

Ignacio Iacobacci 1

Lucie-Aimée Kaffee 1

Mark Klaisoongnoen 1

Wai Chung Kwan 1

Heinrich Küttler 1

Joshua Ong Jun Leang 1

Thomas Lukasiewicz 1

Nikolay Malkin 1

Alberto Carlo Maria Mancino 1

Marta Marchiori Manerba 1

Robert McHardy 1

Piotr Miłoś 1

Jannes Münchmeyer 1

Günter Neumann 1

Andrea Passerini 1

Laura Perez-Beltrachini 1

Aleksandra Piktus 1

Michael Rovatsos 1

Benjamin I. P. Rubinstein 1

Mardhiyah Sanni 1

Amrutha Saseendran 1

Simone Scardapane 1

Brendan Shillingford 1

Jacopo Staiano 1

Konrad Staniszewski 1

Philip Alexander Teare 1

Szymon Tworkowski 1

Dirk Weissenborn 1

Zhanqiu Zhang 1

Jingjie Zheng 1

Daniel de Vassimon Manela 1

Boris van Breugel 1

Venues