Andreas Vlachos - ACL Anthology

Andreas Vlachos

2025

Conformity in Large Language Models
Xiaochen Zhu | Caiqi Zhang | Tom Stafford | Nigel Collier | Andreas Vlachos
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions—Devil’s Advocate and Question Distillation—to mitigate conformity, providing insights into building more robust language models.

Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
Xiaochen Zhu | Georgi Karadzhov | Chenxi Whitehouse | Andreas Vlachos
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Diffusion models have shown promise in text generation, but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion doesn’t model word-order dependencies explicitly and operates on short, fixed output windows, while passage-level diffusion struggles with learning robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into multiple latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on four datasets demonstrate that, when compared to other diffusion and autoregressive baselines, SLD achieves competitive or superior fluency, coherence, and contextual compatibility in automatic and human evaluations.

Mitigating Shortcut Learning with InterpoLated Learning
Michalis Korakakis | Andreas Vlachos | Adrian Weller
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Empirical risk minimization (ERM) incentivizes models to exploit shortcuts, i.e., spurious correlations between input attributes and labels that are prevalent in the majority of the training data but unrelated to the task at hand. This reliance hinders generalization on minority examples, where such correlations do not hold. Existing shortcut mitigation approaches are model-specific, difficult to tune, computationally expensive, and fail to improve learned representations. To address these issues, we propose InterpoLated Learning (InterpoLL) which interpolates the representations of majority examples to include features from intra-class minority examples with shortcut-mitigating patterns. This weakens shortcut influence, enabling models to acquire features predictive across both minority and majority examples. Experimental results on multiple natural language understanding tasks demonstrate that InterpoLL improves minority generalization over both ERM and state-of-the-art mitigation methods, without compromising accuracy on majority examples. Notably, these gains persist across encoder, encoder-decoder, and decoder-only architectures, demonstrating the method’s broad applicability.

Causal Estimation of Tokenisation Bias
Pietro Lesci | Clara Meister | Thomas Hofmann | Andreas Vlachos | Tiago Pimentel
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser—which maps character-strings to subwords—should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as **tokenisation bias**. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., ⟨ hello ⟩) in a tokeniser’s vocabulary on the probability a trained model assigns to the corresponding characters (i.e., “hello”). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser’s vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models’ outputs across scales, vocabularies, and tokenisers. Notably, a subword’s presence in a small model’s vocabulary may increase its characters’ probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
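The regression discontinuity idea in this abstract can be illustrated with a minimal sketch. It assumes a toy setting that is not taken from the paper: each subword has a rank, subwords with rank at or below a cutoff K are in the vocabulary, and the outcome is a log-probability. The function names, the bandwidth parameter, and the synthetic data are all hypothetical; the paper's actual estimator may differ.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_effect(ranks, outcomes, cutoff, bandwidth):
    """Local-linear RDD estimate: fit one line on each side of the cutoff
    (within the bandwidth) and take the gap between them at the cutoff."""
    inside = [(r, y) for r, y in zip(ranks, outcomes)
              if cutoff - bandwidth <= r <= cutoff]      # in vocabulary
    outside = [(r, y) for r, y in zip(ranks, outcomes)
               if cutoff < r <= cutoff + bandwidth]      # just missed the cutoff
    a_in, b_in = fit_line(*zip(*inside))
    a_out, b_out = fit_line(*zip(*outside))
    return (a_in + b_in * cutoff) - (a_out + b_out * cutoff)


# Synthetic data (an assumption): log-probability falls linearly with rank,
# and in-vocabulary subwords get a constant boost of 1.0.
K = 100
ranks = list(range(1, 201))
logps = [-5.0 - 0.01 * r + (1.0 if r <= K else 0.0) for r in ranks]
effect = rdd_effect(ranks, logps, cutoff=K, bandwidth=20)
```

On this synthetic data the estimate recovers the planted boost, because subwords just above and just below the cutoff are otherwise comparable.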

TCP: a Benchmark for Temporal Constraint-Based Planning
Zifeng Ding | Sikuan Yan | Moy Yuan | Xianglong Hu | Fangru Lin | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, which jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we generate abstract problem prototypes that are then paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models may struggle with TCP, highlighting its difficulty and revealing limitations in LLMs’ temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.

Improving Zero-shot Sentence Decontextualisation with Content Selection and Planning
Zhenyun Deng | Yulong Chen | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Extracting individual sentences from a document as evidence or reasoning steps is commonly done in many NLP tasks. However, extracted sentences often lack context necessary to make them understood, e.g., coreference and background information. To this end, we propose a content selection and planning framework for zero-shot decontextualisation, which determines what content should be mentioned and in what order for a sentence to be understood out of context. Specifically, given a potentially ambiguous sentence and its context, we first segment it into basic semantically-independent units. We then identify potentially ambiguous units from the given sentence, and extract relevant units from the context based on their discourse relations. Finally, we generate a content plan to rewrite the sentence by enriching each ambiguous unit with its relevant units. Experimental results demonstrate that our approach is competitive for sentence decontextualisation, producing sentences that exhibit better semantic integrity and discourse coherence, outperforming existing methods.

Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts
Eric Chamoun | Nedjma Ousidhoum | Michael Sejr Schlichtkrull | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications when researchers claim that their findings have real-world impact. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset—achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
Marek Strong | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 287 real-world claims sourced from 38 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of 𝜅 = 0.745 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even the state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.37 accuracy score on verdicts and an Ev²R score of 48.63 on verdict justifications.

PledgeTracker: A System for Monitoring the Fulfilment of Pledges
Yulong Chen | Michael Sejr Schlichtkrull | Zhenyun Deng | David Corney | Nasim Asl | Joshua Salisbury | Andrew Dudfield | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Political pledges reflect candidates’ policy commitments, but tracking their fulfilment requires reasoning over incremental evidence distributed across multiple, dynamically updated sources. Existing methods simplify this task into a document classification task, overlooking its dynamic temporal and multi-document nature. To address this issue, we introduce PledgeTracker, a system that reformulates pledge verification into structured event timeline construction. PledgeTracker consists of three core components: (1) a multi-step evidence retrieval module; (2) a timeline construction module; and (3) a fulfilment filtering module, allowing the capture of the evolving nature of pledge fulfilment and producing interpretable and structured timelines. We evaluate PledgeTracker in collaboration with professional fact-checkers in real-world workflows, demonstrating its effectiveness in retrieving relevant evidence and reducing human verification effort.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Valentina Pyatkin | Andreas Vlachos
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)
Mubashara Akhtar | Rami Aly | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

The 2nd Automated Verification of Textual Claims (AVeriTeC) Shared Task: Open-weights, Reproducible and Efficient Systems
Mubashara Akhtar | Rami Aly | Yulong Chen | Zhenyun Deng | Michael Schlichtkrull | Chenxi Whitehouse | Andreas Vlachos
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

In the First Automated Verification of Textual Claims (AVeriTeC) shared task, participating teams developed systems that for each claim retrieve evidence from the web and predict its veracity. While there was progress in automated fact-checking for real-world claims, the majority of the systems proposed relied on closed-weights large language models, which rendered them expensive to run and less reproducible. To ameliorate this issue, in this year’s edition of the AVeriTeC shared task we required systems to use only open-weights models that could be run using a single GPU with 23GB of RAM, and to take one minute or less to return verdicts accompanied by evidence retrieved from a precompiled knowledge store. The shared task received 7 submissions, 6 of which exceeded the accuracy of our baseline on the test set while running in under a minute per claim on the hardware we had specified. The winning team was CTU AIC with an AVeriTeC score of 33.17%. In this paper we describe the shared task in detail and highlight key findings.

Dis2Dis: Explaining Ambiguity in Fact-Checking
Ieva Staliunaite | Andreas Vlachos
Findings of the Association for Computational Linguistics: NAACL 2025

Ambiguity is a linguistic tool for encoding information efficiently, yet it also causes misunderstandings and disagreements. It is particularly relevant to the domain of misinformation, as fact-checking ambiguous claims is difficult even for experts. In this paper we argue that instead of predicting a veracity label for which there is genuine disagreement, it would be more beneficial to explain the ambiguity. Thus, this work introduces claim disambiguation, a constrained generation task, for explaining ambiguous claims in fact-checking. This involves editing them to spell out an interpretation that can then be unequivocally supported by the given evidence. We collect a dataset of 1501 such claim revisions and conduct experiments with sequence-to-sequence models. The performance is compared to a simple copy baseline and a Large Language Model baseline. The best results are achieved by employing Minimum Bayes Decoding, with a BertScore F1 of 92.22. According to human evaluation, the model successfully disambiguates the claims 72% of the time.

MTPA: MultiTask Personalization Assessment
Matthieu Tehenan | Eric Chamoun | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models are increasingly expected to adapt to individual users, reflecting differences in preferences, values, and communication styles. To evaluate whether models can serve diverse populations, we introduce MTPA, a benchmark that leverages large-scale survey data (WVS, EVS, GSS) to construct real, hyper-granular personas spanning demographics, beliefs, and values. Unlike prior benchmarks that rely on synthetic profiles or narrow trait prediction, MTPA conditions models on real personas and systematically tests their behavior across core alignment tasks. We show that persona conditioning exposes pluralistic misalignment: while aggregate metrics suggest models are truthful and safe, subgroup-specific evaluations reveal hidden pockets of degraded factuality, fairness disparities, and inconsistent value alignment. Alongside the benchmark, we release a dataset, toolkit, and baseline evaluations. MTPA is designed with extensibility and sustainability in mind: as the underlying survey datasets are regularly updated, MTPA supports regular integration of new populations and user traits.

A Bayesian Optimization Approach to Machine Translation Reranking
Julius Cheng | Maike Züfle | Vilém Zouhar | Andreas Vlachos
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Reranking, or scoring a list of prediction candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate, remains a simple and effective method for improving prediction quality. However, reranking with high quality scoring models can add substantial computational cost to the translation pipeline, which we address in this work by framing list reranking as a Bayesian optimization (BayesOpt) problem over the candidate list, where unknown scores are modeled with a Gaussian process. This algorithm scores candidates iteratively, choosing next candidates by balancing between exploration, choosing to score those that differ from candidates already scored, and exploitation, choosing to score those that resemble high-scoring candidates. This procedure finds high-scoring candidates while scoring only a fraction of the candidates list; given candidate lists of 200 random samples (before deduplication), our method achieves the same CometKiwi score using only 70 scoring evaluations on average compared to scoring a random subset of 180 candidates. We also propose multi-fidelity BayesOpt for list reranking, where scores obtained from a noisier but cheaper proxy scoring model are incorporated into the search process. We show that well-trained distilled proxy scorers can further improve the performance of BayesOpt.

Uncertain (Mis)Takes at LeWiDi-2025: Modeling Human Label Variation With Semantic Entropy
Ieva Raminta Staliūnaitė | Andreas Vlachos
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP

The VariErrNLI task requires detecting the degree to which each Natural Language Inference (NLI) label is acceptable to a group of annotators. This paper presents an approach to VariErrNLI which incorporates measures of uncertainty, namely Semantic Entropy (SE), to model human label variation. Our method is based on the assumption that if two labels are plausible alternatives, then their explanations must be non-contradictory. We measure SE over Large Language Model (LLM)-generated explanations for a given NLI label, which represents the model uncertainty over the semantic space of possible explanations for that label. The system employs SE scores combined with an encoding of the inputs and generated explanations, and reaches a 0.31 Manhattan distance score on the test set, ranking joint first in the soft evaluation of VariErrNLI.

2024

Document-level Claim Extraction and Decontextualisation for Fact-Checking
Zhenyun Deng | Michael Schlichtkrull | Andreas Vlachos
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Selecting which claims to check is a time-consuming task for human fact-checkers, especially from documents consisting of multiple sentences and containing multiple claims. However, existing claim extraction approaches focus more on identifying and extracting claims from individual sentences, e.g., identifying whether a sentence contains a claim or the exact boundaries of the claim within a sentence. In this paper, we propose a method for document-level claim extraction for fact-checking, which aims to extract check-worthy claims from documents and decontextualise them so that they can be understood out of context. Specifically, we first recast claim extraction as extractive summarization in order to identify central sentences from documents, then rewrite them to include necessary context from the originating document through sentence decontextualisation. Evaluation with both automatic metrics and a fact-checking professional shows that our method is able to extract check-worthy claims from documents at a higher rate than previous work, while also improving evidence retrieval.

Causal Estimation of Memorisation Profiles
Pietro Lesci | Clara Meister | Thomas Hofmann | Andreas Vlachos | Tiago Pimentel
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding memorisation in language models has practical and societal implications, e.g., studying models’ training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model’s ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model’s memorisation profile–its memorisation trends across training–by only observing its behaviour on a small set of instances throughout training. In experiments with the Pythia model suite, we find that memorisation (i) is stronger and more persistent in larger models, (ii) is determined by data order and learning rate, and (iii) has stable trends across model sizes, thus making memorisation in larger models predictable from smaller ones.
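As an illustration of the difference-in-differences design this abstract refers to, here is a minimal sketch of the textbook two-group, two-period estimator. The function name and the toy log-likelihood values are hypothetical, and the paper's actual estimator is likely more elaborate; this only shows the basic arithmetic.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group.
    The control group's change stands in for the unobserved counterfactual:
    what would have happened to the treated instances without training on them."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Toy log-likelihoods (an assumption): "treated" instances were seen in
# training between the two checkpoints, "control" instances were not.
estimate = diff_in_diff(
    treated_before=[-3.0, -2.8],
    treated_after=[-1.5, -1.3],
    control_before=[-3.1, -2.9],
    control_after=[-2.6, -2.4],
)
```

Here the treated instances improve by 1.5 and the control instances by 0.5, so the part of the improvement attributed to having trained on the instances is 1.0.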

Measuring Uncertainty in Neural Machine Translation with Similarity-Sensitive Entropy
Julius Cheng | Andreas Vlachos
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Uncertainty estimation is an important diagnostic tool for statistical models, and is often used to assess the confidence of model predictions. Previous work shows that neural machine translation (NMT) is an intrinsically uncertain task where there are often multiple correct and semantically equivalent translations, and that well-trained NMT models produce good translations despite spreading probability mass among many semantically similar translations. These findings suggest that popular measures of uncertainty based on token- and sequence-level entropies which measure surface form diversity may not be good proxies of the more useful quantity of interest, semantic diversity. We propose to adapt similarity-sensitive Shannon entropy (S3E), a concept borrowed from theoretical ecology, for NMT. By demonstrating significantly improved correlation between S3E and task performance on quality estimation and named entity recall, we show that S3E is a useful framework for measuring uncertainty in NMT.
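To make the contrast between surface-form and semantic diversity concrete, the sketch below uses one formulation of similarity-sensitive entropy from the ecology literature, H = -Σ_i p_i log Σ_j s_ij p_j. Whether this matches the paper's exact definition of S3E is an assumption, and the function name and toy inputs are hypothetical.

```python
import math


def similarity_sensitive_entropy(probs, sim):
    """H = -sum_i p_i * log(sum_j sim[i][j] * p_j).
    With an identity similarity matrix this is ordinary Shannon entropy;
    as items become more similar, the entropy shrinks towards zero."""
    total = 0.0
    for i, p_i in enumerate(probs):
        if p_i == 0.0:
            continue  # zero-probability items contribute nothing
        expected_similarity = sum(sim[i][j] * p_j for j, p_j in enumerate(probs))
        total -= p_i * math.log(expected_similarity)
    return total


probs = [0.5, 0.5]                     # two sampled translations
distinct = [[1.0, 0.0], [0.0, 1.0]]    # treated as unrelated
paraphrases = [[1.0, 1.0], [1.0, 1.0]] # treated as semantically identical

h_distinct = similarity_sensitive_entropy(probs, distinct)
h_paraphrases = similarity_sensitive_entropy(probs, paraphrases)
```

With unrelated translations the value equals the Shannon entropy log 2, while two perfect paraphrases give zero: the model is spreading probability over surface forms, not over meanings.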

Do We Need Language-Specific Fact-Checking Models? The Case of Chinese
Caiqi Zhang | Zhijiang Guo | Andreas Vlachos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese using the CHEF dataset. To better reflect real-world fact-checking, we first develop a novel Chinese document-level evidence retriever, achieving state-of-the-art performance. We then demonstrate the limitations of translation-based methods and multilingual language models, highlighting the need for language-specific systems. To better analyze token-level biases in different systems, we construct an adversarial dataset based on the CHEF dataset, where each instance has a large word overlap with the original one but holds the opposite veracity label. Experimental results on the CHEF dataset and our adversarial dataset show that our proposed method outperforms translation-based methods and multilingual language models and is more robust toward biases, emphasizing the importance of language-specific fact-checking systems.

An LLM Feature-based Framework for Dialogue Constructiveness Assessment
Lexin Zhou | Youmna Farag | Andreas Vlachos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructiveness outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). In this paper we propose an LLM feature-based framework for dialogue constructiveness assessment that combines the strengths of feature-based and neural approaches, while mitigating their downsides. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models outperform or perform at least as well as standard feature-based models and neural models. We also find that the LLM feature-based model learns more robust prediction rules instead of relying on superficial shortcuts, which often trouble neural models.

ALVIN: Active Learning Via INterpolation
Michalis Korakakis | Andreas Vlachos | Adrian Weller
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Active Learning aims to minimize annotation effort by selecting the most useful instances from a pool of unlabeled data. However, typical active learning methods overlook the presence of distinct example groups within a class, whose prevalence may vary, e.g., in occupation classification datasets certain demographics are disproportionately represented in specific classes. This oversight causes models to rely on shortcuts for predictions, i.e., spurious correlations between input attributes and labels occurring in well-represented groups. To address this issue, we propose Active Learning Via INterpolation (ALVIN), which conducts intra-class interpolations between examples from under-represented and well-represented groups to create anchors, i.e., artificial points situated between the example groups in the representation space. By selecting instances close to the anchors for annotation, ALVIN identifies informative examples exposing the model to regions of the representation space that counteract the influence of shortcuts. Crucially, since the model considers these examples to be of high certainty, they are likely to be ignored by typical active learning methods. Experimental results on six datasets encompassing sentiment analysis, natural language inference, and paraphrase detection demonstrate that ALVIN outperforms state-of-the-art active learning methods in both in-distribution and out-of-distribution generalization.
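The anchor construction and selection steps described in this abstract can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the mixing coefficient `lam`, the use of Euclidean distance, and all function names are assumptions.

```python
import math


def interpolate(under_rep, well_rep, lam):
    """Anchor = lam * under-represented example + (1 - lam) * well-represented
    example, both taken from the same class, in representation space."""
    return [lam * u + (1.0 - lam) * w for u, w in zip(under_rep, well_rep)]


def select_near_anchors(anchors, pool, k):
    """Return indices of the k unlabelled pool items closest to any anchor."""
    def distance_to_nearest_anchor(index):
        return min(math.dist(pool[index], anchor) for anchor in anchors)
    return sorted(range(len(pool)), key=distance_to_nearest_anchor)[:k]


# Toy 2-D representations (an assumption).
anchor = interpolate(under_rep=[0.0, 0.0], well_rep=[2.0, 2.0], lam=0.5)
pool = [[0.9, 1.1], [5.0, 5.0], [1.0, 1.2], [-3.0, 0.0]]
chosen = select_near_anchors([anchor], pool, k=2)
```

The two pool items nearest the midpoint anchor are selected for annotation, even though an uncertainty-based strategy might have ignored them.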

Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)
Michael Schlichtkrull | Yulong Chen | Chenxi Whitehouse | Zhenyun Deng | Mubashara Akhtar | Rami Aly | Zhijiang Guo | Christos Christodoulopoulos | Oana Cocarascu | Arpit Mittal | James Thorne | Andreas Vlachos
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

The Automated Verification of Textual Claims (AVeriTeC) Shared Task
Michael Schlichtkrull | Yulong Chen | Chenxi Whitehouse | Zhenyun Deng | Mubashara Akhtar | Rami Aly | Zhijiang Guo | Christos Christodoulopoulos | Oana Cocarascu | Arpit Mittal | James Thorne | Andreas Vlachos
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using the AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.
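The scoring rule described in this abstract, where a claim counts only if the verdict is correct and the evidence is good enough, can be sketched as below. The per-claim `evidence_score` field, the `>=` comparison, and the threshold value are assumptions for illustration; how the shared task actually scores evidence quality is not specified here.

```python
def averitec_style_score(examples, threshold):
    """Fraction of claims whose predicted verdict matches the gold verdict
    AND whose retrieved evidence meets the quality threshold."""
    hits = sum(
        1
        for ex in examples
        if ex["pred"] == ex["gold"] and ex["evidence_score"] >= threshold
    )
    return hits / len(examples)


# Hypothetical system outputs.
examples = [
    {"pred": "Supported", "gold": "Supported", "evidence_score": 0.40},
    {"pred": "Refuted", "gold": "Refuted", "evidence_score": 0.10},    # weak evidence
    {"pred": "Supported", "gold": "Refuted", "evidence_score": 0.90},  # wrong verdict
    {"pred": "Refuted", "gold": "Refuted", "evidence_score": 0.25},
]
score = averitec_style_score(examples, threshold=0.25)
```

Only the first and last claims count, so a system cannot score well by guessing verdicts without retrieving adequate evidence.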

Automated Focused Feedback Generation for Scientific Writing Assistance
Eric Chamoun | Michael Schlichtkrull | Andreas Vlachos
Findings of the Association for Computational Linguistics: ACL 2024

Scientific writing is a challenging task, particularly for novice researchers who often rely on feedback from experienced peers. Recent work has primarily focused on improving surface form and style rather than manuscript content. In this paper, we propose a novel task: automated focused feedback generation for scientific writing assistance. We present SWIF²T: a Scientific WrIting Focused Feedback Tool. It is designed to generate specific, actionable and coherent comments, which identify weaknesses in a scientific paper and/or propose revisions to it. Our approach consists of four components - planner, investigator, reviewer and controller - leveraging multiple Large Language Models (LLMs) to implement them. We compile a dataset of 300 peer reviews citing weaknesses in scientific papers and conduct human evaluation. The results demonstrate the superiority in specificity, reading comprehension, and overall helpfulness of SWIF²T’s feedback compared to other approaches. In our analysis, we also identified cases where automatically generated reviews were judged better than human ones, suggesting opportunities for integration of AI-generated feedback in scientific writing.

Zero-Shot Fact Verification via Natural Logic and Large Language Models
Marek Strong | Rami Aly | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2024

The recent development of fact verification systems with natural logic has enhanced their explainability by aligning claims with evidence through set-theoretic operators, providing faithful justifications. Despite these advancements, such systems often rely on a large amount of training data annotated with natural logic. To address this issue, we propose a zero-shot method that utilizes the generalization capabilities of instruction-tuned large language models. To comprehensively assess the zero-shot capabilities of our method and other fact verification systems, we evaluate all models on both artificial and real-world claims, including multilingual datasets. We also compare our method against other fact verification systems in two setups. First, in the zero-shot generalization setup, we demonstrate that our approach outperforms other systems that were not specifically trained on natural logic data, achieving an average accuracy improvement of 8.96 points over the best-performing baseline. Second, in the zero-shot transfer setup, we show that current systems trained on natural logic data do not generalize well to other domains, and our method outperforms these systems across all datasets with real-world claims.

Zero-Shot Fact-Checking with Semantic Triples and Knowledge Graphs
Moy Yuan | Andreas Vlachos
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

Despite progress in automated fact-checking, most systems require a significant amount of labeled training data, which is expensive. In this paper, we propose a novel zero-shot method, which instead of operating directly on the claim and evidence sentences, decomposes them into semantic triples augmented using external knowledge graphs, and uses large language models trained for natural language inference. This allows it to generalize to adversarial datasets and domains that supervised models require specific training data for. Our empirical results show that our approach outperforms previous zero-shot approaches on FEVER, FEVER-Symmetric, FEVER 2.0, and Climate-FEVER, while being comparable or better than supervised models on the adversarial and the out-of-domain datasets.

AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets
Pietro Lesci | Andreas Vlachos
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Active learning for imbalanced classification tasks is challenging as the minority classes naturally occur rarely. Gathering a large pool of unlabelled data is thus essential to capture minority instances. Standard pool-based active learning is computationally expensive on large pools and often reaches low accuracy by overfitting the initial decision boundary, thus failing to explore the input space and find minority instances. To address these issues we propose AnchorAL. At each iteration, AnchorAL chooses class-specific instances from the labelled set, or *anchors*, and retrieves the most similar unlabelled instances from the pool. This resulting *subpool* is then used for active learning. Using a small, fixed-sized subpool, AnchorAL allows scaling any active learning strategy to large pools. By dynamically selecting different anchors at each iteration it promotes class balance and prevents overfitting the initial decision boundary, thus promoting the discovery of new clusters of minority instances. Experiments across different classification tasks, active learning strategies, and model architectures show that AnchorAL *(i)* is faster, often reducing runtime from hours to minutes, *(ii)* trains more performant models, and *(iii)* returns more balanced datasets than competing methods.

AmbiFC: Fact-Checking Ambiguous Claims with Evidence
Max Glockner | Ieva Staliūnaitė | James Thorne | Gisela Vallejo | Andreas Vlachos | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 12

Automated fact-checking systems verify claims against evidence to predict their veracity. In real-world scenarios, the retrieved evidence may not unambiguously support or refute the claim and yield conflicting but valid interpretations. Existing fact-checking datasets assume that the models developed with them predict a single veracity label for each claim, thus discouraging the handling of such ambiguity. To address this issue we present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs. It contains fine-grained evidence annotations of 50k passages from 5k Wikipedia pages. We analyze the disagreements arising from ambiguity when comparing claims against evidence in AmbiFC, observing a strong correlation of annotator disagreement with linguistic phenomena such as underspecification and probabilistic reasoning. We develop models for predicting veracity handling this ambiguity via soft labels, and find that a pipeline that learns the label distribution for sentence-level evidence selection and veracity prediction yields the best performance. We compare models trained on different subsets of AmbiFC and show that models trained on the ambiguous instances perform better when faced with the identified linguistic phenomena.

TabVer: Tabular Fact Verification with Natural Logic
Rami Aly | Andreas Vlachos
Transactions of the Association for Computational Linguistics, Volume 12

Fact verification on tabular evidence incentivizes the use of symbolic reasoning models where a logical form is constructed (e.g., a LISP-style program), providing greater verifiability than fully neural approaches. However, these logical forms typically rely on well-formed tables, restricting their use in many scenarios. An emerging symbolic reasoning paradigm for textual evidence focuses on natural logic inference, which constructs proofs by modeling set-theoretic relations between a claim and its evidence in natural language. This approach provides flexibility and transparency but is less compatible with tabular evidence since the relations do not extend to arithmetic functions. We propose a set-theoretic interpretation of numerals and arithmetic functions in the context of natural logic, enabling the integration of arithmetic expressions in deterministic proofs. We leverage large language models to generate arithmetic expressions by generating questions about salient parts of a claim which are answered by executing appropriate functions on tables. In a few-shot setting on FEVEROUS, we achieve an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models by 3.4 points. When evaluated on TabFact without any further training, our method remains competitive with an accuracy lead of 0.5 points.

2023

Improving the robustness of NLI models with minimax training
Michalis Korakakis | Andreas Vlachos
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language inference (NLI) models are susceptible to learning shortcuts, i.e. decision rules that spuriously correlate with the label. As a result, they achieve high in-distribution performance, but fail to generalize to out-of-distribution samples where such correlations do not hold. In this paper, we present a training method to reduce the reliance of NLI models on shortcuts and improve their out-of-distribution performance without assuming prior knowledge of the shortcuts being targeted. To this end, we propose a minimax objective between a learner model being trained for the NLI task, and an auxiliary model aiming to maximize the learner’s loss by up-weighting examples from regions of the input space where the learner incurs high losses. This process incentivizes the learner to focus on under-represented “hard” examples with patterns that contradict the shortcuts learned from the prevailing “easy” examples. Experimental results on three NLI datasets demonstrate that our method consistently outperforms other robustness enhancing techniques on out-of-distribution adversarial test sets, while maintaining high in-distribution accuracy.

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Andreas Vlachos | Isabelle Augenstein
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

QA-NatVer: Question Answering for Natural Logic-based Fact Verification
Rami Aly | Marek Strong | Andreas Vlachos
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Fact verification systems assess a claim’s veracity based on evidence. An important consideration in designing them is faithfulness, i.e. generating explanations that accurately reflect the reasoning of the model. Recent works have focused on natural logic, which operates directly on natural language by capturing the semantic relation of spans between an aligned claim with its evidence via set-theoretic operators. However, these approaches rely on substantial resources for training, which are only available for high-resource languages. To this end, we propose to use question answering to predict natural logic operators, taking advantage of the generalization capabilities of instruction-tuned language models. Thus, we obviate the need for annotated training data while still relying on a deterministic inference system. In a few-shot setting on FEVER, our approach outperforms the best baseline by 4.3 accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system, as well as a state-of-the-art prompt-based classifier. Our system demonstrates its robustness and portability, achieving competitive performance on a counterfactual dataset and surpassing all approaches without further annotation on a Danish verification dataset. A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.

Faster Minimum Bayes Risk Decoding with Confidence-based Pruning
Julius Cheng | Andreas Vlachos
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Minimum Bayes risk (MBR) decoding outputs the hypothesis with the highest expected utility over the model distribution for some utility function. It has been shown to improve accuracy over beam search in conditional language generation problems and especially neural machine translation, in both human and automatic evaluations. However, the standard sampling-based algorithm for MBR is substantially more computationally expensive than beam search, requiring a large number of samples as well as a quadratic number of calls to the utility function, limiting its applicability. We describe an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling. Our method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy. We demonstrate the effectiveness of our approach in experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics.

Automated Fact-Checking in Dialogue: Are Specialized Models Needed?
Eric Chamoun | Marzieh Saeidi | Andreas Vlachos
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Prior research has shown that typical fact-checking models for stand-alone claims struggle with claims made in conversation. As a solution, fine-tuning these models on dialogue data has been proposed. However, creating separate models for each use case is impractical, and we show that fine-tuning models for dialogue results in poor performance on typical fact-checking. To overcome this challenge, we present techniques that allow us to use the same models for both dialogue and typical fact-checking. These mainly focus on retrieval adaptation and transforming conversational inputs so that they can be accurately processed by models trained on stand-alone claims. We demonstrate that a typical fact-checking model incorporating these techniques is competitive with state-of-the-art models for dialogue, while maintaining its performance on stand-alone claims.

Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER)
Mubashara Akhtar | Rami Aly | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER)

Findings of the Association for Computational Linguistics: EACL 2023
Andreas Vlachos | Isabelle Augenstein
Findings of the Association for Computational Linguistics: EACL 2023

Multimodal Automated Fact-Checking: A Survey
Mubashara Akhtar | Michael Schlichtkrull | Zhijiang Guo | Oana Cocarascu | Elena Simperl | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2023

Misinformation is often conveyed in multiple modalities, e.g. a miscaptioned image. Multimodal misinformation is perceived as more credible by humans, and spreads faster than its text-only counterparts. While an increasing body of research investigates automated fact-checking (AFC), previous surveys mostly focus on text. In this survey, we conceptualise a framework for AFC including subtasks unique to multimodal misinformation. Furthermore, we discuss related terms used in different communities and map them to our framework. We focus on four modalities prevalent in real-world fact-checking: text, image, audio, and video. We survey benchmarks and models, and discuss limitations and promising directions for future research.

The Intended Uses of Automated Fact-Checking Artefacts: Why, How and Who
Michael Schlichtkrull | Nedjma Ousidhoum | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2023

Automated fact-checking is often presented as an epistemic tool that fact-checkers, social media consumers, and other stakeholders can use to fight misinformation. Nevertheless, few papers thoroughly discuss how. We document this by analysing 100 highly-cited papers, and annotating epistemic elements related to intended use, i.e., means, ends, and stakeholders. We find that narratives leaving out some of these aspects are common, that many papers propose inconsistent means and ends, and that the feasibility of suggested strategies rarely has empirical backing. We argue that this vagueness actively hinders the technology from reaching its goals, as it encourages overclaiming, limits criticism, and prevents stakeholder feedback. Accordingly, we provide several recommendations for thinking and writing about the use of fact-checking artefacts.

2022

Leveraging Wikipedia article evolution for promotional tone detection
Christine De Kock | Andreas Vlachos
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Detecting biased language is useful for a variety of applications, such as identifying hyperpartisan news sources or flagging one-sided rhetoric. In this work we introduce WikiEvolve, a dataset for document-level promotional tone detection. Unlike previously proposed datasets, WikiEvolve contains seven versions of the same article from Wikipedia, from different points in its revision history; one with promotional tone, and six without it. This allows for obtaining more precise training signal for learning models for promotional tone detection. We adapt the previously proposed gradient reversal layer framework to encode two article versions simultaneously and thus leverage this additional training signal. In our experiments, our proposed adaptation of gradient reversal improves the accuracy of four different architectures on both in-domain and out-of-domain evaluation.

Explainable Assessment of Healthcare Articles with QA
Alodie Boissonnet | Marzieh Saeidi | Vassilis Plachouras | Andreas Vlachos
Proceedings of the 21st Workshop on Biomedical Language Processing

The healthcare domain suffers from the spread of poor quality articles on the Internet. While manual efforts exist to check the quality of online healthcare articles, they are not sufficient to assess all those in circulation. Such quality assessment can be automated as a text classification task; however, explanations for the labels are necessary for the users to trust the model predictions. While current explainable systems tackle explanation generation as summarization, we propose a new approach based on question answering (QA) that allows us to generate explanations for multiple criteria using a single model. We show that this QA-based approach is competitive with the current state-of-the-art, and complements summarization-based models for explainable quality assessment. We also introduce a human evaluation protocol more appropriate than automatic metrics for the evaluation of explanation generation models.

Varifocal Question Generation for Fact-checking
Nedjma Ousidhoum | Zhangdie Yuan | Andreas Vlachos
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Fact-checking requires retrieving evidence related to a claim under investigation. The task can be formulated as question generation based on a claim, followed by question answering. However, recent question generation approaches assume that the answer is known and typically contained in a passage given as input, whereas such passages are what is being sought when verifying a claim. In this paper, we present Varifocal, a method that generates questions based on different focal points within a given claim, i.e. different spans of the claim and its metadata, such as its source and date. Our method outperforms previous work on a fact-checking question generation dataset on a wide range of automatic evaluation metrics. These results are corroborated by our manual evaluation, which indicates that our method generates more relevant and informative questions. We further demonstrate the potential of focal points in generating sets of clarification questions for product descriptions.

How to disagree well: Investigating the dispute tactics used on Wikipedia
Christine De Kock | Tom Stafford | Andreas Vlachos
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Disagreements are frequently studied from the perspective of either detecting toxicity or analysing argument structure. We propose a framework of dispute tactics which unifies these two perspectives, as well as other dialogue acts which play a role in resolving disputes, such as asking questions and providing clarification. This framework includes a preferential ordering among rebuttal-type tactics, ranging from ad hominem attacks to refuting the central argument. Using this framework, we annotate 213 disagreements (3,865 utterances) from Wikipedia Talk pages. This allows us to investigate research questions around the tactics used in disagreements; for instance, we provide empirical validation of the approach to disagreement recommended by Wikipedia. We develop models for multilabel prediction of dispute tactics in an utterance, achieving the best performance with a transformer-based label powerset model. Adding an auxiliary task to incorporate the ordering of rebuttal tactics further yields a statistically significant increase. Finally, we show that these annotations can be used to provide useful additional signals to improve performance on the task of predicting escalation.

Natural Logic-guided Autoregressive Multi-hop Document Retrieval for Fact Verification
Rami Aly | Andreas Vlachos
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

A key component of fact verification is the evidence retrieval, often from multiple documents. Recent approaches use dense representations and condition the retrieval of each document on the previously retrieved ones. The latter step is performed over all the documents in the collection, requiring storing their dense representations in an index, thus incurring a high memory footprint. An alternative paradigm is retrieve-and-rerank, where documents are retrieved using methods such as BM25, their sentences are reranked, and further documents are retrieved conditioned on these sentences, reducing the memory requirements. However, such approaches can be brittle as they rely on heuristics and assume hyperlinks between documents. We propose a novel retrieve-and-rerank method for multi-hop retrieval, that consists of a retriever that jointly scores documents in the knowledge source and sentences from previously retrieved documents using an autoregressive formulation and is guided by a proof system based on natural logic that dynamically terminates the retrieval process if the evidence is deemed sufficient. This method exceeds or is on par with the current state-of-the-art on FEVER, HoVer and FEVEROUS-S, while using 5 to 10 times less memory than competing systems. Evaluation on an adversarial dataset indicates improved stability of our approach compared to commonly deployed threshold-based methods. Finally, the proof system helps humans predict model decisions correctly more often than using the evidence alone.

Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)
Rami Aly | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)

Opening up Minds with Argumentative Dialogues
Youmna Farag | Charlotte Brand | Jacopo Amidei | Paul Piwek | Tom Stafford | Svetlana Stoyanchev | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent research on argumentative dialogues has focused on persuading people to take some action, changing their stance on the topic of discussion, or winning debates. In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. To this end, we present a dataset of 183 argumentative dialogues about 3 controversial topics: veganism, Brexit and COVID-19 vaccination. The dialogues were collected using the Wizard of Oz approach, where wizards leverage a knowledge-base of arguments to converse with participants. Open-mindedness is measured before and after engaging in the dialogue using a questionnaire from the psychology literature, and success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs. We evaluate two dialogue models: a Wikipedia-based and an argument-based model. We show that while both models perform closely in terms of opening up minds, the argument-based model is significantly better on other dialogue properties such as engagement and clarity.

Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation
Michalis Korakakis | Andreas Vlachos
Findings of the Association for Computational Linguistics: EMNLP 2022

Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT’14 and WMT’14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.

What makes you change your mind? An empirical investigation in online group decision-making conversations
Georgi Karadzhov | Tom Stafford | Andreas Vlachos
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

People leverage group discussions to collaborate in order to solve complex tasks, e.g. in project meetings or hiring panels. By doing so, they engage in a variety of conversational strategies where they try to convince each other of the best approach and ultimately reach a decision. In this work, we investigate methods for detecting what makes someone change their mind. To this end, we leverage a recently introduced dataset containing group discussions of people collaborating to solve a task. To find out what makes someone change their mind, we incorporate various techniques such as neural text classification and language-agnostic change point detection. Evaluation of these methods shows that while the task is not trivial, the best way to approach it is using a language-aware model with learning-to-rank training. Finally, we examine the cues that the models develop as indicative of the cause of a change of mind.

Proceedings of the Sixth Workshop on Structured Prediction for NLP
Andreas Vlachos | Priyanka Agrawal | André Martins | Gerasimos Lampouras | Chunchuan Lyu
Proceedings of the Sixth Workshop on Structured Prediction for NLP

A Survey on Automated Fact-Checking
Zhijiang Guo | Michael Schlichtkrull | Andreas Vlachos
Transactions of the Association for Computational Linguistics, Volume 10

Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.

ProoFVer: Natural Logic Theorem Proving for Fact Verification
Amrith Krishna | Sebastian Riedel | Andreas Vlachos
Transactions of the Association for Computational Linguistics, Volume 10

Fact verification systems typically rely on neural network classifiers for veracity prediction, which lack explainability. This paper proposes ProoFVer, which uses a seq2seq model to generate natural logic-based inferences as proofs. These proofs consist of lexical mutations between spans in the claim and the evidence retrieved, each marked with a natural logic operator. Claim veracity is determined solely based on the sequence of these operators. Hence, these proofs are faithful explanations, and this makes ProoFVer faithful by construction. Currently, ProoFVer has the highest label accuracy and the second best score in the FEVER leaderboard. Furthermore, it improves by 13.21% points over the next best model on a dataset with counterfactual instances, demonstrating its robustness. As explanations, the proofs show better overlap with human rationales than attention-based highlights and the proofs help humans predict model decisions correctly more often than using the evidence directly.

2021

Leveraging Type Descriptions for Zero-shot Named Entity Recognition and Classification
Rami Aly | Andreas Vlachos | Ryan McDonald
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A common issue in real-world applications of named entity recognition and classification (NERC) is the absence of annotated data for the target entity classes during training. Zero-shot learning approaches address this issue by learning models from classes with training data that can predict classes without it. This paper presents the first approach for zero-shot NERC, introducing novel architectures that leverage the fact that textual descriptions for many entity classes occur naturally. We address the zero-shot NERC specific challenge that the not-an-entity class is not well defined as different entity classes are considered in training and testing. For evaluation, we adapt two datasets, OntoNotes and MedMentions, emulating the difficulty of real-world zero-shot learning by testing models on the rarest entity classes. Our proposed approach outperforms baselines adapted from machine reading comprehension and zero-shot text classification. Furthermore, we assess the effect of different class descriptions for this task.

Evidence-based Factual Error Correction
James Thorne | Andreas Vlachos
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper introduces the task of factual error correction: performing edits to a claim so that the generated rewrite is better supported by evidence. This extends the well-studied task of fact verification by providing a mechanism to correct written texts that are refuted or only partially supported by evidence. We demonstrate that it is feasible to train factual error correction systems from existing fact checking datasets which only contain labeled claims accompanied by evidence, but not the correction. We achieve this by employing a two-stage distant supervision approach that incorporates evidence into masked claims when generating corrections. Our approach, based on the T5 transformer and using retrieved evidence, achieved better results than existing work which used a pointer copy network and gold evidence, producing accurate factual error corrections for 5x more instances in human evaluation and a .125 increase in SARI score. The evaluation is conducted on a dataset of 65,000 instances based on a recent fact verification shared task and we release it to enable further work on the task.

Trajectory-Based Meta-Learning for Out-Of-Vocabulary Word Embedding Learning
Gordon Buck | Andreas Vlachos
Proceedings of the Second Workshop on Domain Adaptation for NLP

Word embedding learning methods require a large number of occurrences of a word to accurately learn its embedding. However, out-of-vocabulary (OOV) words which do not appear in the training corpus emerge frequently in the smaller downstream data. Recent work formulated OOV embedding learning as a few-shot regression problem and demonstrated that meta-learning can improve results obtained. However, the algorithm used, model-agnostic meta-learning (MAML) is known to be unstable and perform worse when a large number of gradient steps are used for parameter updates. In this work, we propose the use of Leap, a meta-learning algorithm which leverages the entire trajectory of the learning process instead of just the beginning and the end points, and thus ameliorates these two issues. In our experiments on a benchmark OOV embedding learning dataset and in an extrinsic evaluation, Leap performs comparably or better than MAML. We go on to examine which contexts are most beneficial to learn an OOV embedding from, and propose that the choice of contexts may matter more than the meta-learning employed.

Elastic weight consolidation for better bias inoculation
James Thorne | Andreas Vlachos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The biases present in training datasets have been shown to affect models for sentence pair classification tasks such as natural language inference (NLI) and fact verification. While fine-tuning models on additional data has been used to mitigate them, a common issue is that of catastrophic forgetting of the original training dataset. In this paper, we show that elastic weight consolidation (EWC) allows fine-tuning of models to mitigate biases while being less susceptible to catastrophic forgetting. In our evaluation on fact verification and NLI stress tests, we show that fine-tuning with EWC dominates standard fine-tuning, yielding models with lower levels of forgetting on the original (biased) dataset for equivalent gains in accuracy on the fine-tuning (unbiased) dataset.

I Beg to Differ: A study of constructive disagreement in online conversations
Christine De Kock | Andreas Vlachos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Disagreements are pervasive in human communication. In this paper we investigate what makes disagreement constructive. To this end, we construct WikiDisputes, a corpus of 7425 Wikipedia Talk page conversations that contain content disputes, and define the task of predicting whether disagreements will be escalated to mediation by a moderator. We evaluate feature-based models with linguistic markers from previous work, and demonstrate that their performance is improved by using features that capture changes in linguistic markers throughout the conversations, as opposed to averaged values. We develop a variety of neural models and show that taking into account the structure of the conversation improves predictive accuracy, exceeding that of feature-based models. We assess our best neural model in terms of both predictive accuracy and uncertainty by evaluating its behaviour when it is only exposed to the beginning of the conversation, finding that model accuracy improves and uncertainty reduces as models are exposed to more information.

Incremental Beam Manipulation for Natural Language Generation
James Hargreaves | Andreas Vlachos | Guy Emerson
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The performance of natural language generation systems has improved substantially with modern neural networks. At test time they typically employ beam search to avoid locally optimal but globally suboptimal predictions. However, due to model errors, a larger beam size can lead to deteriorating performance according to the evaluation metric. For this reason, it is common to rerank the output of beam search, but this relies on beam search to produce a good set of hypotheses, which limits the potential gains. Other alternatives to beam search require changes to the training of the model, which restricts their applicability compared to beam search. This paper proposes incremental beam manipulation, i.e. reranking the hypotheses in the beam during decoding instead of only at the end. This way, hypotheses that are unlikely to lead to a good final output are discarded, and in their place hypotheses that would have been ignored will be considered instead. Applying incremental beam manipulation leads to an improvement of 1.93 and 5.82 BLEU points over vanilla beam search for the test sets of the E2E and WebNLG challenges respectively. The proposed method also outperformed a strong reranker by 1.04 BLEU points on the E2E challenge, while being on par with it on the WebNLG dataset.

Cross-Policy Compliance Detection via Question Answering
Marzieh Saeidi | Majid Yazdani | Andreas Vlachos
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Policy compliance detection is the task of ensuring that a scenario conforms to a policy (e.g. a claim is valid according to government rules or a post in an online platform conforms to community guidelines). This task has been previously instantiated as a form of textual entailment, which results in poor accuracy due to the complexity of the policies. In this paper we propose to address policy compliance detection via decomposing it into question answering, where questions check whether the conditions stated in the policy apply to the scenario, and an expression tree combines the answers to obtain the label. Despite the initial upfront annotation cost, we demonstrate that this approach results in better accuracy, especially in the cross-policy setup where the policies during testing are unseen in training. In addition, it allows us to use existing question answering models pre-trained on existing large datasets. Finally, it explicitly identifies the information missing from a scenario in case policy compliance cannot be determined. We conduct our experiments using a recent dataset consisting of government policies, which we augment with expert annotations and find that the cost of annotating question answering decomposition is largely offset by improved inter-annotator agreement and speed.
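
The decomposition described above can be sketched as a small expression tree over yes/no answers. The version below assumes three-valued logic (yes, no, unanswered), so that the label is left undetermined when a needed answer is missing and the missing questions can be listed. The node classes, question identifiers, and the example policy are hypothetical and only illustrate the idea; they are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Question:
    qid: str  # identifier of a yes/no question about the scenario


@dataclass
class Not:
    child: "Node"


@dataclass
class And:
    children: List["Node"]


@dataclass
class Or:
    children: List["Node"]


Node = Union[Question, Not, And, Or]


def evaluate(node: Node, answers: Dict[str, bool]) -> Optional[bool]:
    """Return True/False when the answers determine the result, else None."""
    if isinstance(node, Question):
        return answers.get(node.qid)  # None when the question is unanswered
    if isinstance(node, Not):
        value = evaluate(node.child, answers)
        return None if value is None else not value
    values = [evaluate(child, answers) for child in node.children]
    if isinstance(node, And):
        if any(v is False for v in values):
            return False
        return None if any(v is None for v in values) else True
    if isinstance(node, Or):
        if any(v is True for v in values):
            return True
        return None if any(v is None for v in values) else False
    raise TypeError(f"unknown node type: {type(node).__name__}")


def missing_questions(node: Node, answers: Dict[str, bool]) -> List[str]:
    """List unanswered questions that sit in still-undetermined subtrees."""
    if evaluate(node, answers) is not None:
        return []
    if isinstance(node, Question):
        return [node.qid]
    if isinstance(node, Not):
        return missing_questions(node.child, answers)
    found: List[str] = []
    for child in node.children:
        found.extend(missing_questions(child, answers))
    return found


# Hypothetical policy: resident AND (over 65 OR has a disability).
policy = And([Question("is_resident"),
              Or([Question("over_65"), Question("has_disability")])])
answers = {"is_resident": True, "over_65": False}
print(evaluate(policy, answers))           # undetermined
print(missing_questions(policy, answers))  # which answer is still needed
```

Here the answers would come from a question answering model; separating them from the tree is what lets the same model be reused across policies it has not seen.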

Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER)
Rami Aly | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER)

The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task
Rami Aly | Zhijiang Guo | Michael Sejr Schlichtkrull | James Thorne | Andreas Vlachos | Christos Christodoulopoulos | Oana Cocarascu | Arpit Mittal
Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER)

The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) shared task asks participating systems to determine whether human-authored claims are Supported or Refuted based on evidence retrieved from Wikipedia (or NotEnoughInfo if the claim cannot be verified). Compared to the FEVER 2018 shared task, the main challenge is the addition of structured data (tables and lists) as a source of evidence. The claims in the FEVEROUS dataset can be verified using only structured evidence, only unstructured evidence, or a mixture of both. Submissions are evaluated using the FEVEROUS score that combines label accuracy and evidence retrieval. Unlike FEVER 2018, FEVEROUS requires partial evidence to be returned for NotEnoughInfo claims, and the claims are longer and thus more complex. The shared task received 13 entries, six of which were able to beat the baseline system. The winning team was “Bust a move!”, achieving a FEVEROUS score of 27% (+9% compared to the baseline). In this paper we describe the shared task, present the full results and highlight commonalities and innovations among the participating systems.

Survival text regression for time-to-event prediction in conversations
Christine De Kock | Andreas Vlachos
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021)
Zornitsa Kozareva | Sujith Ravi | Andreas Vlachos | Priyanka Agrawal | André Martins
Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021)

2020

Generating Fact Checking Briefs
Angela Fan | Aleksandra Piktus | Fabio Petroni | Guillaume Wenzek | Marzieh Saeidi | Andreas Vlachos | Antoine Bordes | Sebastian Riedel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Fact checking at scale is difficult—while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing information about the claim before performing the check, in the form of natural language briefs. We investigate passage-based briefs, containing a relevant passage from Wikipedia, entity-centric ones consisting of Wikipedia pages of mentioned entities, and Question-Answering Briefs, with questions decomposing the claim, and their answers. To produce QABriefs, we develop QABriefer, a model that generates a set of questions conditioned on the claim, searches the web for evidence, and generates answers. To train its components, we introduce QABriefDataset. We show that fact checking with briefs — in particular QABriefs — increases the accuracy of crowdworkers by 10% while slightly decreasing the time taken. For volunteer (unpaid) fact checkers, QABriefs slightly increase accuracy and reduce the time required by around 20%.

Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)
Christos Christodoulopoulos | James Thorne | Andreas Vlachos | Oana Cocarascu | Arpit Mittal
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)

Proceedings of the Fourth Workshop on Structured Prediction for NLP
Priyanka Agrawal | Zornitsa Kozareva | Julia Kreutzer | Gerasimos Lampouras | André Martins | Sujith Ravi | Andreas Vlachos
Proceedings of the Fourth Workshop on Structured Prediction for NLP

2019

Neural Generative Rhetorical Structure Parsing
Amandla Mabona | Laura Rimell | Stephen Clark | Andreas Vlachos
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Rhetorical structure trees have been shown to be useful for several document-level tasks including summarization and document classification. Previous approaches to RST parsing have used discriminative models; however, these are less sample efficient than generative models, and RST parsing datasets are typically small. In this paper, we present the first generative model for RST parsing. Our model is a document-level RNN grammar (RNNG) with a bottom-up traversal order. We show that, for our parser’s traversal order, previous beam search algorithms for RNNGs have a left-branching bias which is ill-suited for RST parsing. We develop a novel beam search algorithm that keeps track of both structure- and word-generating actions without exhibiting this branching bias and results in absolute improvements of 6.8 and 2.9 on unlabelled and labelled F1 over previous algorithms. Overall, our generative model outperforms a discriminative model with the same features by 2.6 F1 points and achieves performance comparable to the state-of-the-art, outperforming all published parsers from a recent replication study that do not use additional training data.

Evaluating adversarial attacks against multiple fact verification systems
James Thorne | Andreas Vlachos | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Automated fact verification has been progressing owing to advancements in modeling and availability of large datasets. Due to the nature of the task, it is critical to understand the vulnerabilities of these systems against adversarial instances designed to make them predict incorrectly. We introduce two novel scoring metrics, attack potency and system resilience, which take into account the correctness of the adversarial instances, an aspect often ignored in adversarial evaluations. We consider six fact verification systems from the recent Fact Extraction and VERification (FEVER) challenge: the four best-scoring ones and two baselines. We evaluate adversarial instances generated by a recently proposed state-of-the-art method, a paraphrasing method, and rule-based attacks devised for fact verification. We find that our rule-based attacks have higher potency, and that while the rankings among the top systems changed, they exhibited higher resilience than the baselines.

Incorporating Label Dependencies in Multilabel Stance Detection
William Ferreira | Andreas Vlachos
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Stance detection in social media is a well-studied task in a variety of domains. Nevertheless, previous work has mostly focused on multiclass versions of the problem, where the labels are mutually exclusive, and typically positive, negative or neutral. In this paper, we address versions of the task in which an utterance can have multiple labels, thus corresponding to multilabel classification. We propose a method that explicitly incorporates label dependencies in the training objective and compare it against a variety of baselines, as well as a reduction of multilabel to multiclass learning. In experiments with three datasets, we find that our proposed method improves upon all baselines on two out of three datasets. We also show that the reduction of multilabel to multiclass classification can be very competitive, especially in cases where the output consists of a small number of labels and one can enumerate over all label combinations.
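
The reduction of multilabel to multiclass learning mentioned above can be illustrated by enumerating every label combination and treating each combination as one class. The sketch below is a generic version of that reduction under the assumption of a small label set; the label names and helper functions are hypothetical and not taken from the paper or its datasets.

```python
from itertools import product


def powerset_classes(labels):
    """Enumerate every label combination as a tuple of 0/1 flags (one class each)."""
    return list(product((0, 1), repeat=len(labels)))


def encode(label_set, labels, classes):
    """Map a set of labels to the index of its combination."""
    flags = tuple(int(label in label_set) for label in labels)
    return classes.index(flags)


def decode(class_id, labels, classes):
    """Map a class index back to the set of labels it stands for."""
    return {label for label, flag in zip(labels, classes[class_id]) if flag}


# Hypothetical stance labels; with 3 labels there are 2**3 = 8 classes.
labels = ["support", "deny", "query"]
classes = powerset_classes(labels)
cid = encode({"support", "query"}, labels, classes)
print(len(classes), cid, decode(cid, labels, classes))
```

The number of classes doubles with every added label, which is consistent with the abstract's remark that the reduction is most competitive when the output consists of a small number of labels.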

Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)
James Thorne | Andreas Vlachos | Oana Cocarascu | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)

The FEVER2.0 Shared Task
James Thorne | Andreas Vlachos | Oana Cocarascu | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)

We present the results of the second Fact Extraction and VERification (FEVER2.0) Shared Task. The task challenged participants to both build systems to verify factoid claims using evidence retrieved from Wikipedia and to generate adversarial attacks against other participants’ systems. The shared task had three phases: building, breaking and fixing. There were 8 systems in the builder’s round, three of which were new qualifying submissions for this shared task; 5 adversaries generated instances designed to induce classification errors, and one builder submitted a fixed system which had a higher FEVER score and resilience than their first submission. All but one of the newly submitted systems attained FEVER scores higher than the best performing system from the first shared task and, under adversarial evaluation, all systems exhibited losses in FEVER score. There was a great variety in adversarial attack types as well as the techniques used to generate the attacks. In this paper, we present the results of the shared task and a summary of the systems, highlighting commonalities and innovations among participating systems.

Generating Token-Level Explanations for Natural Language Inference
James Thorne | Andreas Vlachos | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The task of Natural Language Inference (NLI) is widely modeled as supervised sentence pair classification. While there has been a lot of work recently on generating explanations of the predictions of classifiers on a single piece of text, there have been no attempts to generate explanations of classifiers operating on pairs of sentences. In this paper, we show that it is possible to generate token-level explanations for NLI without the need for training data explicitly annotated for this purpose. We use a simple LSTM architecture and evaluate both LIME and Anchor explanations for this task. We compare these to a Multiple Instance Learning (MIL) method that uses thresholded attention to make token-level predictions. The approach we present in this paper is a novel extension of zero-shot single-sentence tagging to sentence pairs for NLI. We conduct our experiments on the well-studied SNLI dataset that was recently augmented with manual annotation of the tokens that explain the entailment relation. We find that our white-box MIL-based method, while orders of magnitude faster, does not reach the same accuracy as the black-box methods.

Strong Baselines for Complex Word Identification across Multiple Languages
Pierre Finnimore | Elisabeth Fritzsch | Daniel King | Alison Sneyd | Aneeq Ur Rehman | Fernando Alva-Manchego | Andreas Vlachos
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Complex Word Identification (CWI) is the task of identifying which words or phrases in a sentence are difficult to understand by a target audience. The latest CWI Shared Task released data for two settings: monolingual (i.e. train and test in the same language) and cross-lingual (i.e. test in a language not seen during training). The best monolingual models relied on language-dependent features, which do not generalise in the cross-lingual setting, while the best cross-lingual model used neural networks with multi-task learning. In this paper, we present monolingual and cross-lingual CWI models that perform as well as (or better than) most models submitted to the latest CWI Shared Task. We show that carefully selected features and simple learning models can achieve state-of-the-art performance, and result in strong baselines for future development in this area. Finally, we discuss how inconsistencies in the annotation of the data can explain some of the results obtained.

HighRES: Highlight-based Reference-less Evaluation of Summarization
Hardy | Shashi Narayan | Andreas Vlachos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

There has been substantial progress in summarization research enabled by the availability of novel, often large-scale, datasets and recent advances on neural network-based approaches. However, manual evaluation of the system generated summaries is inconsistent due to the difficulty the task poses to human non-expert readers. To address this issue, we propose a novel approach for manual evaluation, Highlight-based Reference-less Evaluation of Summarization (HighRES), in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter. Thus summary assessment on the source document by human judges is facilitated, while the highlights can be used for evaluating multiple systems. To validate our approach we employ crowd-workers to augment with highlights a recently proposed dataset and compare two state-of-the-art systems. We demonstrate that HighRES improves inter-annotator agreement in comparison to using the source document directly, while they help emphasize differences among systems that would be ignored under other evaluation approaches.

Merge and Label: A Novel Neural Network Architecture for Nested NER
Joseph Fisher | Andreas Vlachos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Named entity recognition (NER) is one of the best studied tasks in natural language processing. However, most approaches are not capable of handling nested structures which are common in many applications. In this paper we introduce a novel neural network architecture that first merges tokens and/or entities into entities forming nested structures, and then labels each of them independently. Unlike previous work, our merge and label approach predicts real-valued instead of discrete segmentation structures, which allow it to combine word and nested entity embeddings while maintaining differentiability. We evaluate our approach using the ACE 2005 Corpus, where it achieves state-of-the-art F1 of 74.6, further improved with contextual embeddings (BERT) to 82.4, an overall improvement of close to 8 F1 points over previous approaches trained on the same data. Additionally we compare it against BiLSTM-CRFs, the dominant approach for flat NER structures, demonstrating that its ability to predict nested structures does not impact performance in simpler cases.

Model-Agnostic Meta-Learning for Relation Classification with Limited Supervision
Abiola Obamuyide | Andreas Vlachos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper we frame the task of supervised relation classification as an instance of meta-learning. We propose a model-agnostic meta-learning protocol for training relation classifiers to achieve enhanced predictive performance in limited supervision settings. During training, we aim to not only learn good parameters for classifying relations with sufficient supervision, but also learn model parameters that can be fine-tuned to enhance predictive performance for relations with limited supervision. In experiments conducted on two relation classification datasets, we demonstrate that the proposed meta-learning approach improves the predictive performance of two state-of-the-art supervised relation classification models.

Proceedings of the Third Workshop on Structured Prediction for NLP
Andre Martins | Andreas Vlachos | Zornitsa Kozareva | Sujith Ravi | Gerasimos Lampouras | Vlad Niculae | Julia Kreutzer
Proceedings of the Third Workshop on Structured Prediction for NLP

Meta-Learning Improves Lifelong Relation Extraction
Abiola Obamuyide | Andreas Vlachos
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Most existing relation extraction models assume a fixed set of relations and are unable to adapt to exploit newly available supervision data to extract new relations. In order to alleviate such problems, there is the need to develop approaches that make relation extraction models capable of continuous adaptation and learning. We investigate and present results for such an approach, based on a combination of ideas from lifelong learning and optimization-based meta-learning. We evaluate the proposed approach on two recent lifelong relation extraction benchmarks, and demonstrate that it markedly outperforms current state-of-the-art approaches.

2018

Topic or Style? Exploring the Most Useful Features for Authorship Attribution
Yunita Sari | Mark Stevenson | Andreas Vlachos
Proceedings of the 27th International Conference on Computational Linguistics

Approaches to authorship attribution, the task of identifying the author of a document, are based on analysis of individuals’ writing style and/or preferred topics. Although the problem has been widely explored, no previous studies have analysed the relationship between dataset characteristics and effectiveness of different types of features. This study carries out an analysis of four widely used datasets to explore how different types of features affect authorship attribution accuracy under varying conditions. The results of the analysis are applied to authorship attribution models based on both discrete and continuous representations. We apply the conclusions from our analysis to an extension of an existing approach to authorship attribution and outperform the prior state-of-the-art on two out of the four datasets used.

Automated Fact Checking: Task Formulations, Methods and Future Directions
James Thorne | Andreas Vlachos
Proceedings of the 27th International Conference on Computational Linguistics

The recently increased focus on misinformation has stimulated research in fact checking, the task of assessing the truthfulness of a claim. Research in automating this task has been conducted in a variety of disciplines including natural language processing, machine learning, knowledge representation, databases, and journalism. While there has been substantial progress, relevant papers and articles have been published in research communities that are often unaware of each other and use inconsistent terminology, thus impeding understanding and further progress. In this paper we survey automated fact checking research stemming from natural language processing and related disciplines, unifying the task formulations and methodologies across papers and authors. Furthermore, we highlight the use of evidence as an important distinguishing factor among them cutting across task formulations and methods. We conclude with proposing avenues for future NLP research on automated fact checking.

Guided Neural Language Generation for Abstractive Summarization using Abstract Meaning Representation
Hardy | Andreas Vlachos
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent work on abstractive summarization has made progress with neural encoder-decoder architectures. However, such models are often challenged due to their lack of explicit semantic modeling of the source document and its summary. In this paper, we extend previous work on abstractive summarization using Abstract Meaning Representation (AMR) with a neural language generation stage which we guide using the source document. We demonstrate that this guidance improves summarization results by 7.4 and 10.5 points in ROUGE-2 using gold standard AMR parses and parses obtained from an off-the-shelf parser respectively. We also find that the summarization performance on the latter parses is 2 ROUGE-2 points higher than that of a well-established neural encoder-decoder approach trained on a larger dataset.

FEVER: a Large-scale Dataset for Fact Extraction and VERification
James Thorne | Andreas Vlachos | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss kappa. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.

Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)
James Thorne | Andreas Vlachos | Oana Cocarascu | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

The Fact Extraction and VERification (FEVER) Shared Task
James Thorne | Andreas Vlachos | Oana Cocarascu | Christos Christodoulopoulos | Arpit Mittal
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

We present the results of the first Fact Extraction and VERification (FEVER) Shared Task. The task challenged participants to classify whether human-written factoid claims could be SUPPORTED or REFUTED using evidence retrieved from Wikipedia. We received entries from 23 competing teams, 19 of which scored higher than the previously published baseline. The best performing system achieved a FEVER score of 64.21%. In this paper, we present the results of the shared task and a summary of the systems, highlighting commonalities and innovations among participating systems.

Zero-shot Relation Classification as Textual Entailment
Abiola Obamuyide | Andreas Vlachos
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

We consider the task of relation classification, and pose this task as one of textual entailment. We show that this formulation leads to several advantages, including the ability to (i) perform zero-shot relation classification by exploiting relation descriptions, (ii) utilize existing textual entailment models, and (iii) leverage readily available textual entailment datasets, to enhance the performance of relation classification systems. Our experiments show that the proposed approach achieves 20.16% and 61.32% in F1 zero-shot classification performance on two datasets, which further improved to 22.80% and 64.78% respectively with the use of conditional encoding.

2017

Continuous N-gram Representations for Authorship Attribution
Yunita Sari | Andreas Vlachos | Mark Stevenson
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state-of-the-art on two datasets, while producing comparable results on the remaining two.

An Extensible Framework for Verification of Numerical Claims
James Thorne | Andreas Vlachos
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

In this paper we present our automated fact checking system demonstration which we developed in order to participate in the Fast and Furious Fact Check challenge. We focused on simple numerical claims such as “population of Germany in 2015 was 80 million” which comprised a quarter of the test instances in the challenge, achieving 68% accuracy. Our system extends previous work on semantic parsing and claim identification to handle temporal expressions and knowledge bases consisting of multiple tables, while relying solely on automatically generated training data. We demonstrate the extensible nature of our system by evaluating it on relations used in previous work. We make our system publicly available so that it can be used and extended by the community.

The SUMMA Platform Prototype
Renars Liepins | Ulrich Germann | Guntis Barzdins | Alexandra Birch | Steve Renals | Susanne Weber | Peggy van der Kreeft | Hervé Bourlard | João Prieto | Ondřej Klejch | Peter Bell | Alexandros Lazaridis | Alfonso Mendes | Sebastian Riedel | Mariana S. C. Almeida | Pedro Balage | Shay B. Cohen | Tomasz Dwojak | Philip N. Garner | Andreas Giefer | Marcin Junczys-Dowmunt | Hina Imran | David Nogueira | Ahmed Ali | Sebastião Miranda | Andrei Popescu-Belis | Lesly Miculicich Werlen | Nikos Papasarantopoulos | Abiola Obamuyide | Clive Jones | Fahim Dalvi | Andreas Vlachos | Yang Wang | Sibo Tong | Rico Sennrich | Nikolaos Pappas | Shashi Narayan | Marco Damonte | Nadir Durrani | Sameer Khurana | Ahmed Abdelali | Hassan Sajjad | Stephan Vogel | David Sheppey | Chris Hernon | Jeff Mitchell
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

Imitation learning for structured prediction in natural language processing
Andreas Vlachos | Gerasimos Lampouras | Sebastian Riedel
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Imitation learning is a learning paradigm originally developed to learn robotic controllers from demonstrations by humans, e.g. autonomous flight from pilot demonstrations. Recently, algorithms for structured prediction were proposed under this paradigm and have been applied successfully to a number of tasks including syntactic dependency parsing, information extraction, coreference resolution, dynamic feature selection, semantic parsing and natural language generation. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. Our aim in this tutorial is to have a unified presentation of the various imitation algorithms for structured prediction, and show how they can be applied to a variety of NLP tasks. All material associated with the tutorial will be made available through https://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/.

Sheffield at SemEval-2017 Task 9: Transition-based language generation from AMR.
Gerasimos Lampouras | Andreas Vlachos
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes the submission by the University of Sheffield to the SemEval 2017 Abstract Meaning Representation Parsing and Generation task (SemEval 2017 Task 9, Subtask 2). We cast language generation from AMR as a sequence of actions (e.g., insert/remove/rename edges and nodes) that progressively transform the AMR graph into a dependency parse tree. This transition-based approach relies on the fact that an AMR graph can be considered structurally similar to a dependency tree, with a focus on content rather than function words. An added benefit to this approach is the greater amount of data we can take advantage of to train the parse-to-text linearizer. Our submitted run on the test data achieved a BLEU score of 3.32 and a Trueskill score of -22.04 on automatic and human evaluation respectively.

Fake news stance detection using stacked ensemble of classifiers
James Thorne | Mingjie Chen | Giorgos Myrianthous | Jiashu Pu | Xiaoxuan Wang | Andreas Vlachos
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Fake news has become a hotly debated topic in journalism. In this paper, we present our entry to the 2017 Fake News Challenge, which models the detection of fake news as a stance classification task. Our entry, which finished in 11th place on the leaderboard, is an ensemble system of classifiers developed by students in the context of their coursework. We show how we used the stacking ensemble method for this purpose and obtained improvements in classification accuracy exceeding each of the individual models’ performance on the development data. Finally, we discuss aspects of the experimental setup of the challenge.

2016

Imitation learning for language generation from unaligned data
Gerasimos Lampouras | Andreas Vlachos
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Natural language generation (NLG) is the task of generating natural language from a meaning representation. Current rule-based approaches require domain-specific and manually constructed linguistic resources, while most machine-learning based approaches rely on aligned training data and/or phrase templates. The latter are needed to restrict the search space for the structured prediction task defined by the unaligned datasets. In this work we propose the use of imitation learning for structured prediction which learns an incremental model that handles the large search space by avoiding explicit enumeration of the outputs. We focus on the Locally Optimal Learning to Search framework which allows us to train against non-decomposable loss functions such as the BLEU or ROUGE scores while not assuming gold standard alignments. We evaluate our approach on three datasets using both automatic measures and human judgements and achieve results comparable to the state-of-the-art approaches developed for each of them.

Stance Detection with Bidirectional Conditional Encoding
Isabelle Augenstein | Tim Rocktäschel | Andreas Vlachos | Kalina Bontcheva
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Timeline extraction using distant supervision and joint inference
Savelie Cornegruta | Andreas Vlachos
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Emergent: a novel data-set for stance classification
William Ferreira | Andreas Vlachos
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing
James Goodman | Andreas Vlachos | Jason Naradowsky
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

USFD at SemEval-2016 Task 6: Any-Target Stance Detection on Twitter with Autoencoders
Isabelle Augenstein | Andreas Vlachos | Kalina Bontcheva
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

UCL+Sheffield at SemEval-2016 Task 8: Imitation learning for AMR parsing with an alpha-bound
James Goodman | Andreas Vlachos | Jason Naradowsky
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

SHEF-MIME: Word-level Quality Estimation Using Imitation Learning
Daniel Beck | Andreas Vlachos | Gustavo Paetzold | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Extracting Relations between Non-Standard Entities using Distant Supervision and Imitation Learning
Isabelle Augenstein | Andreas Vlachos | Diana Maynard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

A Strong Lexical Matching Method for the Machine Comprehension Test
Ellery Smith | Nicola Greco | Matko Bošnjak | Andreas Vlachos
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Identification and Verification of Simple Claims about Statistical Properties
Andreas Vlachos | Sebastian Riedel
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Dependency Recurrent Neural Language Models for Sentence Completion
Piotr Mirowski | Andreas Vlachos
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Matrix and Tensor Factorization Methods for Natural Language Processing
Guillaume Bouchard | Jason Naradowsky | Sebastian Riedel | Tim Rocktäschel | Andreas Vlachos
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts

Using word embedding for bio-event extraction
Chen Li | Runqing Song | Maria Liakata | Andreas Vlachos | Stephanie Seneff | Xiangrong Zhang
Proceedings of BioNLP 15

2014

A New Corpus and Imitation Learning Framework for Context-Dependent Semantic Parsing
Andreas Vlachos | Stephen Clark
Transactions of the Association for Computational Linguistics, Volume 2

Semantic parsing is the task of translating natural language utterances into a machine-interpretable meaning representation. Most approaches to this task have been evaluated on a small number of existing corpora which assume that all utterances must be interpreted according to a database and typically ignore context. In this paper we present a new, publicly available corpus for context-dependent semantic parsing. The MRL used for the annotation was designed to support a portable, interactive tourist information system. We develop a semantic parser for this corpus by adapting the imitation learning algorithm DAgger without requiring alignment information during training. DAgger improves upon independently trained classifiers by 9.0 and 4.8 points in F-score on the development and test sets respectively.

Proceedings of the EACL 2014 Workshop on Dialogue in Motion
Tiphaine Dalmas | Jana Götze | Joakim Gustafson | Srinivasan Janarthanam | Jan Kleindienst | Christian Mueller | Amanda Stent | Andreas Vlachos
Proceedings of the EACL 2014 Workshop on Dialogue in Motion

Fact Checking: Task definition and dataset construction
Andreas Vlachos | Sebastian Riedel
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science

Application-Driven Relation Extraction with Limited Distant Supervision
Andreas Vlachos | Stephen Clark
Proceedings of the First AHA!-Workshop on Information Discovery in Text

2013

Dependency Language Models for Sentence Completion
Joseph Gubbins | Andreas Vlachos
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Semantic Parsing as Machine Translation
Jacob Andreas | Andreas Vlachos | Stephen Clark
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Search-based Structured Prediction applied to Biomedical Event Extraction
Andreas Vlachos | Mark Craven
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

Biomedical Event Extraction from Abstracts and Full Papers using Search-based Structured Prediction
Andreas Vlachos | Mark Craven
Proceedings of BioNLP Shared Task 2011 Workshop

Evaluating unsupervised learning for natural language processing tasks
Andreas Vlachos
Proceedings of the First workshop on Unsupervised Learning in NLP

2010

Two Strong Baselines for the BioNLP 2009 Event Extraction Task
Andreas Vlachos
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

Active Learning for Constrained Dirichlet Process Mixture Models
Andreas Vlachos | Zoubin Ghahramani | Ted Briscoe
Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics

Detecting Speculative Language Using Syntactic Dependencies and Logistic Regression
Andreas Vlachos | Mark Craven
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

2009

The infinite HMM for unsupervised PoS tagging
Jurgen Van Gael | Andreas Vlachos | Zoubin Ghahramani
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Unsupervised and Constrained Dirichlet Process Mixture Models for Verb Clustering
Andreas Vlachos | Anna Korhonen | Zoubin Ghahramani
Proceedings of the Workshop on Geometrical Models of Natural Language Semantics

Biomedical Event Extraction without Training Data
Andreas Vlachos | Paula Buttery | Diarmuid Ó Séaghdha | Ted Briscoe
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

2007

Evaluating and combining biomedical named entity recognition systems
Andreas Vlachos
Biological, translational, and clinical language processing

2006

Active Annotation
Andreas Vlachos
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain
Andreas Vlachos | Caroline Gasperin
Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology

Co-authors

Oana Cocarascu 13

Zhijiang Guo 10

Sebastian Riedel 7

Mubashara Akhtar 6

Gerasimos Lampouras 6

Isabelle Augenstein 5

Stephen Clark 4

Michalis Korakakis 4

André F. T. Martins 4

Abiola Obamuyide 4

Marzieh Saeidi 4

Chenxi Whitehouse 4

Christine de Kock 4

Priyanka Agrawal 3

Zoubin Ghahramani 3

Zornitsa Kozareva 3

Jason Naradowsky 3

Nedjma Ousidhoum 3

Kalina Bontcheva 2

William Ferreira 2

James Goodman 2

Thomas Hofmann 2

Georgi Karadzhov 2

Julia Kreutzer 2

Clara Meister 2

Shashi Narayan 2

Tiago Pimentel 2

Tim Rocktäschel 2

Ieva Staliūnaitė 2

Mark Stevenson 2

Adrian Weller 2

Ahmed Abdelali 1

Mariana S. C. Almeida 1

Fernando Alva-Manchego 1

Jacopo Amidei 1

Jacob Andreas 1

Pedro Balage Filho 1

Guntis Barzdins 1

Alexandra Birch 1

Alodie Boissonnet 1

Antoine Bordes 1

Matko Bosnjak 1

Guillaume Bouchard 1

Hervé Bourlard 1

Charlotte Brand 1

Paula Buttery 1

Shay B. Cohen 1

Nigel Collier 1

Savelie Cornegruta 1

Tiphaine Dalmas 1

Marco Damonte 1

Andrew Dudfield 1

Nadir Durrani 1

Tomasz Dwojak 1

Pierre Finnimore 1

Joseph Fisher 1

Elisabeth Fritzsch 1

Philip N. Garner 1

Caroline Gasperin 1

Ulrich Germann 1

Andreas Giefer 1

Joseph Gubbins 1

Iryna Gurevych 1

Joakim Gustafson 1

James Hargreaves 1

Srinivasan Janarthanam 1

Marcin Junczys-Dowmunt 1

Sameer Khurana 1

Jan Kleindienst 1

Ondřej Klejch 1

Anna Korhonen 1

Amrith Krishna 1

Alexandros Lazaridis 1

Maria Liakata 1

Renārs Liepins 1

Chunchuan Lyu 1

Amandla Mabona 1

Diana Maynard 1

Ryan McDonald 1

Alfonso Mendes 1

Lesly Miculicich Werlen 1

Sebastião Miranda 1

Piotr Mirowski 1

Jeff Mitchell 1

Christian Mueller 1

Giorgos Myrianthous 1

David Nogueira 1

Gustavo Paetzold 1

Nikos Papasarantopoulos 1

Nikolaos Pappas 1

Fabio Petroni 1

Aleksandra Piktus 1

Vassilis Plachouras 1

Andrei Popescu-Belis 1

Valentina Pyatkin 1

Hassan Sajjad 1

Joshua Salisbury 1

Stephanie Seneff 1

Rico Sennrich 1

David Sheppey 1

Elena Simperl 1

Ieva Raminta Staliūnaitė 1

Svetlana Stoyanchev 1

Matthieu Tehenan 1

Aneeq Ur Rehman 1

Gisela Vallejo 1

Jurgen Van Gael 1

Stephan Vogel 1

Xiaoxuan Wang 1

Susanne Weber 1

Guillaume Wenzek 1

Majid Yazdani 1

Zhangdie Yuan 1

Xiangrong Zhang 1

Vilém Zouhar 1

Peggy van der Kreeft 1

Diarmuid Ó Séaghdha 1

Venues

NLPerspectives 1