Helen Yannakoudakis - ACL Anthology

Helen Yannakoudakis

2026

KidsArtBench: Multi-Dimensional Children’s Art Evaluation with Attribute-Aware MLLMs
Mingrui Ye | Chanjin Zheng | Zengyi Yu | Chenyu Xiang | Zhixue Zhao | Zheng Yuan | Helen Yannakoudakis
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal Large Language Models (MLLMs) show progress across many visual–language tasks; however, their capacity to evaluate artistic expression remains limited: aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children’s artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children’s artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach – where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric – with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. Our results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.

2025

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch | Constantin Niko Weisser | Severin Field | Helen Yannakoudakis | Stephen Casper
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.

Few-Shot Open-Set Classification via Reasoning-Aware Decomposition
Avyav Kumar Singh | Helen Yannakoudakis
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) excel at few-shot learning, but their ability to reject out-of-distribution examples remains under-explored. We study this challenge under the setting of few-shot open-set classification, where a model must not only classify examples from a small set of seen classes but also reject unseen ones at inference time. This setting is more realistic and challenging than traditional closed-set supervised learning, requiring both fine-grained classification and robust rejection. We show that, for small LLMs, neither chain-of-thought (CoT) prompting nor supervised fine-tuning (SFT) alone is sufficient to generalise reliably, particularly when class semantics are anonymised. We introduce Wasserstein GFN (W-GFN), a novel amortised Generative Flow Network framework that uses latent trajectories to approximate the Bayesian posterior. With as few as 4 examples per class, W-GFN substantially improves performance, enabling Llama 3.2 3B to achieve up to ≥80% of the performance of Llama 3.3 70B on complex datasets, despite being ∼23 times smaller, which highlights the importance of reasoning-aware approaches for robust open-set few-shot learning.

A Survey of Cognitive Distortion Detection and Classification in NLP
Archie Sage | Jeroen Keppens | Helen Yannakoudakis
Findings of the Association for Computational Linguistics: EMNLP 2025

As interest grows in applying natural language processing (NLP) techniques to mental health, an expanding body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world. Identifying and addressing them is a central goal of therapy. Despite this momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices limiting comparability across studies. This survey presents the first comprehensive review of 38 studies spanning two decades, mapping how CDs have been implemented in computational research and evaluating the methods applied. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight persistent challenges to support more coherent and reproducible research. Alongside our review, we introduce practical resources, including curated evaluation metrics from surveyed papers, a standardised datasheet template, and an ethics flowchart, available online.

2024

Logging Keystrokes in Writing by English Learners
Georgios Velentzas | Andrew Caines | Rita Borgo | Erin Pacquetet | Clive Hamilton | Taylor Arnold | Diane Nicholls | Paula Buttery | Thomas Gaillat | Nicolas Ballier | Helen Yannakoudakis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Essay writing is a skill commonly taught and practised in schools. The ability to write a fluent and persuasive essay is often a major component of formal assessment. In natural language processing and education technology we may work with essays in their final form, for example to carry out automated assessment or grammatical error correction. In this work we collect and analyse data representing the essay writing process from start to finish, by recording every keystroke from multiple writers participating in our study. We describe our data collection methodology, the characteristics of the resulting dataset, and the assignment of proficiency levels to the texts. We discuss the ways the keystroke data can be used – for instance seeking to identify patterns in the keystrokes which might act as features in automated assessment or may enable further advancements in writing assistance – and the writing support technology which could be built with such information, if we can detect when writers are struggling to compose a section of their essay and offer appropriate intervention. We frame this work in the context of English language learning, but we note that keystroke logging is relevant more broadly to text authoring scenarios as well as cognitive or linguistic analyses of the writing process.

Prompting open-source and commercial language models for grammatical error correction of English learner text
Christopher Davis | Andrew Caines | Øistein E. Andersen | Shiva Taslimipoor | Helen Yannakoudakis | Zheng Yuan | Christopher Bryant | Marek Rei | Paula Buttery
Findings of the Association for Computational Linguistics: ACL 2024

Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results against individual error types. Our results indicate that LLMs do not always outperform supervised English GEC models except in specific contexts – namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal edit benchmarks, and that in some settings zero-shot prompting is just as competitive as few-shot prompting.

A (More) Realistic Evaluation Setup for Generalisation of Community Models on Malicious Content Detection
Ivo Verhoeven | Pushkar Mishra | Rahel Beloch | Helen Yannakoudakis | Ekaterina Shutova
Findings of the Association for Computational Linguistics: NAACL 2024

Community models for malicious content detection, which take into account the context from a social graph alongside the content itself, have shown remarkable performance on benchmark datasets. Yet, misinformation and hate speech continue to propagate on social media networks. This mismatch can be partially attributed to the limitations of current evaluation setups that neglect the rapid evolution of online content and the underlying social graph. In this paper, we propose a novel evaluation setup for model generalisation based on our few-shot subgraph sampling approach. This setup tests for generalisation through few labelled examples in local explorations of a larger graph, emulating more realistic application settings. We show this to be a challenging inductive setup, wherein strong performance on the training graph is not indicative of performance on unseen tasks, domains, or graph structures. Lastly, we show that graph meta-learners trained with our proposed few-shot subgraph sampling outperform standard community models in the inductive setup.

Learning New Tasks from a Few Examples with Soft-Label Prototypes
Avyav Singh | Ekaterina Shutova | Helen Yannakoudakis
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

Existing approaches to few-shot learning in NLP rely on large language models (LLMs) and/or fine-tuning of these to generalise on out-of-distribution data. In this work, we propose a novel few-shot learning approach based on soft-label prototypes (SLPs) designed to collectively capture the distribution of different classes across the input domain space. We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per class and experimentally demonstrate that our approach achieves superior performance on the majority of tested tasks in this data-lean setting while being highly parameter efficient. We also show that our few-shot adaptation method can be integrated into more generalised learning settings, primarily meta-learning, to yield superior performance against strong baselines.

2023

CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
Zhi Zhang | Helen Yannakoudakis | Xiantong Zhen | Ekaterina Shutova
Findings of the Association for Computational Linguistics: EACL 2023

The task of multimodal referring expression comprehension (REC), aiming at localizing an image region described by a natural language expression, has recently received increasing attention within the research community. In this paper, we specifically focus on referring expression comprehension with commonsense knowledge (KB-Ref), a task which typically requires reasoning beyond spatial, visual or semantic information. We propose a novel framework for Commonsense Knowledge Enhanced Transformers (CK-Transformer) which effectively integrates commonsense knowledge into the representations of objects in an image, facilitating identification of the target objects referred to by the expressions. We conduct extensive experiments on several benchmarks for the task of KB-Ref. Our results show that the proposed CK-Transformer achieves a new state of the art, with an absolute improvement of 3.14% accuracy over the existing state of the art.

K-hop neighbourhood regularization for few-shot learning on graphs: A case study of text classification
Niels van der Heijden | Ekaterina Shutova | Helen Yannakoudakis
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is ∼7 times less dense and yields better performance in low-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as few as 20 samples for training, we outperform a strong TextGCN baseline by 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.

2022

Meta-Learning for Fast Cross-Lingual Adaptation in Dependency Parsing
Anna Langedijk | Verna Dankers | Phillip Lippe | Sander Bos | Bryan Cardenas Guevara | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Meta-learning, or learning to learn, is a technique that can help to overcome resource scarcity in cross-lingual NLP problems, by enabling fast adaptation to new tasks. We apply model-agnostic meta-learning (MAML) to the task of cross-lingual dependency parsing. We train our model on a diverse set of languages to learn a parameter initialization that can adapt quickly to new languages. We find that meta-learning with pre-training can significantly improve upon the performance of language transfer and standard supervised learning baselines for a variety of unseen, typologically diverse, and low-resource languages, in a few-shot learning setup.

The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts
Andrew Caines | Helen Yannakoudakis | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

A unified framework for cross-domain and cross-task learning of mental health conditions
Huikai Chua | Andrew Caines | Helen Yannakoudakis
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

The detection of mental health conditions based on an individual’s use of language has received considerable attention in the NLP community. However, most work has focused on single-task and single-domain models, limiting the semantic space that they are able to cover and risking significant cross-domain loss. In this paper, we present two approaches towards a unified framework for cross-domain and cross-task learning for the detection of depression, post-traumatic stress disorder and suicide risk across different platforms that further utilizes inductive biases across tasks. Firstly, we develop a lightweight model using a general set of features that sets a new state of the art on several tasks while matching the performance of more complex task- and domain-specific systems on others. We also propose a multi-task approach and further extend our framework to explicitly capture the affective characteristics of someone’s language, further consolidating transfer of inductive biases and of shared linguistic characteristics. Finally, we present a novel dynamically adaptive loss weighting approach that allows for more stable learning across imbalanced datasets and better neural generalization performance. Our results demonstrate the effectiveness of our unified framework for mental ill-health detection across a number of diverse English datasets.

Scientific and Creative Analogies in Pretrained Language Models
Tamara Czinczoll | Helen Yannakoudakis | Pushkar Mishra | Ekaterina Shutova
Findings of the Association for Computational Linguistics: EMNLP 2022

This paper examines the encoding of analogy in large-scale pretrained language models, such as BERT and GPT-2. Existing analogy datasets typically focus on a limited set of analogical relations, with a high similarity of the two domains between which the analogy holds. As a more realistic setup, we introduce the Scientific and Creative Analogy dataset (SCAN), a novel analogy dataset containing systematic mappings of multiple attributes and relational structures across dissimilar domains. Using this dataset, we test the analogical reasoning capabilities of several widely-used pretrained language models (LMs). We find that state-of-the-art LMs achieve low performance on these complex analogy tasks, highlighting the challenges still posed by analogy understanding.

Authorship Verification for Arabic Short Texts Using Arabic Knowledge-Base Model (AraKB)
Fatimah Alqahtani | Helen Yannakoudakis
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Arabic is a Semitic language, considered to be one of the most complex languages in the world due to its unique composition and complex linguistic features. This complexity makes verifying the authorship of Arabic texts challenging and calls for extensive research and development. This paper presents a new knowledge-based model to enhance Natural Language Understanding and thereby improve authorship verification performance. The proposed model provided promising results that would benefit Arabic research across different Natural Language Processing tasks.

2021

Modeling Users and Online Communities for Abuse Detection: A Position on Ethics and Explainability
Pushkar Mishra | Helen Yannakoudakis | Ekaterina Shutova
Findings of the Association for Computational Linguistics: EMNLP 2021

Abuse on the Internet is an important societal problem of our time. Millions of Internet users face harassment, racism, personal attacks, and other types of abuse across various platforms. The psychological effects of abuse on individuals can be profound and lasting. Consequently, over the past few years, there has been a substantial research effort towards automated abusive language detection in the field of NLP. In this position paper, we discuss the role that modeling of users and online communities plays in abuse detection. Specifically, we review and analyze the state-of-the-art methods that leverage user or community information to enhance the understanding and detection of abusive language. We then explore the ethical challenges of incorporating user and community information, laying out considerations to guide future research. Finally, we address the topic of explainability in abusive language detection, proposing properties that an explainable method should aim to exhibit. We describe how user and community information can facilitate the realization of these properties and discuss the effective operationalization of explainability in view of the properties.

Zero-shot Sequence Labeling for Transformer-based Sentence Classifiers
Kamil Bujel | Helen Yannakoudakis | Marek Rei
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

We investigate how sentence-level transformers can be modified into effective sequence labelers at the token level without any direct supervision. Existing approaches to zero-shot sequence labeling do not perform well when applied to transformer-based architectures. As transformers contain multiple layers of multi-head self-attention, information in the sentence gets distributed between many tokens, negatively affecting zero-shot token-level performance. We find that a soft attention module which explicitly encourages sharpness of attention weights can significantly outperform existing methods.

Multilingual and cross-lingual document classification: A meta-learning approach
Niels van der Heijden | Helen Yannakoudakis | Pushkar Mishra | Ekaterina Shutova
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The great majority of languages in the world are considered under-resourced for successful application of deep learning methods. In this work, we propose a meta-learning approach to document classification in low-resource languages and demonstrate its effectiveness in two different settings: few-shot, cross-lingual adaptation to previously unseen languages; and multilingual joint-training when limited target-language data is available during training. We conduct a systematic comparison of several meta-learning methods, investigate multiple settings in terms of data availability, and show that meta-learning thrives in settings with a heterogeneous task distribution. We propose a simple, yet effective adjustment to existing meta-learning methods which allows for better and more stable learning, and set a new state of the art on a number of languages while performing on par on others, using only a small amount of labeled data.

Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Andrea Horbach | Ekaterina Kochmar | Ronja Laarmann-Quante | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Ruddit: Norms of Offensiveness for English Reddit Comments
Rishav Hada | Sohi Sudhir | Pushkar Mishra | Helen Yannakoudakis | Saif M. Mohammad | Ekaterina Shutova
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

On social media platforms, hateful and offensive language negatively impacts the mental well-being of users and the participation of people from diverse backgrounds. Automatic methods to detect offensive language have largely relied on datasets with categorical labels. However, comments can vary in their degree of offensiveness. We create the first dataset of English-language Reddit comments that has fine-grained, real-valued scores between -1 (maximally supportive) and 1 (maximally offensive). The dataset was annotated using Best–Worst Scaling, a form of comparative annotation that has been shown to alleviate known biases of using rating scales. We show that the method produces highly reliable offensiveness scores. Finally, we evaluate the ability of widely-used neural models to predict offensiveness scores on this new dataset.
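
As an illustration of how Best–Worst Scaling judgements can yield real-valued scores in the range -1 to 1, the sketch below uses the simple counting procedure often paired with this annotation scheme: the fraction of times an item was picked as most offensive minus the fraction of times it was picked as least offensive. The abstract does not state which aggregation the authors used, so this is an assumption; the function name `bws_scores`, the tuple layout, and the toy data are hypothetical.

```python
from collections import Counter


def bws_scores(annotations):
    """Aggregate Best-Worst Scaling judgements into scores in [-1, 1].

    `annotations` is an iterable of (items, most, least) tuples, where
    `items` is the tuple of comment ids shown together, `most` is the id
    judged most offensive and `least` the id judged least offensive.
    (Hypothetical layout, chosen for illustration only.)
    """
    appearances = Counter()
    most_counts = Counter()
    least_counts = Counter()
    for items, most, least in annotations:
        appearances.update(items)   # each shown item appears once per tuple
        most_counts[most] += 1
        least_counts[least] += 1
    # Counting procedure: (#most - #least) / #appearances for every item.
    return {
        item: (most_counts[item] - least_counts[item]) / n
        for item, n in appearances.items()
    }


# Toy example: one 4-tuple of comments judged by three annotators.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
    (("a", "b", "c", "d"), "b", "d"),
]
scores = bws_scores(annotations)
for item, score in sorted(scores.items()):
    print(item, round(score, 3))
```

Under this counting scheme, an item always chosen as most offensive scores 1 and one always chosen as least offensive scores -1, which matches the score range described in the abstract.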

2020

Joint Modelling of Emotion and Abusive Language Detection
Santhosh Rajamanickam | Pushkar Mishra | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the linguistic properties of the comments and the online communities of users, disregarding the emotional state of the users and how this might affect their language. The latter is, however, inextricably linked to abusive behaviour. In this paper, we present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework that allows one task to inform the other. Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets.

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
Simon Flachs | Ophélie Lacroix | Helen Yannakoudakis | Marek Rei | Anders Søgaard
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which however is only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work shall facilitate the development of open-domain GEC models that generalize to different topics and genres.

Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The Teacher-Student Chatroom Corpus
Andrew Caines | Helen Yannakoudakis | Helena Edmondson | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

Learning to Learn to Disambiguate: Meta-Learning for Few-Shot Word Sense Disambiguation
Nithin Holla | Pushkar Mishra | Helen Yannakoudakis | Ekaterina Shutova
Findings of the Association for Computational Linguistics: EMNLP 2020

The success of deep learning methods hinges on the availability of large training datasets annotated for the task of interest. In contrast to human intelligence, these methods lack versatility and struggle to learn and adapt quickly to new tasks, where labeled data is scarce. Meta-learning aims to solve this problem by training a model on a large number of few-shot tasks, with an objective to learn new tasks quickly from a small number of examples. In this paper, we propose a meta-learning framework for few-shot word sense disambiguation (WSD), where the goal is to learn to disambiguate unseen words from only a few labeled instances. Meta-learning approaches have so far been typically tested in an N-way, K-shot classification setting where each task has N classes with K examples per class. Owing to its nature, WSD deviates from this controlled setup and requires the models to handle a large number of highly unbalanced classes. We extend several popular meta-learning approaches to this scenario, and analyze their strengths and weaknesses in this new challenging setting.

Investigating the effect of auxiliary objectives for the automated grading of learner English speech transcriptions
Hannah Craighead | Andrew Caines | Paula Buttery | Helen Yannakoudakis
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We address the task of automatically grading the language proficiency of spontaneous speech based on textual features from automatic speech recognition transcripts. Motivated by recent advances in multi-task learning, we develop neural networks trained in a multi-task fashion that learn to predict the proficiency level of non-native English speakers by taking advantage of inductive transfer between the main task (grading) and auxiliary prediction tasks: morpho-syntactic labeling, language modeling, and native language identification (L1). We encode the transcriptions with both bi-directional recurrent neural networks and with bi-directional representations from transformers, compare against a feature-rich baseline, and analyse performance at different proficiency levels and with transcriptions of varying error rates. Our best performance comes from a transformer encoder with L1 prediction as an auxiliary task. We discuss areas for improvement and potential applications for text-only speech scoring.

Analyzing Neural Discourse Coherence Models
Youmna Farag | Josef Valvoda | Helen Yannakoudakis | Ted Briscoe
Proceedings of the First Workshop on Computational Approaches to Discourse

In this work, we systematically investigate how well current models of coherence can capture aspects of text implicated in discourse organisation. We devise two datasets of various linguistic alterations that undermine coherence and test model sensitivity to changes in syntax and semantics. We furthermore probe discourse embedding space and examine the knowledge that is encoded in representations of coherence. We hope this study shall provide further insight into how to frame the task and improve models of coherence assessment. Finally, we make our datasets publicly available as a resource for researchers to use to test discourse coherence models.

2019

Multi-Task Learning for Coherence Modeling
Youmna Farag | Helen Yannakoudakis
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We address the task of assessing discourse coherence, an aspect of text quality that is essential for many NLP tasks, such as summarization and language assessment. We propose a hierarchical neural network trained in a multi-task fashion that learns to predict a document-level coherence score (at the network’s top layers) along with word-level grammatical roles (at the bottom layers), taking advantage of inductive transfer between the two tasks. We assess the extent to which our framework generalizes to different domains and prediction tasks, and demonstrate its effectiveness not only on standard binary evaluation coherence tasks, but also on real-world tasks involving the prediction of varying degrees of coherence, achieving a new state of the art.

Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Helen Yannakoudakis | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Torsten Zesch
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Context is Key: Grammatical Error Detection with Contextual Word Representations
Samuel Bell | Helen Yannakoudakis | Marek Rei
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Grammatical error detection (GED) in non-native writing requires systems to identify a wide range of errors in text written by language learners. Error detection as a purely supervised task can be challenging, as GED datasets are limited in size and the label distributions are highly imbalanced. Contextualized word representations offer a possible solution, as they can efficiently capture compositional information in language and can be optimized on large amounts of unsupervised data. In this paper, we perform a systematic comparison of ELMo, BERT and Flair embeddings (Peters et al., 2017; Devlin et al., 2018; Akbik et al., 2018) on a range of public GED datasets, and propose an approach to effectively integrate such representations in current methods, achieving a new state of the art on GED. We further analyze the strengths and weaknesses of different contextual embeddings for the task at hand, and present detailed analyses of their impact on different types of errors.

Learning Outside the Box: Discourse-level Features Improve Metaphor Identification
Jesse Mu | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Most current approaches to metaphor identification use restricted linguistic contexts, e.g. by considering only a verb’s arguments or the sentence containing a phrase. Inspired by pragmatic accounts of metaphor, we argue that broader discourse features are crucial for better metaphor identification. We train simple gradient boosting classifiers on representations of an utterance and its surrounding discourse learned with a variety of document embedding methods, obtaining near state-of-the-art results on the 2018 VU Amsterdam metaphor identification task without the complex metaphor-specific features or deep neural architectures employed by other systems. A qualitative analysis further confirms the need for broader context in metaphor processing.

CAMsterdam at SemEval-2019 Task 6: Neural and graph-based feature extraction for the identification of offensive tweets
Guy Aglionby | Chris Davis | Pushkar Mishra | Andrew Caines | Helen Yannakoudakis | Marek Rei | Ekaterina Shutova | Paula Buttery
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe the CAMsterdam team entry to the SemEval-2019 Shared Task 6 on offensive language identification in Twitter data. Our proposed model learns to extract textual features using a multi-layer recurrent network, and then performs text classification using gradient-boosted decision trees (GBDT). A self-attention architecture enables the model to focus on the most relevant areas in the text. In order to enrich input representations, we use node2vec to learn globally optimised embeddings for hashtags, which are then given as additional features to the GBDT classifier. Our best model obtains 78.79% macro F1-score on detecting offensive language (subtask A), 66.32% on categorising offence types (targeted/untargeted; subtask B), and 55.36% on identifying the target of offence (subtask C).

Abusive Language Detection with Graph Convolutional Networks
Pushkar Mishra | Marco Del Tredici | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Abuse on the Internet represents a significant societal problem of our time. Previous research on automated abusive language detection in Twitter has shown that community-based profiling of users is a promising technique for this task. However, existing approaches only capture shallow properties of online communities by modeling follower–following relationships. In contrast, working with graph convolutional networks (GCNs), we present the first approach that captures not only the structure of online communities but also the linguistic behavior of the users within them. We show that such a heterogeneous graph-structured modeling of communities significantly advances the current state of the art in abusive language detection.

A Simple and Robust Approach to Detecting Subject-Verb Agreement Errors
Simon Flachs | Ophélie Lacroix | Marek Rei | Helen Yannakoudakis | Anders Søgaard
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

While rule-based detection of subject-verb agreement (SVA) errors is sensitive to syntactic parsing errors and irregularities and exceptions to the main rules, neural sequential labelers have a tendency to overfit their training data. We observe that rule-based error generation is less sensitive to syntactic parsing errors and irregularities than error detection and explore a simple, yet efficient approach to getting the best of both worlds: We train neural sequential labelers on the combination of large volumes of silver standard data, obtained through rule-based error generation, and gold standard data. We show that our simple protocol leads to more robust detection of SVA errors on both in-domain and out-of-domain data, as well as in the context of other errors and long-distance dependencies; and across four standard benchmarks, the induced model on average achieves a new state of the art.

Neural and FST-based approaches to grammatical error correction
Zheng Yuan | Felix Stahlberg | Marek Rei | Bill Byrne | Helen Yannakoudakis
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we describe our submission to the BEA 2019 shared task on grammatical error correction. We present a system pipeline that utilises both error detection and correction models. The input text is first corrected by two complementary neural machine translation systems: one using convolutional networks and multi-task learning, and another using a neural Transformer-based system. Training is performed on publicly available data, along with artificial examples generated through back-translation. The n-best lists of these two machine translation systems are then combined and scored using a finite state transducer (FST). Finally, an unsupervised re-ranking system is applied to the n-best output of the FST. The re-ranker uses a number of error detection features to re-rank the FST n-best list and identify the final 1-best correction hypothesis. Our system achieves 66.75% F0.5 on error correction (ranking 4th), and 82.52% F0.5 on token-level error detection (ranking 2nd) in the restricted track of the shared task.

2018

Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Helen Yannakoudakis
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

Neural Character-based Composition Models for Abuse Detection
Pushkar Mishra | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

The advent of social media in recent years has fed into some highly undesirable phenomena such as proliferation of offensive language, hate speech, sexist remarks, etc. on the Internet. In light of this, there have been several efforts to automate the detection and moderation of such abusive content. However, deliberate obfuscation of words by users to evade detection poses a serious challenge to the effectiveness of these efforts. The current state-of-the-art approaches to abusive language detection, based on recurrent neural networks, do not explicitly address this problem and resort to a generic OOV (out of vocabulary) embedding for unseen words. However, in using a single embedding for all unseen words we lose the ability to distinguish between obfuscated and non-obfuscated or rare words. In this paper, we address this problem by designing a model that can compose embeddings for unseen words. We experimentally demonstrate that our approach significantly advances the current state of the art in abuse detection on datasets from two different domains, namely Twitter and Wikipedia talk page.

Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input
Youmna Farag | Helen Yannakoudakis | Ted Briscoe
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences. We develop a neural model of local coherence that can effectively learn connectedness features between sentences, and propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens the validity of neural essay scoring models.

Author Profiling for Abuse Detection
Pushkar Mishra | Marco Del Tredici | Helen Yannakoudakis | Ekaterina Shutova
Proceedings of the 27th International Conference on Computational Linguistics

The rapid growth of social media in recent years has fed into some highly undesirable phenomena such as proliferation of hateful and offensive language on the Internet. Previous research suggests that such abusive content tends to come from users who share a set of common stereotypes and form communities around them. The current state-of-the-art approaches to abuse detection are oblivious to user and community information and rely entirely on textual (i.e., lexical and semantic) cues. In this paper, we propose a novel approach to this problem that incorporates community-based profiling features of Twitter users. Experimenting with a dataset of 16k tweets, we show that our methods significantly outperform the current state of the art in abuse detection. Further, we conduct a qualitative analysis of model characteristics. We release our code, pre-trained models and all the resources used in the public domain.

2017

Auxiliary Objectives for Neural Error Detection Models
Marek Rei | Helen Yannakoudakis
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We investigate the utility of different auxiliary objectives and training strategies within a neural sequence labeling approach to error detection in learner writing. Auxiliary costs provide the model with additional linguistic information, allowing it to learn general-purpose compositional features that can then be exploited for other objectives. Our experiments show that a joint learning approach trained with parallel labels on in-domain data improves performance over the previous best error detection system. While the resulting model has the same number of parameters, the additional objectives allow it to be optimised more efficiently and achieve better performance.

Neural Sequence-Labelling Models for Grammatical Error Correction
Helen Yannakoudakis | Marek Rei | Øistein E. Andersen | Zheng Yuan
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We propose an approach to N-best list reranking using neural sequence-labelling models. We train a compositional model for error detection that calculates the probability of each token in a sentence being correct or incorrect, utilising the full sentence as context. Using the error detection model, we then re-rank the N best hypotheses generated by statistical machine translation systems. Our approach achieves state-of-the-art results on error correction for three different datasets, and it has the additional advantage of only using a small set of easily computed features that require no linguistic input.

Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock | Helen Yannakoudakis
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Semantic Frames and Visual Scenes: Learning Semantic Role Inventories from Image and Video Descriptions
Ekaterina Shutova | Andreas Wundsam | Helen Yannakoudakis
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Frame-semantic parsing and semantic role labelling, which aim to automatically assign semantic roles to arguments of verbs in a sentence, have become an active strand of research in NLP. However, to date these methods have relied on a predefined inventory of semantic roles. In this paper, we present a method to automatically learn argument role inventories for verbs from large corpora of text, images and videos. We evaluate the method against manually constructed role inventories in FrameNet and show that the visual model outperforms the language-only model and operates with a high precision.

2016

Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock | Helen Yannakoudakis
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Unsupervised Modeling of Topical Relevance in L2 Learner Text
Ronan Cummins | Helen Yannakoudakis | Ted Briscoe
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Compositional Sequence Labeling Models for Error Detection in Learner Writing
Marek Rei | Helen Yannakoudakis
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic Text Scoring Using Neural Networks
Dimitrios Alikaniotis | Helen Yannakoudakis | Marek Rei
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Evaluating the performance of Automated Text Scoring systems
Helen Yannakoudakis | Ronan Cummins
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

2014

Grammatical error correction using hybrid systems and type filtering
Mariano Felice | Zheng Yuan | Øistein E. Andersen | Helen Yannakoudakis | Ekaterina Kochmar
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

2013

Developing and testing a self-assessment and tutoring system
Øistein E. Andersen | Helen Yannakoudakis | Fiona Barker | Tim Parish
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

2012

Automating Second Language Acquisition Research: Integrating Information Visualisation and Machine Learning
Helen Yannakoudakis | Ted Briscoe | Theodora Alexopoulou
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Modeling coherence in ESOL learner texts
Helen Yannakoudakis | Ted Briscoe
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

2011

A New Dataset and Method for Automatically Grading ESOL Texts
Helen Yannakoudakis | Ted Briscoe | Ben Medlock
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Co-authors

Paula Buttery 6

Claudia Leacock 6

Jill Burstein 5

Ekaterina Kochmar 5

Øistein E. Andersen 4

Nitin Madnani 3

Ildikó Pilán 3

Joel Tetreault 3

Torsten Zesch 3

Ronan Cummins 2

Ophélie Lacroix 2

Pascual Pérez-Paredes 2

Anders Søgaard 2

Marco Del Tredici 2

Niels van der Heijden 2

Theodora Alexopoulou 1

Dimitrios Alikaniotis 1

Fatimah Alqahtani 1

Taylor Arnold 1

Nicolas Ballier 1

Christopher Bryant 1

Bryan Cardenas Guevara 1

Stephen Casper 1

Hannah Craighead 1

Tamara Czinczoll 1

Verna Dankers 1

Christopher Davis 1

Chris Irwin Davis 1

Helena Edmondson 1

Mariano Felice 1

Severin Field 1

Thomas Gaillat 1

Clive Hamilton 1

Andrea Horbach 1

Jeroen Keppens 1

Nathalie Maria Kirch 1

Ronja Laarmann-Quante 1

Anna Langedijk 1

Phillip Lippe 1

Saif Mohammad 1

Diane Nicholls 1

Erin Pacquetet 1

Santhosh Rajamanickam 1

Avyav Kumar Singh 1

Felix Stahlberg 1

Shiva Taslimipoor 1

Josef Valvoda 1

Georgios Velentzas 1

Ivo Verhoeven 1

Constantin Niko Weisser 1

Andreas Wundsam 1

Xiantong Zhen 1

Chanjin Zheng 1

Venues