Andrew Caines - ACL Anthology

Andrew Caines

2026

PictureStories: Predicting the Task Adherence of Language Learner Answers to a Picture Story-Based Writing Task
Marie Bexte | Andrew Caines | Diane Nicholls | Paula Buttery | Torsten Zesch
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate the automated evaluation of English language learner answers to writing tasks featuring picture stories. This is usually limited to language proficiency only, neglecting the context of the picture. Instead, our analysis focuses on task adherence, which for example allows detection of off-topic answers. Since there is a lack of suitable training and evaluation data, our first step is to build the PictureStories dataset. To this end, we develop a marking rubric that covers task adherence with respect to both form and content. Six annotators mark 713 learner answers written in response to one of five picture stories. Having assembled the dataset, we then explore to what extent task adherence can be predicted automatically. Our experiments assume a scenario where no or just a few labelled answers are available for the picture story which is being marked. For form-focused criteria, we find that it is beneficial to finetune models across tasks. With content-focused criteria, few-shot prompting Qwen emerges as the best-performing method. We examine the trade-off between including the story image vs. example answers in the prompt and find that examples suffice in many cases. While for some LLMs, few-shot prompting results may look promising on the surface, we demonstrate that a much simpler method can do just as well when shown the same examples.

2025

Web(er) of Hate: A Survey on How Hate Speech Is Typed
Luna Wang | Andrew Caines | Alice Hutchings
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber’s notion of ideal types, we argue for a reflexive approach in dataset creation, urging researchers to acknowledge their own value judgments during dataset construction, fostering transparency and methodological rigour.

LLM-based post-editing as reference-free GEC evaluation
Robert Östling | Murathan Kurfali | Andrew Caines
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Evaluation of Grammatical Error Correction (GEC) systems is becoming increasingly challenging as the quality of such systems increases and traditional automatic metrics fail to adequately capture such nuances as fluency versus minimal edits, alternative valid corrections compared to the ‘ground truth’, and the difference between corrections that are useful in a language learning scenario versus those preferred by native readers. Previous work has suggested using human post-editing of GEC system outputs, but this is very labor-intensive. We investigate the use of Large Language Models (LLMs) as post-editors of English and Swedish texts, and perform a meta-analysis of a range of different evaluation setups using a set of recent GEC systems. We find that for the two languages studied in our work, automatic evaluation based on post-editing agrees well with both human post-editing and direct human rating of GEC systems. Furthermore, we find that a simple n-gram overlap metric is sufficient to measure post-editing distance, and that including human references when prompting the LLMs generally does not improve agreement with human ratings. The resulting evaluation metric is reference-free and requires no language-specific training or additional resources beyond an LLM capable of handling the given language.

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating the ability of neural machine translation (NMT) models and large language models (LLMs) to translate between English and these languages, at both the sentence and pseudo-document levels, the outputs being realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieves the best average performance among the standard NMT models, while GPT-4o outperforms general-purpose LLMs. Fine-tuning selected models leads to substantial performance gains, but models trained on sentences struggle to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, over-generation, repetition of words and phrases, and off-target translations, specifically for translation into African languages.

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Bianca-Mihaela Ganescu | Suchir Salhan | Andrew Caines | Paula Buttery
Proceedings of the First BabyLM Workshop

Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.

Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction
Suchir Salhan | Hongyi Gu | Donya Rooein | Diana Galvan-Sosa | Gabrielle Gaudeau | Andrew Caines | Zheng Yuan | Paula Buttery
Proceedings of the First BabyLM Workshop

Multi-turn dialogues between a child and caregiver are characterized by a property called contingency – prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a Teacher–Student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive Teacher decoding strategies show limited additional gains. ContingentChat highlights the positive benefits of targeted post-training on dialogue quality and presents contingency as a challenging goal for BabyLMs.

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models
Yuan Gao | Suchir Salhan | Andrew Caines | Paula Buttery | Weiwei Sun
Proceedings of the First BabyLM Workshop

Cross-lingual extensions of the BabyLM Shared Task beyond English incentivise the development of Small Language Models that simulate a much wider range of language acquisition scenarios, including code-switching, simultaneous and successive bilingualism and second language acquisition. However, to our knowledge, there is no benchmark of the formal competence of cognitively-inspired models of L2 acquisition, or L2LMs. To address this, we introduce a Benchmark of Learner Interlingual Syntactic Structure (BLiSS). BLiSS consists of a dataset of 1.5M naturalistic minimal pairs derived from errorful sentence–correction pairs in parallel learner corpora. These are systematic patterns – overlooked by standard benchmarks of the formal competence of Language Models – which we use to evaluate L2LMs trained in a variety of training regimes on specific properties of L2 learner language to provide a linguistically-motivated framework for controlled measurement of the interlanguage competence of L2LMs.

The MultiGEC-2025 Shared Task on Multilingual Grammatical Error Correction at NLP4CALL
Arianna Masciolini | Andrew Caines | Orphée De Clercq | Joni Kruijsbergen | Murathan Kurfalı | Ricardo Muñoz Sánchez | Elena Volodina | Robert Östling
Proceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning

2024

Logging Keystrokes in Writing by English Learners
Georgios Velentzas | Andrew Caines | Rita Borgo | Erin Pacquetet | Clive Hamilton | Taylor Arnold | Diane Nicholls | Paula Buttery | Thomas Gaillat | Nicolas Ballier | Helen Yannakoudakis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Essay writing is a skill commonly taught and practised in schools. The ability to write a fluent and persuasive essay is often a major component of formal assessment. In natural language processing and education technology we may work with essays in their final form, for example to carry out automated assessment or grammatical error correction. In this work we collect and analyse data representing the essay writing process from start to finish, by recording every keystroke from multiple writers participating in our study. We describe our data collection methodology, the characteristics of the resulting dataset, and the assignment of proficiency levels to the texts. We discuss the ways the keystroke data can be used – for instance seeking to identify patterns in the keystrokes which might act as features in automated assessment or may enable further advancements in writing assistance – and the writing support technology which could be built with such information, if we can detect when writers are struggling to compose a section of their essay and offer appropriate intervention. We frame this work in the context of English language learning, but we note that keystroke logging is relevant more broadly to text authoring scenarios as well as cognitive or linguistic analyses of the writing process.

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely | Richard Diehl Martinez | Andrew Caines | Paula Buttery | Lisa Beinborn
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.

Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing
Richard Diehl Martinez | Zébulon Goriely | Andrew Caines | Paula Buttery | Lisa Beinborn
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.

Prompting open-source and commercial language models for grammatical error correction of English learner text
Christopher Davis | Andrew Caines | Øistein E. Andersen | Shiva Taslimipoor | Helen Yannakoudakis | Zheng Yuan | Christopher Bryant | Marek Rei | Paula Buttery
Findings of the Association for Computational Linguistics: ACL 2024

Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results against individual error types. Our results indicate that LLMs do not always outperform supervised English GEC models except in specific contexts – namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal edit benchmarks, and that in some settings zero-shot prompting is just as competitive as few-shot prompting.

Detecting Narrative Patterns in Biblical Hebrew and Greek
Hope McGovern | Hale Sirin | Tom Lippincott | Andrew Caines
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

We present a novel approach to extracting recurring narrative patterns, or type-scenes, in Biblical Hebrew and Biblical Greek with an information retrieval network. We use cross-references to train an encoder model to create similar representations for verses linked by a cross-reference. We then query our trained model with phrases informed by humanities scholarship and designed to elicit particular kinds of narrative scenes. Our models can surface relevant instances in the top-10 ranked candidates in many cases. Through manual error analysis and discussion, we address the limitations and challenges inherent in our approach. Our findings contribute to the field of Biblical scholarship by offering a new perspective on narrative analysis within ancient texts, and to computational modeling of narrative with a genre-agnostic approach for pattern-finding in long, literary texts.

LLM chatbots as a language practice tool: a user study
Gladys Tyen | Andrew Caines | Paula Buttery
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

Grammatical Error Correction for Code-Switched Sentences by Learners of English
Kelvin Wey Han Chan | Christopher Bryant | Li Nguyen | Andrew Caines | Zheng Yuan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Code-switching (CSW) is a common phenomenon among multilingual speakers where multiple languages are used in a single discourse or utterance. Mixed language utterances may still contain grammatical errors, yet most existing Grammar Error Correction (GEC) systems have been trained on monolingual data and not developed with CSW in mind. In this work, we conduct the first exploration into the use of GEC systems on CSW text. Through this exploration, we propose a novel method of generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 F0.5 across 3 CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model’s performance on a monolingual dataset. We furthermore discovered that models trained on one CSW language generalise relatively well to other typologically similar CSW languages.
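
For readers unfamiliar with the F0.5 score reported above, the following is a minimal sketch of the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. The function name `f_beta` and the example values are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's evaluation pipeline.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: a system with high precision but low recall.
score = f_beta(precision=0.6, recall=0.3)
```

With beta = 0.5, a system that proposes few but accurate corrections scores higher than one with the same values swapped.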

2023

MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection
Elena Volodina | Christopher Bryant | Andrew Caines | Orphée De Clercq | Jennifer-Carmen Frey | Elizaveta Ershova | Alexandr Rosen | Olga Vinogradova
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

CLIMB – Curriculum Learning for Infant-inspired Model Building
Richard Diehl Martinez | Zébulon Goriely | Hope McGovern | Christopher Davis | Andrew Caines | Paula Buttery | Lisa Beinborn
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

2022

The Specificity and Helpfulness of Peer-to-Peer Feedback in Higher Education
Roman Rietsche | Andrew Caines | Cornelius Schramm | Dominik Pfütze | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

With the growth of online learning through MOOCs and other educational applications, it has become increasingly difficult for course providers to offer personalized feedback to students. Therefore asking students to provide feedback to each other has become one way to support learning. This peer-to-peer feedback has become increasingly important whether in MOOCs to provide feedback to thousands of students or in large-scale classes at universities. One of the challenges when allowing peer-to-peer feedback is that the feedback should be perceived as helpful, and an important factor determining helpfulness is how specific the feedback is. However, in classes including thousands of students, instructors do not have the resources to check the specificity of every piece of feedback between students. Therefore, we present an automatic classification model to measure sentence specificity in written feedback. The model was trained and tested on student feedback texts written in German where sentences have been labelled as general or specific. We find that we can automatically classify the sentences with an accuracy of 76.7% using a conventional feature-based approach, whereas transfer learning with BERT for German gives a classification accuracy of 81.1%. However, the feature-based approach comes with lower computational costs and preserves human interpretability of the coefficients. In addition we show that specificity of sentences in feedback texts has a weak positive correlation with perceptions of helpfulness. This indicates that specificity is one of the ingredients of good feedback, and invites further investigation.

Towards an open-domain chatbot for language practice
Gladys Tyen | Mark Brenchley | Andrew Caines | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

State-of-the-art chatbots for English are now able to hold conversations on virtually any topic (e.g. Adiwardana et al., 2020; Roller et al., 2021). However, existing dialogue systems in the language learning domain still use hand-crafted rules and pattern matching, and are much more limited in scope. In this paper, we make an initial foray into adapting open-domain dialogue generation for second language learning. We propose and implement decoding strategies that can adjust the difficulty level of the chatbot according to the learner’s needs, without requiring further training of the chatbot. These strategies are then evaluated using judgements from human examiners trained in language education. Our results show that re-ranking candidate outputs is a particularly effective strategy, and performance can be further improved by adding sub-token penalties and filtering.

Probing for targeted syntactic knowledge through grammatical error detection
Christopher Davis | Christopher Bryant | Andrew Caines | Marek Rei | Paula Buttery
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Targeted studies testing knowledge of subject-verb agreement (SVA) indicate that pre-trained language models encode syntactic information. We assert that if models robustly encode subject-verb agreement, they should be able to identify when agreement is correct and when it is incorrect. To that end, we propose grammatical error detection as a diagnostic probe to evaluate token-level contextual representations for their knowledge of SVA. We evaluate contextual representations at each layer from five pre-trained English language models: BERT, XLNet, GPT-2, RoBERTa and ELECTRA. We leverage public annotated training data from both English second language learners and Wikipedia edits, and report results on manually crafted stimuli for subject-verb agreement. We find that masked language models linearly encode information relevant to the detection of SVA errors, while the autoregressive models perform on par with our baseline. However, we also observe a divergence in performance when probes are trained on different training sets, and when they are evaluated on different syntactic constructions, suggesting the information pertaining to SVA error detection is not robustly encoded.

The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts
Andrew Caines | Helen Yannakoudakis | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

A unified framework for cross-domain and cross-task learning of mental health conditions
Huikai Chua | Andrew Caines | Helen Yannakoudakis
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

The detection of mental health conditions based on an individual’s use of language has received considerable attention in the NLP community. However, most work has focused on single-task and single-domain models, limiting the semantic space that they are able to cover and risking significant cross-domain loss. In this paper, we present two approaches towards a unified framework for cross-domain and cross-task learning for the detection of depression, post-traumatic stress disorder and suicide risk across different platforms that further utilizes inductive biases across tasks. Firstly, we develop a lightweight model using a general set of features that sets a new state of the art on several tasks while matching the performance of more complex task- and domain-specific systems on others. We also propose a multi-task approach and further extend our framework to explicitly capture the affective characteristics of someone’s language, further consolidating transfer of inductive biases and of shared linguistic characteristics. Finally, we present a novel dynamically adaptive loss weighting approach that allows for more stable learning across imbalanced datasets and better neural generalization performance. Our results demonstrate the effectiveness of our unified framework for mental ill-health detection across a number of diverse English datasets.

ALEN App: Argumentative Writing Support To Foster English Language Learning
Thiemo Wambsganss | Andrew Caines | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

This paper introduces a novel tool to support and engage English language learners with feedback on the quality of their argument structures. We present an approach which automatically detects claim-premise structures and provides visual feedback to the learner to prompt them to repair any broken argumentation structures. To investigate whether our persuasive feedback on language learners’ essay writing tasks engages and supports them in learning better English, we designed the ALEN app (Argumentation for Learning English). We leverage an argumentation mining model trained on texts written by students and embed it in a writing support tool which provides students with feedback in their essay writing process. We evaluated our tool in two field studies with a total of 28 students from a German high school to investigate the effects of adaptive argumentation feedback on their learning of English. The quantitative results suggest that using the ALEN app leads to high self-efficacy, ease-of-use, intention to use and perceived usefulness for students in their English language learning process. Moreover, the qualitative answers indicate the potential benefits of combining grammar feedback with discourse level argumentation mining.

2021

Efficient Unsupervised NMT for Related Languages with Cross-Lingual Language Models and Fidelity Objectives
Rami Aly | Andrew Caines | Paula Buttery
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

The most successful approach to Neural Machine Translation (NMT) when only monolingual training data is available, called unsupervised machine translation, is based on back-translation where noisy translations are generated to turn the task into a supervised one. However, back-translation is computationally very expensive and inefficient. This work explores a novel, efficient approach to unsupervised NMT. A transformer, initialized with cross-lingual language model weights, is fine-tuned exclusively on monolingual data of the target language by jointly learning on a paraphrasing and denoising autoencoder objective. Experiments are conducted on WMT datasets for German-English, French-English, and Romanian-English. Results are competitive with strong baseline unsupervised NMT models, especially for closely related source languages (German) compared to more distant ones (Romanian, French), while requiring about an order of magnitude less training time.

2020

Detecting Trending Terms in Cybersecurity Forum Discussions
Jack Hughes | Seth Aycock | Andrew Caines | Paula Buttery | Alice Hutchings
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We present a lightweight method for identifying currently trending terms in relation to a known prior of terms, using a weighted log-odds ratio with an informative prior. We apply this method to a dataset of posts from an English-language underground hacking forum, spanning over ten years of activity, with posts containing misspellings, orthographic variation, acronyms, and slang. Our statistical approach supports analysis of linguistic change and discussion topics over time, without a requirement to train a topic model for each time interval for analysis. We evaluate the approach by comparing the results to TF-IDF using the discounted cumulative gain metric with human annotations, finding our method outperforms TF-IDF on information retrieval.
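
As a rough illustration of the kind of statistic named in this abstract, the sketch below computes a weighted log-odds ratio with an informative Dirichlet prior in the commonly used formulation of Monroe, Colaresi and Quinn (2008). It is an assumption that this matches the paper's exact formulation; the function name, the choice of prior (pooled counts) and the toy counts are all hypothetical.

```python
import math
from collections import Counter


def weighted_log_odds(target: Counter, reference: Counter, prior: Counter) -> dict:
    """Return a z-score per term; positive means over-represented in `target`.

    `prior` holds background counts used as Dirichlet pseudo-counts
    (assumed here to cover every term of interest).
    """
    n_t = sum(target.values())
    n_r = sum(reference.values())
    a_0 = sum(prior.values())
    scores = {}
    for term, a_w in prior.items():
        y_t = target.get(term, 0)
        y_r = reference.get(term, 0)
        # Smoothed log-odds of the term in each corpus.
        lo_t = math.log((y_t + a_w) / (n_t + a_0 - y_t - a_w))
        lo_r = math.log((y_r + a_w) / (n_r + a_0 - y_r - a_w))
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (y_t + a_w) + 1.0 / (y_r + a_w)
        scores[term] = (lo_t - lo_r) / math.sqrt(variance)
    return scores


# Toy counts (hypothetical): a recent time window versus an earlier one.
recent = Counter({"exploit": 30, "ransomware": 25, "password": 20})
earlier = Counter({"exploit": 28, "ransomware": 2, "password": 45})
scores = weighted_log_odds(recent, earlier, prior=recent + earlier)
trending = max(scores, key=scores.get)
```

Dividing by the estimated standard deviation damps the scores of rare terms, which is one reason this family of statistics suits noisy forum text; no per-interval topic model is needed.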

Investigating the effect of auxiliary objectives for the automated grading of learner English speech transcriptions
Hannah Craighead | Andrew Caines | Paula Buttery | Helen Yannakoudakis
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We address the task of automatically grading the language proficiency of spontaneous speech based on textual features from automatic speech recognition transcripts. Motivated by recent advances in multi-task learning, we develop neural networks trained in a multi-task fashion that learn to predict the proficiency level of non-native English speakers by taking advantage of inductive transfer between the main task (grading) and auxiliary prediction tasks: morpho-syntactic labeling, language modeling, and native language identification (L1). We encode the transcriptions with both bi-directional recurrent neural networks and with bi-directional representations from transformers, compare against a feature-rich baseline, and analyse performance at different proficiency levels and with transcriptions of varying error rates. Our best performance comes from a transformer encoder with L1 prediction as an auxiliary task. We discuss areas for improvement and potential applications for text-only speech scoring.

An Expectation Maximisation Algorithm for Automated Cognate Detection
Roddy MacSween | Andrew Caines
Proceedings of the 24th Conference on Computational Natural Language Learning

In historical linguistics, cognate detection is the task of determining whether sets of words have common etymological roots. Inspired by the comparative method used by human linguists, we develop a system for automated cognate detection that frames the task as an inference problem for a general statistical model consisting of observed data (potentially cognate pairs of words), latent variables (the cognacy status of pairs) and unknown global parameters (which sounds correspond between languages). We then give a specific instance of such a model along with an expectation-maximisation algorithm to infer its parameters. We evaluate our system on a corpus of 8140 cognate sets, finding the performance of our method to be comparable to the state of the art. We additionally carry out qualitative analysis demonstrating advantages it has over existing systems. We also suggest several ways our work could be extended within the general theoretical framework we propose.

The Teacher-Student Chatroom Corpus
Andrew Caines | Helen Yannakoudakis | Helena Edmondson | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish Learner Essays
Andrew Caines | Paula Buttery
Proceedings of the Twelfth Language Resources and Evaluation Conference

We report on our attempts to reproduce the work described in Vajjala & Rama 2018, ‘Experiments with universal CEFR classification’, as part of REPROLANG 2020: this involves feature-based and neural approaches to essay scoring in Czech, German and Italian. Our results are broadly in line with those from the original paper, with some differences due to the stochastic nature of machine learning and the programming language used. We correct an error in the reported metrics, introduce new baselines, apply the experiments to English and Spanish corpora, and generate adversarial data to test classifier robustness. We conclude that feature-based approaches perform better than neural network classifiers for text datasets of this size, though neural network modifications do bring performance closer to the best feature-based models.

Grammatical error detection in transcriptions of spoken English
Andrew Caines | Christian Bentz | Kate Knill | Marek Rei | Paula Buttery
Proceedings of the 28th International Conference on Computational Linguistics

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics. The corpus recordings were crowdsourced from native speakers of English and learners of English with German as their first language. The new transcriptions and annotations are obtained from different crowdworkers: we analyse the 1108 new crowdworker submissions and propose that they can be used for automatic transcription post-editing and grammatical error correction for speech. To further explore the data we train grammatical error detection models with various configurations including pre-trained and contextual word representations as input, additional features and auxiliary objectives, and extra training data from written error-annotated corpora. We find that a model concatenating pre-trained and contextual word representations as input performs best, and that additional information does not lead to further performance gains.

2019

CAMsterdam at SemEval-2019 Task 6: Neural and graph-based feature extraction for the identification of offensive tweets
Guy Aglionby | Chris Davis | Pushkar Mishra | Andrew Caines | Helen Yannakoudakis | Marek Rei | Ekaterina Shutova | Paula Buttery
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe the CAMsterdam team entry to the SemEval-2019 Shared Task 6 on offensive language identification in Twitter data. Our proposed model learns to extract textual features using a multi-layer recurrent network, and then performs text classification using gradient-boosted decision trees (GBDT). A self-attention architecture enables the model to focus on the most relevant areas in the text. In order to enrich input representations, we use node2vec to learn globally optimised embeddings for hashtags, which are then given as additional features to the GBDT classifier. Our best model obtains 78.79% macro F1-score on detecting offensive language (subtask A), 66.32% on categorising offence types (targeted/untargeted; subtask B), and 55.36% on identifying the target of offence (subtask C).

2018

Aggressive language in an online hacking forum
Andrew Caines | Sergio Pastrana | Alice Hutchings | Paula Buttery
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

We probe the heterogeneity in levels of abusive language in different sections of the Internet, using an annotated corpus of Wikipedia page edit comments to train a binary classifier for abuse detection. Our test data come from the CrimeBB Corpus of hacking-related forum posts and we find that (a) forum interactions are rarely abusive, (b) the abusive language which does exist tends to be relatively mild compared to that found in the Wikipedia comments domain, and tends to involve aggressive posturing rather than hate speech or threats of violence. We observe that the purpose of conversations in online forums tends to be more constructive and informative than that of Wikipedia page edit comments, which are geared more towards adversarial interactions, and that this may explain the lower levels of abuse found in our forum data than in Wikipedia comments. Further work remains to be done to compare these results with other inter-domain classification experiments, and to understand the impact of aggressive language in forum conversations.

2017

Parsing transcripts of speech
Andrew Caines | Michael McCarthy | Paula Buttery
Proceedings of the Workshop on Speech-Centric Natural Language Processing

We present an analysis of parser performance on speech data, comparing word type and token frequency distributions with written data, and evaluating parse accuracy by length of input string. We find that parser performance tends to deteriorate with increasing length of string, more so for spoken than for written texts. We train an alternative parsing model with added speech data and demonstrate improvements in accuracy on speech-units, with no deterioration in performance on written text.

Collecting fluency corrections for spoken learner English
Andrew Caines | Emma Flint | Paula Buttery
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We present a crowdsourced collection of error annotations for transcriptions of spoken learner English. Our emphasis in data collection is on fluency corrections, a more complete correction than has traditionally been aimed for in grammatical error correction (GEC) research. Fluency corrections require improvements to the text, taking discourse and utterance level semantics into account: the result is a more naturalistic, holistic version of the original. We propose that this shifted emphasis be reflected in a new name for the task: ‘holistic error correction’ (HEC). We analyse crowdworker behaviour in HEC and conclude that the method is useful with certain amendments for future work.

A Text Normalisation System for Non-Standard English Words
Emma Flint | Elliot Ford | Olivia Thomas | Andrew Caines | Paula Buttery
Proceedings of the 3rd Workshop on Noisy User-generated Text

This paper investigates the problem of text normalisation; specifically, the normalisation of non-standard words (NSWs) in English. Non-standard words can be defined as those word tokens which do not have a dictionary entry, and cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g. lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of text-to-speech technology, and the solution is to spell them out in such a way that they can be pronounced appropriately. We describe our four-stage normalisation system made up of components for detection, classification, division and expansion of NSWs. Performance is favourable compared to previous work in the field (Sproat et al. 2001, Normalization of non-standard words), as well as state-of-the-art text-to-speech software. Further, we update Sproat et al.’s NSW taxonomy, and create a more customisable system where users are able to input their own abbreviations and specify into which variety of English (currently available: British or American) they wish to normalise.

2016

Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora
Andrew Caines | Christian Bentz | Calbert Graham | Tim Polzehl | Paula Buttery
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.

Automated speech-unit delimitation in spoken learner English
Russell Moore | Andrew Caines | Calbert Graham | Paula Buttery
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In order to apply computational linguistic analyses and pass information to downstream applications, transcriptions of speech obtained via automatic speech recognition (ASR) need to be divided into smaller meaningful units, in a task we refer to as ‘speech-unit (SU) delimitation’. We closely recreate the automatic delimitation system described by Lee and Glass (2012), ‘Sentence detection using multiple annotations’, Proceedings of INTERSPEECH, which combines a prosodic model, language model and speech-unit length model in log-linear fashion. Since state-of-the-art natural language processing (NLP) tools have been developed to deal with written text and its characteristic sentence-like units, SU delimitation helps bridge the gap between ASR and NLP, by normalising spoken data into a more canonical format. Previous work has focused on native speaker recordings; we test the system of Lee and Glass (2012) on non-native speaker (or ‘learner’) data, achieving performance above the state of the art. We also consider alternative evaluation metrics which move away from the idea of a single ‘truth’ in SU delimitation, and frame this work in the context of downstream NLP applications.

Predicting Author Age from Weibo Microblog Posts
Wanru Zhang | Andrew Caines | Dimitrios Alikaniotis | Paula Buttery
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

2014

The effect of disfluencies and learner errors on the parsing of spoken learner language
Andrew Caines | Paula Buttery
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

2012

Reclassifying subcategorization frames for experimental analysis and stimulus generation
Paula Buttery | Andrew Caines
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Researchers in the fields of psycholinguistics and neurolinguistics increasingly test their experimental hypotheses against probabilistic models of language. VALEX (Korhonen et al., 2006) is a large-scale verb lexicon that specifies verb usage as probability distributions over a set of 163 verb subcategorization frames (SCFs). VALEX has proved to be a popular computational linguistic resource and may also be used by psycho- and neurolinguists for experimental analysis and stimulus generation. However, a probabilistic model based upon a set of 163 SCFs often proves too fine grained for experimenters in these fields. Our goal is to simplify the classification by grouping the frames into genera: explainable clusters that may be used as experimental parameters. We adopted two methods for reclassification. One was a manual linguistic approach derived from verb argumentation and clause features; the other was an automatic, computational approach driven from a graphical representation of SCFs. The premise was not only to compare the results of two quite different methods for our own interest, but also to enable other researchers to choose whichever reclassification better suited their purpose (one being grounded purely in theoretical linguistics and the other in practical language engineering). The various classifications are available as an online resource to researchers.

Annotating progressive aspect constructions in the spoken section of the British National Corpus
Andrew Caines | Paula Buttery
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a set of stand-off annotations for the ninety thousand sentences in the spoken section of the British National Corpus (BNC) which feature a progressive aspect verb group. These annotations may be matched to the original BNC text using the supplied document and sentence identifiers. The annotated features mostly relate to linguistic form: subject type, subject person and number, form of auxiliary verb, and clause type, tense and polarity. In addition, the sentences are classified for register, the formality of recording context: three levels of ‘spontaneity’ with genres such as sermons and scripted speech at the most formal level and casual conversation at the least formal. The resource has been designed so that it may easily be augmented with further stand-off annotations. Expert linguistic annotations of spoken data, such as these, are valuable for improving the performance of natural language processing tools in the spoken language domain and assist linguistic research in general.

2010

You Talking to Me? A Predictive Model for Zero Auxiliary Constructions
Andrew Caines | Paula Buttery
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

Co-authors

Christopher Davis 3

Richard Diehl Martinez 3

Zébulon Goriely 3

Alice Hutchings 3

Suchir Salhan 3

Christian Bentz 2

Orphee De Clercq 2

Calbert Graham 2

Murathan Kurfali 2

Hope McGovern 2

Diane Nicholls 2

Pascual Pérez-Paredes 2

Elena Volodina 2

Robert Östling 2

David Ifeoluwa Adelani 1

David O. Ademuyiwa 1

Idris Akinade 1

Jesujoba Alabi 1

Dimitrios Alikaniotis 1

Øistein E. Andersen 1

Taylor Arnold 1

Israel Abebe Azime 1

Nicolas Ballier 1

Rachel Bawden 1

Mark Brenchley 1

Kelvin Wey Han Chan 1

Hannah Craighead 1

Chris Irwin Davis 1

Helena Edmondson 1

Elizaveta Ershova 1

Cristina España-Bonet 1

Jennifer-Carmen Frey 1

Thomas Gaillat 1

Diana Galván-Sosa 1

Bianca-Mihaela Ganescu 1

Gabrielle Gaudeau 1

Clive Hamilton 1

Dietrich Klakow 1

Joni Kruijsbergen 1

Tom Lippincott 1

Roddy MacSween 1

Arianna Masciolini 1

Michael McCarthy 1

Pushkar Mishra 1

Russell Moore 1

Shamsuddeen Hassan Muhammad 1

Ricardo Muñoz Sánchez 1

Clement Oyeleke Odoje 1

Erin Pacquetet 1

Sergio Pastrana 1

Dominik Pfütze 1

Roman Rietsche 1

Alexandr Rosen 1

Cornelius Schramm 1

Ekaterina Shutova 1

Shiva Taslimipoor 1

Olivia Thomas 1

Georgios Velentzas 1

Olga Vinogradova 1

Thiemo Wambsganss 1

Torsten Zesch 1

Miaoran Zhang 1

Venues