Paul Piwek

2025

Coherence of Argumentative Dialogue Snippets: A New Method for Large Scale Evaluation with an Application to Inference Anchoring Theory
Paul Piwek | Jacopo Amidei | Svetlana Stoyanchev
Findings of the Association for Computational Linguistics: EMNLP 2025

This paper introduces a novel method for testing the components of theories of (dialogue) coherence through utterance substitution. The method is described and then applied to Inference Anchoring Theory (IAT) in a large scale experimental study with 933 dialogue snippets and 87 annotators. IAT has been used for substantial corpus annotation and practical applications. To address the aim of finding out if and to what extent two aspects of IAT – illocutionary acts and propositional relations – contribute to dialogue coherence, we designed an experiment for systematically comparing the coherence ratings for several variants of short debate snippets. The comparison is between original human-human debate snippets, snippets generated with an IAT-compliant algorithm and snippets produced with ablated versions of the algorithm. This allows us to systematically compare snippets that have identical underlying structures as well as IAT-deficient structures with each other. We found that propositional relations do impact on dialogue coherence (at a statistically highly significant level) whereas we found no such effect for illocutionary act expression. This result suggests that fine-grained inferential relations impact on dialogue coherence, complementing the higher-level coherence structures of, for instance, Rhetorical Structure Theory.

pdf bib abs

Mention detection with LLMs in pair-programming dialogue
Cecilia Domingo | Paul Piwek | Svetlana Stoyanchev | Michel Wermelinger
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

We tackle the task of mention detection for pair-programming dialogue, a setting which adds several challenges to the task due to the characteristics of natural dialogue, the dynamic environment of the dialogue task, and the domain-specific vocabulary and structures. We compare recent variants of the Llama and GPT families and explore different prompt and context engineering approaches. While aspects like hesitations and references to read-out code and variable names made the task challenging, GPT 4.1 approximated human performance when we provided few-shot examples similar to the inference text and corrected formatting errors.

pdf bib abs

Human ratings of LLM response generation in pair-programming dialogue
Cecilia Domingo | Paul Piwek | Svetlana Stoyanchev | Michel Wermelinger | Kaustubh Adhikari | Rama Sanand Doddipatla
Proceedings of the 18th International Natural Language Generation Conference

We take first steps in exploring whether Large Language Models (LLMs) can be adapted to dialogic learning practices, specifically pair programming — LLMs have primarily been implemented as programming assistants, not fully exploiting their dialogic potential. We used new dialogue data from real pair-programming interactions between students, prompting state-of-the-art LLMs to assume the role of a student, when generating a response that continues the real dialogue. We asked human annotators to rate human and AI responses on the criteria through which we operationalise the LLMs’ suitability for educational dialogue: Coherence, Collaborativeness, and whether they appeared human. Results show model differences, with Llama-generated responses being rated similarly to human answers on all three criteria. Thus, for at least one of the models we investigated, the LLM utterance-level response generation appears to be suitable for pair-programming dialogue.

Paul Piwek

2025

2022

2020

2019

2018

2017

2016

2013

2012

2011

2010

2008

2007

2006

2003

2002

2000

Co-authors

Venues