Chuyuan Li - ACL Anthology

Chuyuan Li

2026

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
Chuyuan Li | Giuseppe Carenini
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., just), and also aggregates a shared-task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier reasoning model GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibits strong performance in arithmetic aspect of temporal reasoning, but they struggle with long-dependency reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation classification.

2025

Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)

CEMTM: Contextual Embedding-based Multimodal Topic Modeling
Amirhossein Abaskohi | Raymond Li | Chuyuan Li | Shafiq Joty | Giuseppe Carenini
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

On Large Foundation Models and Alzheimer’s Disease Detection
Chuyuan Li | Giuseppe Carenini | Thalia Field
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Large Foundation Models have displayed incredible capabilities in a wide range of domains and tasks. However, it is unclear whether these models match specialist capabilities without special training or fine-tuning. In this paper, we investigate the innate ability of foundation models as neurodegenerative disease specialists. Precisely, we use a language model, Llama-3.1, and a visual language model, Llama3-LLaVA-NeXT, to detect language specificity between Alzheimer’s Disease patients and healthy controls through a well-known Picture Description task. Results show that Llama is comparable to supervised classifiers, while LLaVA, despite its additional “vision”, lags behind.

Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao | Xiang Zhang | Raymond Li | Jiaqi Wei | Chuyuan Li | Shafiq Joty | Giuseppe Carenini
Proceedings of The 5th New Frontiers in Summarization Workshop

Recent advances in test-time scaling have shown promising results in improving large language model performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in reasoning tasks, its application to natural language generation tasks, particularly summarization, remains unexplored.Among all of the generation tasks, multi-document summarization (MDS) presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more complicated approach to prompt design and ensemble methods, as no single “best-overall” prompt can satisfy diverse summarization requirements. The inherent diversity in summarization needs necessitates exploring how different prompting strategies can be systematically combined to improve performance.We propose a novel framework that harnesses prompt diversity to enhance MDS performance. Our approach generates multiple candidate summaries using carefully designed prompt variations, then ensemble them through sophisticated aggregation methods to produce refined summaries. This prompt diversity enables models to capture different aspects and perspectives of the source documents, leading to more comprehensive and higher-quality summaries. To evaluate our method effectively, we also introduce two new LLM-based metrics: the Preference Alignment Score (PAS) and LLM Atom-Content-Unit score (LLM-ACU), which assess summary quality while addressing the positional bias inherent in automatic evaluations performed by LLMs.Our experiments demonstrate that leveraging prompt diversity significantly enhances summary quality, while also revealing the practical scaling boundaries for MDS tasks.

Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
Chuyuan Li | Austin Xu | Shafiq Joty | Giuseppe Carenini
Findings of the Association for Computational Linguistics: EMNLP 2025

A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models (LLMs) have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.

Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer’s Disease Detection
Chuyuan Li | Raymond Li | Thalia S. Field | Giuseppe Carenini
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal “representatives” for a given input.Experiments on two AD detection datasets across three models demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.

Explicit Bayesian Inference to Uncover the Latent Themes of Large Language Models
Raymond Li | Chuyuan Li | Gabriel Murray | Giuseppe Carenini
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have demonstrated impressive generative capabilities, yet their inner mechanisms remain largely opaque. In this work, we introduce a novel approach to interpret LLMs generation process through the lens of an explicit Bayesian framework by inferring latent topic variables via variational inference. Specifically, we leverage a variational autoencoder-based neural topic model to dynamically approximate the posterior distribution over the high-level latent topic variables at each generation step. By reconstructing the LLM’s next-token predictions through these latent topics and maintaining a regularized latent space, our method yields interpretable and diverse topic representations but also has the ability to effectively captures semantic shifts throughout the text. We validate our approach on multiple datasets, showing that our latent topics outperform state-of-the-art topic models on intrinsic measures of coherence and diversity. Furthermore, we demonstrate the utility of our approach in downstream applications by using the inferred topic distributions to retrieve relevant demonstration examples for in-context learning, resulting in significant gains on classification and summarization tasks.

Hierarchical Attention Adapter for Abstractive Dialogue Summarization
Raymond Li | Chuyuan Li | Gabriel Murray | Giuseppe Carenini
Proceedings of The 5th New Frontiers in Summarization Workshop

Dialogue summarization is still a very challenging task even for large language models (LLMs). On the one hand, some previous approaches have pre-trained language models specifically for dialogue understanding and summarization, but they have been limited to relatively small models. On the other hand, other works have tried to directly exploit the dialogue semantics and discourse structures in their modeling effort, but by construction, they require access to those structures, which is in itself a largely unsolved problem. In this paper, we synergistically combine these two ideas in an approach that can be seamlessly integrated into the decoder-only architecture adopted by the most state-of-the-art LLMs. In particular, our novel solution leverages the parameter-efficient fine-tuning (PEFT) paradigm to model the hierarchical structure of dialogues, where input sequences are naturally segmented into dialogue turns, and then fine-tune the model for abstractive summarization. From experiments on two datasets, we find that Hierarchical Attention Adapter outperforms all baseline adapter methods on SummScreen, where our approach can also be combined with LoRA to achieve the best performance on SamSum.

Proceedings of the 4th Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2025)
Chloé Braud | Yang Janet Liu | Philippe Muller | Amir Zeldes | Chuyuan Li
Proceedings of the 4th Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2025)

Discourse Relation Recognition with Language Models Under Different Data Availability
Shuhaib Mehri | Chuyuan Li | Giuseppe Carenini
Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)

Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks, yet they continue to face challenges in discourse relation recognition (DRR). Current state-of-the-art methods for DRR primarily rely on smaller pre-trained language models (PLMs). In this study, we conduct a comprehensive analysis of different approaches using both PLMs and LLMs, evaluating their effectiveness for DRR at multiple granularities and under different data availability settings. Our findings indicate that no single approach consistently outperforms the others, and we offer a general comparison framework to guide the selection of the most appropriate model based on specific DRR requirements and data conditions.

The DISRPT 2025 Shared Task on Elementary Discourse Unit Segmentation, Connective Detection, and Relation Classification
Chloé Braud | Amir Zeldes | Chuyuan Li | Yang Janet Liu | Philippe Muller
Proceedings of the 4th Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2025)

In 2025, we held the fourth iteration of the DISRPT Shared Task (Discourse Relation Parsing and Treebanking) dedicated to discourse parsing across formalisms. Following the success of the 2019, 2021, and 2023 tasks on Elementary Discourse Unit Segmentation, Connective Detection, and Relation Classification, this iteration added 13 new datasets, including three new languages (Czech, Polish, Nigerian Pidgin) and two new frameworks: the ISO framework and Enhanced Rhetorical Structure Theory, in addition to the previously included frameworks: RST, SDRT, DEP, and PDTB. In this paper, we review the data included in DISRPT 2025, which covers 39 datasets across 16 languages, survey and compare submitted systems, and report on system performance on each task for both treebanked and plain-tokenized versions of the data. The best systems obtain a mean accuracy of 71.19% for relation classification, a mean F1 of 91.57 (Treebanked Track) and 87.38 (Plain Track) for segmentation, and a mean F1 of 81.53 (Treebanked Track) and 79.92 (Plain Track) for connective identification. The data and trained models of several participants can be found at https://huggingface.co/multilingual-discourse-hub.

2024

Dialogue Discourse Parsing as Generation: A Sequence-to-Sequence LLM-based Approach
Chuyuan Li | Yuwei Yin | Giuseppe Carenini
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Existing works on dialogue discourse parsing mostly utilize encoder-only models and sophisticated decoding strategies to extract structures. Despite recent advances in Large Language Models (LLMs), there has been little work applying directly these models on discourse parsing. To fully utilize the rich semantic and discourse knowledge in LLMs, we explore the feasibility of transforming discourse parsing into a generation task using a text-to-text paradigm. Our approach is intuitive and requires no modification of the LLM architecture. Experimental results on STAC and Molweni datasets show that a sequence-to-sequence model such as T0 can perform reasonably well. Notably, our improved transition-based sequence-to-sequence system achieves new state-of-the-art performance on Molweni, demonstrating the effectiveness of the proposed method. Furthermore, our systems can generate richer discourse structures such as directed acyclic graphs, whereas previous methods are limited to trees.

Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)
Michael Strube | Chloe Braud | Christian Hardmeier | Junyi Jessy Li | Sharid Loaiciga | Amir Zeldes | Chuyuan Li
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

Discourse Relation Prediction and Discourse Parsing in Dialogues with Minimal Supervision
Chuyuan Li | Chloé Braud | Maxime Amblard | Giuseppe Carenini
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

Discourse analysis plays a crucial role in Natural Language Processing, with discourse relation prediction arguably being the most difficult task in discourse parsing. Previous studies have generally focused on explicit or implicit discourse relation classification in monologues, leaving dialogue an under-explored domain. Facing the data scarcity issue, we propose to leverage self-training strategies based on a Transformer backbone. Moreover, we design the first semi-supervised pipeline that sequentially predicts discourse structures and relations. Using 50 examples, our relation prediction module achieves 58.4 in accuracy on the STAC corpus, close to supervised state-of-the-art. Full parsing results show notable improvements compared to the supervised models both in-domain (gaming) and cross-domain (technical chat), with better stability.

Coherence-based Dialogue Discourse Structure Extraction using Open-Source Large Language Models
Gaetano Cimino | Chuyuan Li | Giuseppe Carenini | Vincenzo Deufemia
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Despite the challenges posed by data sparsity in discourse parsing for dialogues, unsupervised methods have been underexplored. Leveraging recent advances in Large Language Models (LLMs), in this paper we investigate an unsupervised coherence-based method to build discourse structures for multi-party dialogues using open-source LLMs fine-tuned on conversational data. Specifically, we propose two algorithms that extract dialogue structures by identifying their most coherent sub-dialogues: DS-DP employs a dynamic programming strategy, while DS-FLOW applies a greedy approach. Evaluation on the STAC corpus demonstrates a micro-F1 score of 58.1%, surpassing prior unsupervised methods. Furthermore, on a cleaned subset of the Molweni corpus, the proposed method achieves a micro-F1 score of 74.7%, highlighting its effectiveness across different corpora.

2023

Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues
Chuyuan Li | Patrick Huber | Wen Xiao | Maxime Amblard | Chloe Braud | Giuseppe Carenini
Findings of the Association for Computational Linguistics: EACL 2023

Discourse processing suffers from data sparsity, especially for dialogues. As a result, we explore approaches to infer latent discourse structures for dialogues, based on attention matrices from Pre-trained Language Models (PLMs). We investigate multiple auxiliary tasks for fine-tuning and show that the dialogue-tailored Sentence Ordering task performs best. To locate and exploit discourse information in PLMs, we propose an unsupervised and a semi-supervised method. Our proposals thereby achieve encouraging results on the STAC corpus, with F1 scores of 57.2 and 59.3 for the unsupervised and semi-supervised methods, respectively. When restricted to projective trees, our scores improved to 63.3 and 68.1.

2022

Quantification Annotation in ISO 24617-12, Second Draft
Harry Bunt | Maxime Amblard | Johan Bos | Karën Fort | Bruno Guillaume | Philippe de Groote | Chuyuan Li | Pierre Ludmann | Michel Musiol | Siyana Pavlova | Guy Perrier | Sylvain Pogodalla
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes the continuation of a project that aims at establishing an interoperable annotation schema for quantification phenomena as part of the ISO suite of standards for semantic annotation, known as the Semantic Annotation Framework. After a break, caused by the Covid-19 pandemic, the project was relaunched in early 2022 with a second working draft of an annotation scheme, which is discussed in this paper. Keywords: semantic annotation, quantification, interoperability, annotation schema, ISO standard

Multi-Task Learning for Depression Detection in Dialogs
Chuyuan Li | Chloé Braud | Maxime Amblard
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Depression is a serious mental illness that impacts the way people communicate, especially through their emotions, and, allegedly, the way they interact with others. This work examines depression signals in dialogs, a less studied setting that suffers from data sparsity. We hypothesize that depression and emotion can inform each other, and we propose to explore the influence of dialog structure through topic and dialog act prediction. We investigate a Multi-Task Learning (MTL) approach, where all tasks mentioned above are learned jointly with dialog-tailored hierarchical modeling. We experiment on the DAIC and DailyDialog corpora – both contain dialogs in English – and show important improvements over state-of-the-art on depression detection (at best 70.6% F1), which demonstrates the correlation of depression with emotion and dialog organization and the power of MTL to leverage information from different sources.

2021

Investigating non lexical markers of the language of schizophrenia in spontaneous conversations
Chuyuan Li | Maxime Amblard | Chloé Braud | Caroline Demily | Nicolas Franck | Michel Musiol
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

We investigate linguistic markers associated with schizophrenia in clinical conversations by detecting predictive features among French-speaking patients. Dealing with human-human dialogues makes for a realistic situation, but it calls for strategies to represent the context and face data sparsity. We compare different approaches for data representation – from individual speech turns to entire conversations –, and data modeling, using lexical, morphological, syntactic, and discourse features, dimensions presumed to be tightly connected to the language of schizophrenia. Previous English models were mostly lexical and reached high performance, here replicated (93.7% acc.). However, our analysis reveals that these models are heavily biased, which probably concerns most datasets on this task. Our new delexicalized models are more general and robust, with the best accuracy score at 77.9%.

2020

Investigation par méthodes d’apprentissage des spécificités langagières propres aux personnes avec schizophrénie (Investigating Learning Methods Applied to Language Specificity of Persons with Schizophrenia)
Maxime Amblard | Chloé Braud | Chuyuan Li | Caroline Demily | Nicolas Franck | Michel Musiol
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Nous présentons des expériences visant à identifier automatiquement des patients présentant des symptômes de schizophrénie dans des conversations contrôlées entre patients et psychothérapeutes. Nous fusionnons l’ensemble des tours de parole de chaque interlocuteur et entraînons des modèles de classification utilisant des informations lexicales, morphologiques et syntaxiques. Cette étude est la première du genre sur le français et obtient des résultats comparables à celles sur l’anglais. Nos premières expériences tendent à montrer que la parole des personnes avec schizophrénie se distingue de celle des témoins : le meilleur modèle obtient une exactitude de 93,66%. Des informations plus riches seront cependant nécessaires pour parvenir à un modèle robuste.

2019

Characterizing the Response Space of Questions: a Corpus Study for English and Polish
Jonathan Ginzburg | Zulipiye Yusupujiang | Chuyuan Li | Kexin Ren | Paweł Łupkowski
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

The main aim of this paper is to provide a characterization of the response space for questions using a taxonomy grounded in a dialogical formal semantics. As a starting point we take the typology for responses in the form of questions provided in (Lupkowski and Ginzburg, 2016). This work develops a wide coverage taxonomy for question/question sequences observable in corpora including the BNC, CHILDES, and BEE, as well as formal modelling of all the postulated classes. Our aim is to extend this work to cover all responses to questions. We present the extended typology of responses to questions based on a corpus studies of BNC, BEE and Maptask with include 506, 262, and 467 question/response pairs respectively. We compare the data for English with data from Polish using the Spokes corpus (205 question/response pairs). We discuss annotation reliability and disagreement analysis. We sketch how each class can be formalized using a dialogical semantics appropriate for dialogue management.

Venues

JEP/TALN/RECITAL1