We explore the use of discourse parsers for extracting a specific discourse structure in a real-world social media scenario. Specifically, we focus on enhancing parser performance through the integration of synthetic data generated by large language models (LLMs). We conduct experiments using a newly developed dataset of 1,170 local RST discourse structures, comprising 900 synthetic and 270 gold examples, covering three social media platforms: online news comment sections, a discussion forum (Reddit), and a microblogging platform (Twitter). Our primary goal is to assess the impact of LLM-generated synthetic training data on parser performance in a raw-text setting without pre-identified discourse units. While both top-down and bottom-up RST architectures benefit greatly from synthetic data, challenges remain in classifying evaluative discourse structures.
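As a rough illustration of this kind of augmentation setup (not the study's actual pipeline), the sketch below shows how one might prompt an LLM for synthetic local RST examples and mix them with gold data before training. The `call_llm` helper, the prompt wording, the relation subset, and the data format are all hypothetical placeholders.

```python
# A minimal sketch of augmenting gold RST training data with LLM-generated
# examples, under assumed formats. `call_llm` is a hypothetical stand-in
# for whichever LLM API is used; the prompt and canned reply are illustrative.
import json
import random

RELATIONS = ["EVALUATION", "CAUSE", "ELABORATION"]  # illustrative subset

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real implementation would query an LLM API.
    return '{"nucleus": "Great article.", "satellite": "It explains the issue clearly."}'

def synthesize(relation: str) -> dict:
    # Ask the model for two spans linked by the target RST relation.
    prompt = (
        f"Write two short social-media text spans connected by the RST "
        f"relation {relation}. Return JSON with keys 'nucleus' and 'satellite'."
    )
    spans = json.loads(call_llm(prompt))
    return {"relation": relation, **spans, "source": "synthetic"}

def build_training_set(gold: list[dict], n_synthetic: int) -> list[dict]:
    # Combine gold annotations with synthetic examples and shuffle them
    # so batches mix both sources during parser training.
    synthetic = [synthesize(random.choice(RELATIONS)) for _ in range(n_synthetic)]
    data = gold + synthetic
    random.shuffle(data)
    return data
```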
This study introduces an approach for evaluating the importance, in discourse parsing, of the signals proposed by Das and Taboada. Previous studies indicate that discourse markers (DMs) are not consistently reliable cues and can act as distractors, complicating relation recognition. The study explores the effectiveness of alternative signal types, such as syntactic and genre-related signals, showing that they can be effective even when they are not the predominant signals for specific relations. An experiment incorporating RST signals as features in a parser error/success prediction model demonstrates their relevance and provides insight into which signal combinations prevent (or facilitate) accurate relation recognition. The observations also identify challenges and potential confusion posed by specific signals. The study's code and data are publicly available, providing accessible resources for research on RST signals in discourse parsing.
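To make the error/success prediction experiment concrete, here is a minimal sketch of the general idea: encode which signal types accompany each relation instance and fit an interpretable classifier over parser correctness. The feature names, toy instances, and labels are hypothetical, not the study's actual feature set.

```python
# A minimal sketch: signal-type indicators as features for predicting
# whether the parser recognizes a relation correctly. Field names and
# the toy data below are illustrative placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One dict per relation instance: which signal types co-occur with it.
instances = [
    {"dm": 1, "syntactic": 0, "genre": 1, "relation": "EVALUATION"},
    {"dm": 0, "syntactic": 1, "genre": 0, "relation": "CAUSE"},
    # ... one dict per annotated relation instance
]
labels = [0, 1]  # 1 = parser classified the relation correctly

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(instances)

# Logistic regression keeps the per-signal weights inspectable, which is
# what supports the "which signal combinations help or hurt" analysis.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)
for name, weight in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    print(f"{name}: {weight:+.2f}")
```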
Discourse parsers have attracted considerable interest in recent natural language processing applications. This approach moves beyond traditional sentence boundaries and can extend to the identification of discourse relations. Several parsers specialize in automatic discourse processing, but they have mainly been evaluated on English corpora. It is therefore unclear which linguistic cues parsers rely on to classify discourse relations outside of English. This article evaluates the performance of the DMRST parser on the RST-DT corpus translated into French. We find that discourse relation classification performance in French is comparable to that reported for other languages. By analyzing the successes and failures of relation classification, we highlight the impact of discourse markers and syntactic structures on parser accuracy.
Polysemies, or “colexifications”, are of great interest in cognitive and historical linguistics, since meanings that are frequently expressed by the same lexeme are likely to be conceptually similar, and lie along a common pathway of semantic change. We argue that these types of inferences can be drawn more reliably from polysemies of cognate sets (which we call “dialexifications”) than from polysemies of lexemes. After giving a precise definition of dialexification, we introduce Evosem, a cross-linguistic database of etymologies scraped from several online sources. Based on this database, we measure, for each pair of senses, how many cognate sets include them both, i.e. how often this pair of senses is “dialexified”. This allows us to construct a weighted dialexification graph for any set of senses, indicating the conceptual and historical closeness of each pair. We also present an online interface for browsing our database, including graphs and interactive tables. We then discuss potential applications to NLP tasks and to linguistic research.
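The dialexification count itself is a simple pairwise co-occurrence measure, sketched below under assumed data: each cognate set is reduced to the set of senses attested among its reflexes, and an edge's weight is the number of cognate sets containing both of its senses. The cognate-set IDs and sense labels are toy placeholders, not actual Evosem records.

```python
# A minimal sketch of building a weighted dialexification graph:
# edge weight = number of cognate sets in which both senses co-occur.
# The data below is a toy placeholder for the Evosem database.
from itertools import combinations
from collections import Counter

# Each cognate set maps to the senses attested among its reflexes.
cognate_sets = {
    "set_001": {"MOON", "MONTH"},
    "set_002": {"MOON", "MONTH", "MEASURE"},
    "set_003": {"FIRE", "ANGER"},
}

# Count every unordered sense pair once per cognate set.
edge_weights = Counter()
for senses in cognate_sets.values():
    for pair in combinations(sorted(senses), 2):
        edge_weights[pair] += 1

# e.g. MONTH -- MOON is dialexified in 2 cognate sets here.
for (a, b), w in sorted(edge_weights.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: dialexified in {w} cognate sets")
```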