Pierre Magistry

2025

A corpus-driven description of OV order in Archaic Chinese
Qishen Wu | Santiago Herrera | Pierre Magistry | Sylvain Kahane
Proceedings of the Eighth International Conference on Dependency Linguistics (Depling, SyntaxFest 2025)

This paper presents a quantitative study of Object‐Verb (OV) order in Archaic Chinese based on a Universal Dependencies (UD) treebanks. Treating word order as a binary choice (OV vs VO), we train a sparse logistic‐regression classifier that selects the most salient syntactic features needed for an accurate prediction to investigate the specific syntactic contexts allowing OV word order and to identify to what extent do these factors favour this order. The ranked features are understood as interpretable rules, and their coverage and precision as quantitative properties of each rule. The approach confirms earlier qualitative findings (e.g. pronoun object fronting and negation favour OV) and uncovers new contrasts in word order between different reflexive pronouns. It also identifies annotation errors that we corrected in the final analysis, illustrating how the quantitative models, combined with fine-grained corpus analysis, can improve treebank quality. Our study demonstrates that lightweight machine‐learning techniques applied to an existing syntactic resource can reveal fine‐grained patterns in historical word order and this can be reapplied to other languages.

pdf bib abs

La science participative et l’ANR DiLSi
Pierre Magistry | Ilaine Wang
Actes de l'atelier Science Participative pour les Données et Corpus Linguistiques 2025 (ParCol)

Cette communication propose un retour d’expérience sur les interactions entre le projet DiLSi et les communautés de locuteurs du teochew de la diaspora et du tâigí.

pdf bib abs

TinyMentalLLMs Enable Depression Detection in Chinese Social Media Texts
Jinyuan Xu | Tian Lan | Mathieu Valette | Pierre Magistry | Lei Li
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Depression remains a major global mental health concern, bringing a higher risk of suicide and growing social costs tied to mental disorders. Leveraging social media as a valuable source of emotional signals, we identify two limitations in current NLP-based depression detection frameworks: (1) prediction systems often lack clear, user-friendly explanations for predictions in Depression Detection, and (2) the computational and confidentiality demands of LLMs are misaligned with the need for dependable, privacy-focused small-scale deployments. To address these challenges, we introduce TinyMentalLLMs (TMLs), a compact framework that offers two key contributions: (a) the construction of a small yet representative dataset through psychology-based textometry, and (b) an efficient fine-tuning strategy centered on multiple aspects of depression. This design improves both accuracy and F1 scores in generative models with 0.5B and 1.5B parameters, consistently yielding over 20% performance gains across datasets. TMLs achieve results on par with, and deliver better text quality than, much larger state-of-the-art models.

2024

pdf bib abs

Experiments on Speech Synthesis for Teochew, Can Taiwanese Help ?
Pierre Magistry | Ilaine Wang | Ty Eng Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper reports on our preliminary experiments in speech processing for Teochew, an under-resourced Sinitic language spoken both in China and around the world in diasporan communities. Following the recent uptick of interest in Teochew from heritage speakers of the diaspora and in order to respond to the needs of this community, we develop a Teochew Text-to-Speech system. We describe experiments to build this system and to assess the possible contribution of available resources in Taiwanese Hokkien, the closest language with a significant body of resources. The results of these experiments are not as conclusive as we expected: the Taiwanese dataset did not help our model significantly, but considering our objectives, we find it encouraging that they show that a large training dataset was not necessary for this precise task. A promising model could still be obtained with only a small dataset of Teochew. We hope that this work inspires other communities of speakers of languages in a revitalization phase.

2023

pdf bib abs

Ertim at SemEval-2023 Task 2: Fine-tuning of Transformer Language Models and External Knowledge Leveraging for NER in Farsi, English, French and Chinese
Kevin Deturck | Pierre Magistry | Bénédicte Diot-Parvaz Ahmad | Ilaine Wang | Damien Nouvel | Hugo Lafayette
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Transformer language models are now a solid baseline for Named Entity Recognition and can be significantly improved by leveraging complementary resources, either by integrating external knowledge or by annotating additional data. In a preliminary step, this work presents experiments on fine-tuning transformer models. Then, a set of experiments has been conducted with a Wikipedia-based reclassification system. Additionally, we conducted a small annotation campaign on the Farsi language to evaluate the impact of additional data. These two methods with complementary resources showed improvements compared to fine-tuning only.

2022

pdf bib abs

(Re-)Digitizing 吳守禮 Ngôo Siú-lé’s Mandarin – Taiwanese Dictionary
Pierre Magistry | Afala Phaxay
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

This paper presents the efforts conducted to obtain a usable and open digital version in XML-TEI of one of the major lexicographic work for bilingual Taiwanese dictionaries, namely the 《國臺對照活用辭典》(Practical Mandarin-Taiwanese Dictionary) The original dictionary was published in 2000, after decades of work by Prof. 吳守禮 (Ngôo Siu-le/Wu Shouli)

2020

pdf bib abs

Répliquer et étendre pour l’alsacien “Étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux” (Replicating and extending for Alsatian : “POS tagging for low-resource languages by adapting word embeddings”)
Alice Millour | Karën Fort | Pierre Magistry
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

Nous présentons ici les résultats d’un travail de réplication et d’extension pour l’alsacien d’une expérience concernant l’étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux (Magistry et al., 2018). Ce travail a été réalisé en étroite collaboration avec les auteurs de l’article d’origine. Cette interaction riche nous a permis de mettre au jour les éléments manquants dans la présentation de l’expérience, de les compléter, et d’étendre la recherche à la robustesse à la variation.

Pierre Magistry

2025

2024

2023

2022

2020

2018

2016

2014

2013

2012

2011

2010

2009

Co-authors

Venues