Jean-Philippe Goldman

2025

Bratly: A Python Extension for BRAT Functionalities
Jamil Zaghir | Jean-Philippe Goldman | Nikola Bjelogrlic | Mina Bjelogrlic | Christian Lovis
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

BRAT is a widely used web-based text annotation tool. However, it lacks robust Python support for effective annotation management and processing. We present Bratly, an open-source extension of BRAT that introduces a solid Python backend, enabling advanced functionalities such as annotation typings, collection typings with statistical insights, corpus and annotation handling, object modifications, and entity-level evaluation based on MUC-5 standards. These enhancements streamline annotation workflows, improve usability, and facilitate high-quality NLP research. This paper outlines the system’s architecture, functionalities and evaluation, positioning it as a valuable BRAT extension for its users. The tool is open-source, and the NLP community is welcome to suggest improvements.

2024

pdf bib

pdf bib abs

FRASIMED: A Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Jamil Zaghir | Mina Bjelogrlic | Jean-Philippe Goldman | Soukaïna Aananou | Christophe Gaudet-Blavignac | Christian Lovis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs) where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection and is freely available on GitHub (link: https://github.com/JamilProg/crosslingual_bert_annotation_projection). Leveraging a language agnostic BERT-based approach, it is an efficient solution to increase low-resource corpora with few human efforts and by only using already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to evaluating the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2’051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French natural language processing (NLP) applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.

pdf bib abs

NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries
Simona Emilova Doneva | Tilia Ellendorff | Beate Sick | Jean-Philippe Goldman | Amelia Elaine Cannon | Gerold Schneider | Benjamin Victor Ineichen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Extracting and aggregating information from clinical trial registries could provide invaluable insights into the drug development landscape and advance the treatment of neurologic diseases. However, achieving this at scale is hampered by the volume of available data and the lack of an annotated corpus to assist in the development of automation tools. Thus, we introduce NeuroTrialNER, a new and fully open corpus for named entity recognition (NER). It comprises 1093 clinical trial summaries sourced from ClinicalTrials.gov, annotated for neurological diseases, therapeutic interventions, and control treatments. We describe our data collection process and the corpus in detail. We demonstrate its utility for NER using large language models and achieve a close-to-human performance. By bridging the gap in data resources, we hope to foster the development of text processing tools that help researchers navigate clinical trials data more easily.

2018

pdf bib

MIAPARLE: Online training for the discrimination of stress contrasts
Jean-Philippe Goldman | Sandra Schwab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French
Jean-Philippe Goldman | Yves Scherrer | Julie Glikman | Mathieu Avanzi | Christophe Benzitoun | Philippe Boula de Mareüil
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Strategies and Challenges for Crowdsourcing Regional Dialect Perception Data for Swiss German and Swiss French
Jean-Philippe Goldman | Simon Clematide | Mathieu Avanzi | Raphael Tandler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs

Cartopho : un site web de cartographie de variantes de prononciation en français (Cartopho: a website for mapping pronunciation variants in French)
Philippe Boula de Mareüil | Jean-Philippe Goldman | Albert Rilliard | Yves Scherrer | Frédéric Vernier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

Le présent travail se propose de renouveler les traditionnels atlas dialectologiques pour cartographier les variantes de prononciation en français, à travers un site internet. La toile est utilisée non seulement pour collecter des données, mais encore pour disséminer les résultats auprès des chercheurs et du grand public. La méthodologie utilisée, à base de crowdsourcing (ou « production participative »), nous a permis de recueillir des informations auprès de 2500 francophones d’Europe (France, Belgique, Suisse). Une plateforme dynamique à l’interface conviviale a ensuite été développée pour cartographier la prononciation de 70 mots dans les différentes régions des pays concernés (des mots notamment à voyelle moyenne ou dont la consonne finale peut être prononcée ou non). Les options de visualisation par département/canton/province ou par région, combinant plusieurs traits de prononciation et ensembles de mots, sous forme de pastilles colorées, de hachures, etc. sont présentées dans cet article. On peut ainsi observer immédiatement un /E/ plus fermé (ainsi qu’un /O/ plus ouvert) dans le Nord-Pas-de-Calais et le sud de la France, pour des mots comme parfait ou rose, un /Œ/ plus fermé en Suisse pour un mot comme gueule, par exemple.

2014

pdf bib abs

The main objective of the Rhapsodie project (ANR Rhapsodie 07 Corp-030-01) was to define rich, explicit, and reproducible schemes for the annotation of prosody and syntax in different genres (Â± spontaneous, Â± planned, face-to-face interviews vs. broadcast, etc.), in order to study the prosody/syntax/discourse interface in spoken French, and their roles in the segmentation of speech into discourse units (Lacheret, Kahane, & Pietrandrea forthcoming). We here describe the deliverable, a syntactic and prosodic treebank of spoken French, composed of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33000 words), orthographically and phonetically transcribed. The transcriptions and the annotations are all aligned on the speech signal: phonemes, syllables, words, speakers, overlaps. This resource is freely available at www.projet-rhapsodie.fr. The sound samples (wav/mp3), the acoustic analysis (original F0 curve manually corrected and automatic stylized F0, pitch format), the orthographic transcriptions (txt), the microsyntactic annotations (tabular format), the macrosyntactic annotations (txt, tabular format), the prosodic annotations (xml, textgrid, tabular format), and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France. The metadata are encoded in the IMDI-CMFI format and can be parsed on line.

pdf bib abs

A Crowdsourcing Smartphone Application for Swiss German: Putting Language Documentation in the Hands of the Users
Jean-Philippe Goldman | Adrian Leeman | Marie-José Kolly | Ingrid Hove | Ibrahim Almajai | Volker Dellwo | Steven Moran
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This contribution describes an on-going projects a smartphone application called Voice Ãpp, which is a follow-up of a previous application called Dialäkt Ãpp. The main purpose of both apps is to identify the users Swiss German dialect on the basis of the dialectal variations of 15 words. The result is returned as one or more geographical points on a map. In Dialäkt Ãpp, launched in 2013, the user provides his or her own pronunciation through buttons, while the Voice Ãpp, currently in development, asks users to pronounce the word and uses speech recognition techniques to identify the variants and localize the user. This second app is more challenging from a technical point of view but nevertheless recovers the nature of dialect variation of spoken language. Besides, the Voice Ãpp takes its users on a journey in which they explore the individuality of their own voices, answering questions such as: How high is my voice? How fast do I speak? Do I speak faster than users in the neighbouring city?

pdf bib abs

DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech
George Christodoulides | Mathieu Avanzi | Jean-Philippe Goldman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for French. The system is trained and its performance evaluated on a 57k-token corpus, including different varieties of French spoken in three countries (Belgium, France and Switzerland). DisMo supports a multi-level annotation scheme, in which the tokenisation to minimal word units is complemented with multi-word unit groupings (each having associated POS tags), as well as separate levels for annotating disfluencies and discourse phenomena. We present the systems architecture, linguistic resources and its hierarchical tag-set. Results show that DisMo achieves a precision of 95% (finest tag-set) to 96.8% (coarse tag-set) in POS-tagging non-punctuated, sound-aligned transcriptions of spoken French, while also offering substantial possibilities for automated multi-level annotation.

pdf bib abs

C-PhonoGenre: a 7-hours corpus of 7 speaking styles in French: relations between situational features and prosodic properties
Jean-Philippe Goldman | Tea Pršir | Antoine Auchlin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Phonogenres, or speaking styles, are typified acoustic images associated to types of language activities, causing prosodic and phonostylistic variations. This communication presents a large speech corpus (7 hours) in French, extending a previous work by Goldman et al. (2011a), Simon et al. (2010), with a greater number and complementary repertoire of considered phonogenres. The corpus is available with segmentation at phonetic, syllabic and word levels, as well as manual annotation. Segmentations and annotations were achieved semi-automatically, through a set of Praat implemented tools, and manual steps. The phonogenres are also described with a reduced set of situational dimensions as in Lucci (1983) and Koch & Oesterreichers (2001). A preliminary acoustic study, joining rhythmical comparative measurements (Dellwo 2010) to Goldman et al.s (2007a) ProsoReport, reports acoustic differences between phonogenres.

2011

pdf bib abs

Étude inter-langues de la distribution et des ambiguïtés syntaxiques des pronoms (A study of cross-language distribution and syntactic ambiguities of pronouns)
Lorenza Russo | Yves Scherrer | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Ce travail décrit la distribution des pronoms selon le style de texte (littéraire ou journalistique) et selon la langue (français, anglais, allemand et italien). Sur la base d’un étiquetage morpho-syntaxique effectué automatiquement puis vérifié manuellement, nous pouvons constater que la proportion des différents types de pronoms varie selon le type de texte et selon la langue. Nous discutons les catégories les plus ambiguës de manière détaillée. Comme nous avons utilisé l’analyseur syntaxique Fips pour l’étiquetage des pronoms, nous l’avons également évalué et obtenu une précision moyenne de plus de 95%.

pdf bib abs

La traduction automatique des pronoms. Problèmes et perspectives (Automatic translation of pronouns. Problems and perspectives)
Yves Scherrer | Lorenza Russo | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Dans cette étude, notre système de traduction automatique, Its-2, a fait l’objet d’une évaluation manuelle de la traduction des pronoms pour cinq paires de langues et sur deux corpus : un corpus littéraire et un corpus de communiqués de presse. Les résultats montrent que les pourcentages d’erreurs peuvent atteindre 60% selon la paire de langues et le corpus. Nous discutons ainsi deux pistes de recherche pour l’amélioration des performances de Its-2 : la résolution des ambiguïtés d’analyse et la résolution des anaphores pronominales.

2010

pdf bib abs

Expressive : Génération automatique de parole expressive à partir de données non linguistiques
Olivier Blanc | Noémi Boubel | Jean-Philippe Goldman | Sophie Roekhaut | Anne Catherine Simon | Cédrick Fairon | Richard Beaufort
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons Expressive, un système de génération de parole expressive à partir de données non linguistiques. Ce système est composé de deux outils distincts : Taittingen, un générateur automatique de textes d’une grande variété lexico-syntaxique produits à partir d’une représentation conceptuelle du discours, et StyloPhone, un système de synthèse vocale multi-styles qui s’attache à rendre le discours produit attractif et naturel en proposant différents styles vocaux.

pdf bib abs

FipsColor : grammaire en couleur interactive pour l’apprentissage du français
Jean-Philippe Goldman | Kamel Nebhi | Christopher Laenzlinger
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

L’analyseur multilingue FiPS permet de transformer une phrase en une structure syntaxique riche et accompagnée d’informations lexicales, grammaticales et thématiques. On décrit ici une application qui adapte les structures en constituants de l’analyseur FiPS à une nomenclature grammaticale permettant la représentation en couleur. Cette application interactive et disponible en ligne (http://latl.unige.ch/fipscolor) peut être utilisée librement par les enseignants et élèves de primaire.

2001

pdf bib abs

Influence de facteurs stylistiques, syntaxiques et lexicaux sur la réalisation de la liaison en français
Cécile Fougeron | Jean-Philippe Goldman | Alicia Dart | Laurence Guélat | Clémentine Jeager
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Les nombreuses recherches portant sur le phénomène de la liaison en français ont pu mettre en évidence l’influence de divers paramètres linguistiques et para-linguistiques sur la réalisation des liaisons. Notre contribution vise à déterminer la contribution relative de certains de ces facteurs en tirant parti d’une méthodologie robuste ainsi que d’outils de traitement automatique du langage. A partir d’un corpus de 5h de parole produit par 10 locuteurs, nous étudions les effets du style de parole (lecture oralisée/parole spontanée), du débit de parole (lecture normale/rapide), ainsi que la contribution de facteurs syntaxiques et lexicaux (longueur et fréquence lexicale) sur la réalisation de la liaison. Les résultats montrent que si plusieurs facteurs étudiés prédisent certaines liaisons, ces facteurs sont souvent interdépendants et ne permettent pas de modéliser avec exactitude la réalisation des liaisons.