Ilaine Wang

2025

Participatory Science and the ANR DiLSi Project (La science participative et l’ANR DiLSi) [in French]
Pierre Magistry | Ilaine Wang
Actes de l'atelier Science Participative pour les Données et Corpus Linguistiques 2025 (ParCol)

This paper reports on lessons learned from the interactions between the DiLSi project and the communities of speakers of diaspora Teochew and of Tâigí.

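Writing Systems and Data Quality: Fine-tuning Transliteration Models in a Low-Resource Context (Systèmes d’écriture et qualité des données : l’affinage de modèles de translittération dans un contexte de faibles ressources) [in French]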
Emmett Strickland | Ilaine Wang | Damien Nouvel | Bénédicte Diot-Parvaz Ahmad
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

This article presents an experiment aimed at building fine-tuned romanization models for eleven languages, some of which are considered low-resource. We show that an effective romanization model can be created by fine-tuning a base model trained on a large corpus of one or more other languages. The writing system appears to play a role in the effectiveness of some fine-tuned models. We also present methods for assessing the quality of training and evaluation data, and compare our best-performing Arabic model with a reference model.

2024

Experiments on Speech Synthesis for Teochew, Can Taiwanese Help?
Pierre Magistry | Ilaine Wang | Ty Eng Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper reports on our preliminary experiments in speech processing for Teochew, an under-resourced Sinitic language spoken both in China and around the world in diaspora communities. Following the recent uptick of interest in Teochew from heritage speakers of the diaspora and in order to respond to the needs of this community, we develop a Teochew Text-to-Speech system. We describe experiments to build this system and to assess the possible contribution of available resources in Taiwanese Hokkien, the closest language with a significant body of resources. The results of these experiments are not as conclusive as we expected: the Taiwanese dataset did not help our model significantly, but considering our objectives, we find it encouraging that they show that a large training dataset was not necessary for this precise task. A promising model could still be obtained with only a small dataset of Teochew. We hope that this work inspires other communities of speakers of languages in a revitalization phase.

Design of a Taiwan Taigi Treebank Aligned on Mandarin and Teochew Translations
Pierre Magistry | Ilaine Wang
Proceedings of the 36th Conference on Computational Linguistics and Speech Processing (ROCLING 2024)

2023

Ertim at SemEval-2023 Task 2: Fine-tuning of Transformer Language Models and External Knowledge Leveraging for NER in Farsi, English, French and Chinese
Kevin Deturck | Pierre Magistry | Bénédicte Diot-Parvaz Ahmad | Ilaine Wang | Damien Nouvel | Hugo Lafayette
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Transformer language models are now a solid baseline for Named Entity Recognition and can be significantly improved by leveraging complementary resources, either by integrating external knowledge or by annotating additional data. In a preliminary step, this work presents experiments on fine-tuning transformer models. Then, a set of experiments has been conducted with a Wikipedia-based reclassification system. Additionally, we conducted a small annotation campaign on the Farsi language to evaluate the impact of additional data. These two methods with complementary resources showed improvements compared to fine-tuning only.

2022

Towards a Unified ASR System for the Armenian Standards
Samuel Chakmakjian | Ilaine Wang
Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference

Armenian is a traditionally under-resourced language, which has seen a recent uptick in interest in the development of its tools and presence in the digital domain. Some of this recent interest has centred around the development of Automatic Speech Recognition (ASR) technologies. However, the language boasts two standard variants which diverge on multiple typological and structural levels. In this work, we examine some of the available bodies of data for ASR construction, present the challenges in the processing of these data and propose a methodology going forward.

2020

ODIL_Syntax: a Free Spontaneous Spoken French Treebank Annotated with Constituent Trees
Ilaine Wang | Aurore Pelletier | Jean-Yves Antoine | Anaïs Halftermeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes ODIL Syntax, a French treebank built on spontaneous speech transcripts. The syntactic structure of every speech turn is represented by constituent trees, through a procedure which combines an automatic annotation provided by a parser (here, the Stanford Parser) and a manual revision. ODIL Syntax respects the annotation scheme designed for the French TreeBank (FTB), with the addition of some annotation guidelines that aim at representing specific features of the spoken language such as speech disfluencies. The corpus will be freely distributed by January 2020 under a Creative Commons licence. It will ground a further semantic enrichment dedicated to the representation of temporal entities and temporal relations, as a second phase of the ODIL@Temporal project. The paper details the annotation scheme we followed, with an emphasis on the representation of speech disfluencies. We then present the annotation procedure that was carried out on the Contemplata annotation platform. In the last section, we provide some distributional characteristics of the annotated corpus (POS distribution, multiword expressions).

Contemplata, a Free Platform for Constituency Treebank Annotation
Jakub Waszczuk | Ilaine Wang | Jean-Yves Antoine | Anaïs Halftermeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes Contemplata, an annotation platform that offers a generic solution for treebank building as well as treebank enrichment with relations between syntactic nodes. Contemplata is dedicated to the annotation of constituency trees. The framework includes support for syntactic parsers, which provide automatic annotations to be manually revised. The balanced strategy of annotation between automatic parsing and manual revision reduces the annotator workload, which favours data reliability. The paper presents the software architecture of Contemplata, describes its practical use and finally gives two examples of annotation projects that were conducted on the platform.

2016

From built examples to attested examples: a syntax-based query for non-specialists
Ilaine Wang | Sylvain Kahane | Isabelle Tellier
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

2014

Macrosyntactic Segmenters of a French Spoken Corpus
Ilaine Wang | Sylvain Kahane | Isabelle Tellier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The aim of this paper is to describe an automated process to segment spoken French transcribed data into macrosyntactic units. While sentences are delimited by punctuation marks in written data, speech offers no obvious cue to the boundaries of major units. As a reference, we used the manual annotation of macrosyntactic units based on illocutionary as well as syntactic criteria and developed for the Rhapsodie corpus, a 33,000-word prosodic and syntactic treebank. Our segmenters were built using machine learning methods as supervised classifiers: segmentation is about identifying the boundaries of units, which amounts to classifying each interword space. We trained six different models on Rhapsodie using different sets of features, including prosodic and morphosyntactic cues, on the assumption that their combination would be relevant for the task. Both types of cues could result either from manual annotation/correction or from fully automated processes, and comparing the two might help determine the cost of manual effort, especially for the 3M words of spoken French of the Orfeo project, to which these experiments contribute.

Can we chunk well with bad POS labels? (Peut-on bien chunker avec de mauvaises étiquettes POS ?) [in French]
Isabelle Tellier | Iris Eshkol-Taravella | Yoann Dupont | Ilaine Wang
Proceedings of TALN 2014 (Volume 1: Long Papers)