2024
pdf
bib
abs
Experiments on Speech Synthesis for Teochew, Can Taiwanese Help ?
Pierre Magistry
|
Ilaine Wang
|
Ty Eng Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper reports on our preliminary experiments in speech processing for Teochew, an under-resourced Sinitic language spoken both in China and around the world in diasporan communities. Following the recent uptick of interest in Teochew from heritage speakers of the diaspora and in order to respond to the needs of this community, we develop a Teochew Text-to-Speech system. We describe experiments to build this system and to assess the possible contribution of available resources in Taiwanese Hokkien, the closest language with a significant body of resources. The results of these experiments are not as conclusive as we expected: the Taiwanese dataset did not help our model significantly, but considering our objectives, we find it encouraging that they show that a large training dataset was not necessary for this precise task. A promising model could still be obtained with only a small dataset of Teochew. We hope that this work inspires other communities of speakers of languages in a revitalization phase.
2023
pdf
bib
abs
Ertim at SemEval-2023 Task 2: Fine-tuning of Transformer Language Models and External Knowledge Leveraging for NER in Farsi, English, French and Chinese
Kevin Deturck
|
Pierre Magistry
|
Bénédicte Diot-Parvaz Ahmad
|
Ilaine Wang
|
Damien Nouvel
|
Hugo Lafayette
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Transformer language models are now a solid baseline for Named Entity Recognition and can be significantly improved by leveraging complementary resources, either by integrating external knowledge or by annotating additional data. In a preliminary step, this work presents experiments on fine-tuning transformer models. Then, a set of experiments has been conducted with a Wikipedia-based reclassification system. Additionally, we conducted a small annotation campaign on the Farsi language to evaluate the impact of additional data. These two methods with complementary resources showed improvements compared to fine-tuning only.
2022
pdf
bib
abs
Towards a Unified ASR System for the Armenian Standards
Samuel Chakmakjian
|
Ilaine Wang
Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference
Armenian is a traditionally under-resourced language, which has seen a recent uptick in interest in the development of its tools and presence in the digital domain. Some of this recent interest has centred around the development of Automatic Speech Recognition (ASR) technologies. However, the language boasts two standard variants which diverge on multiple typological and structural levels. In this work, we examine some of the available bodies of data for ASR construction, present the challenges in the processing of these data and propose a methodology going forward.
2020
pdf
bib
abs
ODIL_Syntax: a Free Spontaneous Spoken French Treebank Annotated with Constituent Trees
Ilaine Wang
|
Aurore Pelletier
|
Jean-Yves Antoine
|
Anaïs Halftermeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper describes ODIL Syntax, a French treebank built on spontaneous speech transcripts. The syntactic structure of every speech turn is represented by constituent trees, through a procedure which combines an automatic annotation provided by a parser (here, the Stanford Parser) and a manual revision. ODIL Syntax respects the annotation scheme designed for the French TreeBank (FTB), with the addition of some annotation guidelines that aims at representing specific features of the spoken language such as speech disfluencies. The corpus will be freely distributed by January 2020 under a Creative Commons licence. It will ground a further semantic enrichment dedicated to the representation of temporal entities and temporal relations, as a second phase of the ODIL@Temporal project. The paper details the annotation scheme we followed with a emphasis on the representation of speech disfluencies. We then present the annotation procedure that was carried out on the Contemplata annotation platform. In the last section, we provide some distributional characteristics of the annotated corpus (POS distribution, multiword expressions).
pdf
bib
abs
Contemplata, a Free Platform for Constituency Treebank Annotation
Jakub Waszczuk
|
Ilaine Wang
|
Jean-Yves Antoine
|
Anaïs Halftermeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper describes Contemplata, an annotation platform that offers a generic solution for treebank building as well as treebank enrichment with relations between syntactic nodes. Contemplata is dedicated to the annotation of constituency trees. The framework includes support for syntactic parsers, which provide automatic annotations to be manually revised. The balanced strategy of annotation between automatic parsing and manual revision allows to reduce the annotator workload, which favours data reliability. The paper presents the software architecture of Contemplata, describes its practical use and eventually gives two examples of annotation projects that were conducted on the platform.
2016
pdf
bib
From built examples to attested examples: a syntax-based query for non-specialists
Ilaine Wang
|
Sylvain Kahane
|
Isabelle Tellier
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters
2014
pdf
bib
abs
Macrosyntactic Segmenters of a French Spoken Corpus
Ilaine Wang
|
Sylvain Kahane
|
Isabelle Tellier
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The aim of this paper is to describe an automated process to segment spoken French transcribed data into macrosyntactic units. While sentences are delimited by punctuation marks for written data, there is no obvious hint nor limit to major units for speech. As a reference, we used the manual annotation of macrosyntactic units based on illocutionary as well as syntactic criteria and developed for the Rhapsodie corpus, a 33.000 words prosodic and syntactic treebank. Our segmenters were built using machine learning methods as supervised classifiers : segmentation is about identifying the boundaries of units, which amounts to classifying each interword space. We trained six different models on Rhapsodie using different sets of features, including prosodic and morphosyntactic cues, on the assumption that their combination would be relevant for the task. Both types of cues could be resulting either from manual annotation/correction or from fully automated processes, which comparison might help determine the cost of manual effort, especially for the 3M words of spoken French of the Orfeo project those experiments are contributing to.
pdf
bib
Can we chunk well with bad POS labels? (Peut-on bien chunker avec de mauvaises étiquettes POS ?) [in French]
Isabelle Tellier
|
Iris Eshkol-Taravella
|
Yoann Dupont
|
Ilaine Wang
Proceedings of TALN 2014 (Volume 1: Long Papers)