Yaru Wu


2024

Perplexing Canon: A study on GPT-based perplexity of canonical and non-canonical literary works
Yaru Wu | Yuri Bizzoni | Pascale Moreira | Kristoffer Nielbo
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

This study extends previous research on literary quality by using information-theoretic methods to assess the perplexity recorded by three large language models when processing 20th-century English novels deemed to have high literary quality and recognized by experts as canonical, compared to a broader control group. We find that canonical texts appear to elicit higher perplexity in the models, and we explore which textual features might contribute to this effect. We find that a more heavily nominal style, together with a more diverse vocabulary, is one of the leading causes of the difference between the two groups. These traits could reflect “strategies” for achieving an informationally dense literary style.
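The perplexity measure underlying the study can be pictured with a minimal sketch. The paper does not publish code; the function below simply applies the standard definition to per-token log-probabilities, which in practice would come from a language model:

```python
import math

def perplexity(token_logprobs):
    # Standard definition: exp of the negative mean per-token
    # natural-log probability assigned by the model.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 0.25 has perplexity 4:
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 2))  # 4.0
```

A text the model finds harder to predict yields lower token probabilities and hence higher perplexity; this is the quantity compared between the canonical and control novels.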

2022

Using a Knowledge Base to Automatically Annotate Speech Corpora and to Identify Sociolinguistic Variation
Yaru Wu | Fabian Suchanek | Ioana Vasilescu | Lori Lamel | Martine Adda-Decker
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Speech characteristics vary from speaker to speaker. While some variation phenomena are due to the overall communication setting, others are due to diastratic factors such as gender, provenance, age, and social background. The analysis of these factors, although relevant for both the linguistic and speech technology communities, is hampered by the need to annotate existing corpora or to recruit, categorise, and record volunteers as a function of targeted profiles. This paper presents a methodology that uses a knowledge base to provide speaker-specific information. This can facilitate the enrichment of existing corpora with new annotations extracted from the knowledge base. The method also supports large-scale analysis by automatically extracting instances of speech variation to correlate with diastratic features. We apply our method to a corpus of over 120 hours of French broadcast speech and investigate variation patterns linked to reduction phenomena and/or specific to connected speech, such as disfluencies. We find significant differences in speech rate, the use of filler words, and the rate of non-canonical realisations of frequent segments as a function of professional category and age group.
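The enrichment step can be pictured with a toy sketch. Speaker names, fields, and values below are invented for illustration; the paper's actual knowledge base and corpus formats will differ:

```python
# Hypothetical knowledge base mapping speakers to diastratic attributes.
knowledge_base = {
    "Anne Dupont": {"gender": "F", "birth_year": 1970, "profession": "journalist"},
    "Marc Leroy":  {"gender": "M", "birth_year": 1985, "profession": "politician"},
}

# Hypothetical corpus segments, each attributed to a speaker.
corpus_segments = [
    {"speaker": "Anne Dupont", "start": 0.0, "end": 4.2},
    {"speaker": "Marc Leroy",  "start": 4.2, "end": 9.8},
    {"speaker": "Unknown",     "start": 9.8, "end": 12.1},
]

def enrich(segments, kb):
    # Attach knowledge-base attributes when the speaker is found;
    # leave segments with unknown speakers untouched.
    out = []
    for seg in segments:
        merged = dict(seg)
        merged.update(kb.get(seg["speaker"], {}))
        out.append(merged)
    return out

for seg in enrich(corpus_segments, knowledge_base):
    print(seg)
```

Once segments carry such attributes, variation measures (speech rate, fillers, non-canonical realisations) can be grouped by profession or age without manual re-annotation.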

Extracting Linguistic Knowledge from Speech: A Study of Stop Realization in 5 Romance Languages
Yaru Wu | Mathilde Hutin | Ioana Vasilescu | Lori Lamel | Martine Adda-Decker
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper builds upon recent work in leveraging the corpora and tools originally developed for speech technologies in corpus-based linguistic studies. We address the non-canonical realization of consonants in connected speech, focusing on voicing alternation phenomena of stops in 5 standard varieties of Romance languages (French, Italian, Spanish, Portuguese, Romanian). For these languages, both large-scale corpora and speech recognition systems were available for the study. We use forced alignment with pronunciation variants and machine learning techniques to examine the extent to which these frequent phenomena characterize the languages and which factors most strongly trigger them. The results confirm that voicing alternations occur in all five Romance languages. Automatic classification underlines that surrounding context and segment duration are recurring contributing factors in modeling voicing alternation. The study also demonstrates the role that machine learning techniques such as classification algorithms can play in extracting linguistic knowledge from speech and in suggesting new research directions.
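As a loose illustration of the classification setup, the sketch below uses a nearest-centroid classifier over two of the factors the abstract names, segment duration and surrounding context. The features, values, and classifier choice are all assumptions for illustration, not the paper's actual pipeline:

```python
def centroid(rows):
    # Mean of each feature column.
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(len(rows[0])))

def classify(x, centroids):
    # Assign the label whose centroid is closest in squared Euclidean distance.
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# (duration_ms, voiced_context) pairs — invented training examples.
canonical  = [(80, 0), (95, 0), (90, 1)]
alternated = [(45, 1), (50, 1), (40, 0)]
cents = {"canonical": centroid(canonical), "alternated": centroid(alternated)}

# A short stop in a voiced context lands nearer the "alternated" centroid.
print(classify((42, 1), cents))  # alternated
```

Inspecting which features separate the classes (here, duration dominates) is the kind of analysis that lets a classifier surface linguistic knowledge from aligned speech.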

2020

Alternances de voisement et processus de lénition et de fortition : une étude automatisée de grands corpus en cinq langues romanes [Voicing alternations in relation with lenition and fortition phenomena: an automated study of large corpora in five Romance languages]
Ioana Vasilescu | Yaru Wu | Adèle Jatteau | Martine Adda-Decker | Lori Lamel
Traitement Automatique des Langues, Volume 61, Numéro 1 : Varia [Varia]

Réduction temporelle en français spontané : où se cache-t-elle ? Une étude des segments, des mots et séquences de mots fréquemment réduits (Temporal reduction in spontaneous French: where does it hide? A study of frequently reduced segments, words and word sequences)
Yaru Wu | Martine Adda-Decker
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

This study proposes a method, based on automatic speech alignment, suited to studying various variation phenomena in large corpora. The method is applied to temporal reduction in spontaneous French. We characterise temporal reduction as the realisation of sequences of consecutive short segments. About 14% of the corpus is considered reduced. The alignment results show that these zones most often span more than one word (81%); otherwise, word-internal position is the most affected. Among the most heavily reduced word sequences, we find set phrases used as discourse markers.
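The definition of reduced zones as runs of consecutive short segments can be sketched as follows; the duration threshold, minimum run length, and durations are invented for illustration:

```python
SHORT_MS = 40   # hypothetical duration threshold for a "short" segment
MIN_RUN = 3     # hypothetical minimum run length to call a zone reduced

def reduced_zones(durations_ms):
    # Return (start, end) index pairs of runs of >= MIN_RUN consecutive
    # segments shorter than SHORT_MS, as flagged from an alignment.
    zones, run_start = [], None
    for i, d in enumerate(durations_ms):
        if d < SHORT_MS:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= MIN_RUN:
                zones.append((run_start, i))
            run_start = None
    if run_start is not None and len(durations_ms) - run_start >= MIN_RUN:
        zones.append((run_start, len(durations_ms)))
    return zones

print(reduced_zones([60, 30, 25, 35, 80, 30, 90]))  # [(1, 4)]
```

Mapping such zones back to the word tier is what reveals whether a reduced stretch stays within one word or, as in most cases reported, spans several.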

2016

Rôle des contextes lexical et post-lexical dans la réalisation du schwa : apports du traitement automatique de grands corpus (Role of lexical and post-lexical contexts in French schwa realisations: benefits of automatic processing of large corpora)
Yaru Wu | Martine Adda-Decker | Cécile Fougeron
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

The role of context in whether or not schwa is realised in French is well known. Two large oral corpora, of journalistic speech (ETAPE) and of casual speech (NCCFr), in which schwa realisation is determined from an automatic alignment, were used to examine the contribution of the context within the word containing the schwa (lexical) vs. across the boundary with the preceding word (post-lexical). Our results show the importance of the pre-boundary context in explaining schwa deletion in the first syllable of a polysyllabic word in spontaneous speech. If the preceding word ends in a consonant, the three-consonant law and the sonority principle can be invoked to explain differences in behaviour depending on the nature of the consonants in contact.