Nicolas Hervé
2022
Using ASR-Generated Text for Spoken Language Modeling
Nicolas Hervé | Valentin Pelloin | Benoit Favre | Franck Dary | Antoine Laurent | Sylvain Meignier | Laurent Besacier
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
This paper aims to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) will be shared with the community and are evaluated not only in terms of word prediction accuracy but also on two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral outperforms its initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be useful for improving spoken language modeling.
2020
A French Corpus for Event Detection on Twitter
Béatrice Mazoyer | Julia Cagé | Nicolas Hervé | Céline Hudelot
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present Event2018, a corpus annotated for event detection tasks, consisting of 38 million tweets in French (retweets excluded) including more than 130,000 tweets manually annotated by three annotators as related or unrelated to a given event. The 243 events were selected both from press articles and from subjects trending on Twitter during the annotation period (July to August 2018). In total, more than 95,000 tweets were annotated as related to one of the selected events. We also provide the titles and URLs of 15,500 news articles automatically detected as related to these events. In addition to this corpus, we detail the results of our event detection experiments on both this dataset and another publicly available dataset of tweets in English. We ran extensive tests with different types of text embeddings and a standard Topic Detection and Tracking algorithm, and detail our evaluation method. We show that tf-idf vectors allow the best performance for this task on both corpora. These results are intended to serve as a baseline for researchers wishing to test their own event detection systems on our corpus.
French Tweet Corpus for Automatic Stance Detection
Marc Evrard | Rémi Uro | Nicolas Hervé | Béatrice Mazoyer
Proceedings of the Twelfth Language Resources and Evaluation Conference
The automatic stance detection task consists of determining the attitude expressed in a text toward a target (text, claim, or entity). This is a typical intermediate task for fake news detection or analysis, a widespread and particularly difficult problem. This work aims to create a human-annotated corpus for the automatic stance detection of tweets written in French. It exploits a corpus of tweets collected during July and August 2018. To the best of our knowledge, this is the first freely available stance-annotated tweet corpus in the French language. The four classes broadly adopted by the community were chosen for the annotation: support, deny, query, and comment, with the addition of the ignore class. This paper presents the corpus along with the tools used to build it, its construction, an analysis of the inter-rater reliability, as well as the challenges and questions that were raised during the building process.
Co-authors
- Béatrice Mazoyer 2
- Laurent Besacier 1
- Julia Cagé 1
- Franck Dary 1
- Marc Evrard 1