2024
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido | Sara Papi | Luisa Bentivogli | Alessio Brutti | Mauro Cettolo | Roberto Gretter | Marco Matassoni | Mohamed Nabih | Matteo Negri
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
2021
Seed Words Based Data Selection for Language Model Adaptation
Roberto Gretter | Marco Matassoni | Daniele Falavigna
Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW)
We address the problem of language model customization in applications where the ASR component needs to manage domain-specific terminology; although current state-of-the-art speech recognition technology provides excellent results for generic domains, adaptation to specialized dictionaries or glossaries is still an open issue. In this work we present an approach for automatically selecting sentences from a text corpus that match, both semantically and morphologically, a glossary of terms (words or composite words) furnished by the user. The final goal is to rapidly adapt the language model of a hybrid ASR system with a limited amount of in-domain text data in order to successfully cope with the linguistic domain at hand; the vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate. Data selection strategies based on shallow morphological seeds and semantic similarity via word2vec are introduced and discussed; the experimental setting consists of a simultaneous interpreting scenario, where ASR systems in three languages are designed to recognize the domain-specific terms (i.e., dentistry). Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
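The two ideas named in this abstract, seed-based sentence selection and the OOV rate, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed-length prefix used as a stand-in for a "shallow morphological seed", and the toy corpus are all assumptions made for illustration, and the word2vec-based semantic similarity is omitted.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def select_by_seeds(sentences, glossary, stem_len=5):
    """Keep sentences containing a word that shares a prefix with a glossary term.

    Assumption: a fixed-length prefix approximates a shallow morphological
    seed; the seed definition used in the paper may differ.
    """
    seeds = {term.lower()[:stem_len] for term in glossary}
    selected = []
    for sentence in sentences:
        words = sentence.lower().split()
        if any(w[:stem_len] in seeds for w in words):
            selected.append(sentence)
    return selected


# Toy data (hypothetical), loosely themed on the dentistry domain.
corpus = [
    "The implants were placed under local anesthesia",
    "The weather is nice today",
    "Implantology requires careful planning",
]
glossary = ["implant"]
baseline_vocab = {"the", "were", "placed", "under", "local", "is", "nice", "today"}

selected = select_by_seeds(corpus, glossary)

# Expand the baseline vocabulary with words from the selected sentences.
adapted_vocab = baseline_vocab | {w for s in selected for w in s.lower().split()}

test_tokens = "the implants were placed".split()
oov_before = oov_rate(test_tokens, baseline_vocab)
oov_after = oov_rate(test_tokens, adapted_vocab)
print(selected, oov_before, oov_after)
```

In this toy setting, the selected sentences bring the in-domain word into the vocabulary, so the OOV rate on the test tokens drops after adaptation. A real system would retrain or interpolate the language model on the selected text rather than only extending the word list.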
SmarTerp: A CAI System to Support Simultaneous Interpreters in Real-Time
Susana Rodriguez | Roberto Gretter | Marco Matassoni | Alvaro Alonso | Oscar Corcho | Mariano Rico | Daniele Falavigna
Proceedings of the Translation and Interpreting Technology Online Conference
We present a system to support simultaneous interpreting in specific domains. The system is going to be developed through a strong synergy among technicians, mostly experts on both speech and text processing, and end-users, i.e. professional interpreters who define the requirements and will test the final product. Some preliminary encouraging results have been achieved on benchmark tests collected with the aim of measuring the performance of single components of the whole system, namely: automatic speech recognition (ASR) and named entity recognition.
2020
TLT-school: a Corpus of Non Native Children Speech
Roberto Gretter | Marco Matassoni | Stefano Bannò | Daniele Falavigna
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper describes “TLT-school”, a corpus of speech utterances collected in schools of northern Italy for assessing the performance of students learning both English and German. The corpus was recorded in 2017 and 2018 from students aged between nine and sixteen, attending primary, middle and high school. All utterances have been scored by human experts in terms of predefined proficiency indicators. In addition, most of the utterances recorded in 2017 have been carefully transcribed by hand. The guidelines and procedures used for the manual transcriptions are described in detail, as are the results achieved with an automatic speech recognition system developed by us. Part of the corpus is going to be freely distributed to the scientific community, particularly to those interested in non-native speech recognition and automatic assessment of second language proficiency.
Automatically Assess Children’s Reading Skills
Ornella Mich | Nadia Mana | Roberto Gretter | Marco Matassoni | Daniele Falavigna
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)
Assessing reading skills is an important task that teachers have to perform at the beginning of a new school year to evaluate the starting level of the class and properly plan the next learning activities. Digital tools based on automatic speech recognition (ASR) can be very useful in supporting teachers in this task, which is currently very time-consuming and prone to human error. This paper presents a web application for automatically assessing the fluency and accuracy of oral reading in children attending Italian primary and lower secondary schools. Our system, based on ASR technology, implements Cornoldi’s MT battery, a well-known Italian test for assessing reading skills. The front-end of the system has been designed following the participatory design approach, involving end users from the beginning of the creation process. Teachers may use our system both to test students’ reading skills and to monitor their performance over time. In fact, the system offers an effective graphical visualization of the assessment results for both individual students and the entire class. The paper also presents the results of a pilot study evaluating the usability of the system with teachers.
2014
Euronews: a multilingual speech corpus for ASR
Roberto Gretter
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we present a multilingual speech corpus designed for Automatic Speech Recognition (ASR) purposes. The data come from the Euronews portal and were acquired both from the Web and from TV. The corpus includes data in 10 languages (Arabic, English, French, German, Italian, Polish, Portuguese, Russian, Spanish and Turkish) and was designed both to train acoustic models (AMs) and to evaluate ASR performance. For each language, the corpus is composed of about 100 hours of speech for training (60 for Polish) and about 4 hours, manually transcribed, for testing. Training data include the audio, some reference text, the ASR output and their alignment. We plan to make public at least part of the benchmark in view of a multilingual ASR benchmark for IWSLT 2014.
2013
FBK @ IWSLT 2013 – ASR tracks
Daniele Falavigna | Roberto Gretter | Fabio Brugnara | Diego Giuliani
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign
This paper reports on the participation of FBK in the IWSLT 2013 evaluation campaign on automatic speech recognition (ASR), specifically in both the English and German ASR tracks. Only primary submissions have been sent for evaluation. For English, the ASR system features acoustic models trained on a portion of the TED talk recordings that was automatically selected according to the fidelity of the provided transcriptions. Two decoding steps are performed, interleaved with acoustic feature normalization and acoustic model adaptation. A final step combines the outputs obtained after rescoring the word graphs generated in the second decoding step with 4 different language models. The latter are trained on out-of-domain text data, in-domain data and several sets of automatically selected data. For German, acoustic models have been trained on automatically selected portions of a broadcast news corpus called “Euronews”. Unlike for English, in this case only two decoding steps are carried out, without any rescoring procedure.
1991
Stochastic Context-Free Grammars for Island-Driven Probabilistic Parsing
Anna Corazza | Renato De Mori | Roberto Gretter | Giorgio Satta
Proceedings of the Second International Workshop on Parsing Technologies
In automatic speech recognition, the use of language models improves performance. Stochastic language models fit the uncertainty created by acoustic pattern matching rather well. These models are used to score theories corresponding to partial interpretations of sentences. Algorithms have been developed to compute probabilities for theories that grow in a strictly left-to-right fashion. In this paper we consider new relations to compute probabilities of partial interpretations of sentences. We introduce theories containing a gap corresponding to an uninterpreted signal segment. Algorithms can be easily obtained from these relations. The computational complexity of these algorithms is also derived.