2024
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido | Sara Papi | Luisa Bentivogli | Alessio Brutti | Mauro Cettolo | Roberto Gretter | Marco Matassoni | Mohamed Nabih | Matteo Negri
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
2022
Cross-corpora experiments of automatic proficiency assessment and error detection for spoken English
Stefano Bannò | Marco Matassoni
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
The growing demand for learning English as a second language has led to an increasing interest in automatic approaches for assessing spoken language proficiency. One of the most significant challenges in this field is the lack of publicly available annotated spoken data. Another common issue is the lack of consistency and coherence in human assessment. To tackle both problems, in this paper we address the task of automatically predicting the scores of spoken test responses of English-as-a-second-language learners by training neural models on written data and using the presence of grammatical errors as a feature, as they can be considered consistent indicators of proficiency through their distribution and frequency. Specifically, we train a feature extractor on EFCAMDAT, a large written corpus containing error annotations and proficiency levels assigned by human experts, in order to extract information related to grammatical errors and, in turn, we use the resulting model for inference on the CLC-FCE corpus, on the ICNALE corpus, and on the spoken section of the TLT-school corpus, a collection of proficiency tests taken by Italian students. The work investigates the impact of the feature extractor on spoken proficiency assessment as well as the written-to-spoken approach. We find that our error-based approach can be beneficial for assessing spoken proficiency. The results obtained on the considered datasets are discussed and evaluated with appropriate metrics.
2021
Seed Words Based Data Selection for Language Model Adaptation
Roberto Gretter | Marco Matassoni | Daniele Falavigna
Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW)
We address the problem of language model customization in applications where the ASR component needs to manage domain-specific terminology; although current state-of-the-art speech recognition technology provides excellent results for generic domains, adaptation to specialized dictionaries or glossaries is still an open issue. In this work we present an approach for automatically selecting sentences from a text corpus that match, both semantically and morphologically, a glossary of terms (words or composite words) furnished by the user. The final goal is to rapidly adapt the language model of a hybrid ASR system with a limited amount of in-domain text data in order to successfully cope with the linguistic domain at hand; the vocabulary of the baseline model is expanded and tailored, reducing the resulting OOV rate. Data selection strategies based on shallow morphological seeds and semantic similarity via word2vec are introduced and discussed; the experimental setting consists of a simultaneous interpreting scenario, where ASR systems in three languages are designed to recognize the domain-specific terms (here, from dentistry). Results using different metrics (OOV rate, WER, precision and recall) show the effectiveness of the proposed techniques.
SmarTerp: A CAI System to Support Simultaneous Interpreters in Real-Time
Susana Rodriguez | Roberto Gretter | Marco Matassoni | Alvaro Alonso | Oscar Corcho | Mariano Rico | Daniele Falavigna
Proceedings of the Translation and Interpreting Technology Online Conference
We present a system to support simultaneous interpreting in specific domains. The system is going to be developed through a strong synergy between technicians, mostly experts in both speech and text processing, and end users, i.e. professional interpreters who define the requirements and will test the final product. Encouraging preliminary results have been achieved on benchmark tests collected to measure the performance of individual components of the system, namely automatic speech recognition (ASR) and named entity recognition.
2020
TLT-school: a Corpus of Non Native Children Speech
Roberto Gretter | Marco Matassoni | Stefano Bannò | Daniele Falavigna
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper describes “TLT-school”, a corpus of speech utterances collected in schools of northern Italy for assessing the performance of students learning both English and German. The corpus was recorded in 2017 and 2018 from students aged between nine and sixteen years, attending primary, middle and high school. All utterances have been scored by human experts in terms of some predefined proficiency indicators. In addition, most of the utterances recorded in 2017 have been carefully transcribed by hand. The guidelines and procedures used for the manual transcriptions are described in detail, as well as the results achieved by an automatic speech recognition system developed by the authors. Part of the corpus is going to be freely distributed to the scientific community, in particular to researchers interested in both non-native speech recognition and automatic assessment of second language proficiency.
Automatically Assess Children’s Reading Skills
Ornella Mich | Nadia Mana | Roberto Gretter | Marco Matassoni | Daniele Falavigna
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)
Assessing reading skills is an important task teachers have to perform at the beginning of a new school year to evaluate the starting level of the class and properly plan the next learning activities. Digital tools based on automatic speech recognition (ASR) can support teachers in this task, which is currently very time-consuming and prone to human error. This paper presents a web application for automatically assessing fluency and accuracy of oral reading in children attending Italian primary and lower secondary schools. Our system, based on ASR technology, implements Cornoldi’s MT battery, a well-known Italian test for assessing reading skills. The front-end of the system has been designed following the participatory design approach, involving end users from the beginning of the creation process. Teachers may use our system both to test students’ reading skills and to monitor their performance over time. To this end, the system offers a graphical visualization of the assessment results for both individual students and the entire class. The paper also presents the results of a pilot study to evaluate the system's usability with teachers.
2000
Annotation of a Multichannel Noisy Speech Corpus
L. Cristoforetti | M. Matassoni | M. Omologo | P. Svaizer | E. Zovato
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)