Vincenzo Norman Vitale


2024

pdf bib
Modelling Filled Particles and Prolongation Using End-to-end Automatic Speech Recognition Systems: A Quantitative and Qualitative Analysis.
Vincenzo Norman Vitale | Loredana Schettino | Francesco Cutugno
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable perfor mances. However, phenomena that characterize spoken language such as fillers (eeh ehm) or segmental prolongations (theee) are still mostly considered as disrupting objects that should not be included to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – tend to model some speech features that are functional for the identification of filled pauses and prolongation in speech. This work builds upon these findings by investigating which of the two systems is better at fillers and prolongations detection tasks and by conducting an error analysis to deepen our understanding of how these systems work.