Mario Perez-Enriquez


2024

Automatic Punctuation Model for Spanish Live Transcriptions
Mario Perez-Enriquez | Jose Manuel Masiello-Ruiz | Jose Luis Lopez-Cuadrado | Israel Gonzalez-Carrasco | Paloma Martinez-Fernandez | Belen Ruiz-Mezcua
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

With the widespread adoption of automatic transcription tools, acquiring speech transcriptions within seconds has become a reality. Nonetheless, many of these tools yield unpunctuated outputs, potentially incurring additional costs. This paper presents a novel approach to integrating punctuation into the transcriptions generated by such automatic tools, focusing specifically on Spanish-speaking contexts. Leveraging the RoBERTa-bne model, pre-trained with data from the Spanish National Library, our training proposal is augmented with additional corpora to enhance performance on less common punctuation marks, such as question marks. The proposed model has been trained by fine-tuning pre-trained models, adapting them for token classification and using SoftMax to select the highest-probability punctuation label for each token. The proposed model obtains promising results when compared with reference models from other papers on Spanish punctuation. Ultimately, this model aims to punctuate live transcriptions seamlessly and accurately. The proposed model will be applied to a real-case education project to improve the readability of the transcriptions.
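The token-classification step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label set (`O`, `COMMA`, `PERIOD`, `QUESTION`), the helper names `softmax` and `punctuate`, and the logit values are all assumptions made for illustration, and a real system would obtain the per-token scores from the fine-tuned model rather than from a hard-coded list.

```python
import math

# Hypothetical label set; the labels used in the paper are not listed here.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]
MARKS = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def punctuate(words, logits_per_word):
    """Append to each word the mark whose label has the highest probability."""
    out = []
    for word, logits in zip(words, logits_per_word):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        out.append(word + MARKS[LABELS[best]])
    return " ".join(out)


# Made-up scores standing in for the output of a token-classification head.
words = ["hola", "como", "estas"]
logits = [
    [0.2, 2.5, 0.1, 0.0],  # highest score: COMMA
    [3.0, 0.1, 0.2, 0.3],  # highest score: O (no punctuation)
    [0.1, 0.2, 0.5, 2.9],  # highest score: QUESTION
]
print(punctuate(words, logits))  # hola, como estas?
```

The sketch only attaches a mark after each word; opening marks such as "¿" and capitalization would need additional labels or a post-processing pass, which are left out here.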