Automatic Punctuation Model for Spanish Live Transcriptions

Mario Perez-Enriquez, Jose Manuel Masiello-Ruiz, Jose Luis Lopez-Cuadrado, Israel Gonzalez-Carrasco, Paloma Martinez-Fernandez, Belen Ruiz-Mezcua


Abstract
With the widespread adoption of automatic transcription tools, obtaining speech transcriptions within seconds has become a reality. Nonetheless, many of these tools produce unpunctuated output, which can incur additional costs. This paper presents a novel approach to adding punctuation to the transcriptions generated by such tools, focusing specifically on Spanish. Building on the RoBERTa-bne model, which is pre-trained on data from the Spanish National Library, our training proposal adds further corpora to improve performance on less common punctuation marks, such as question marks. The proposed model is trained by fine-tuning pre-trained models adapted for token classification, with SoftMax used to identify the highest-probability class for each token. It obtains promising results when compared with the models reported in other Spanish reference papers. Ultimately, this model aims to punctuate live transcriptions seamlessly and accurately. It will be applied in a real-world education project to improve the readability of the transcriptions.
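The token-classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label set (`O`, `COMMA`, `PERIOD`, `QUESTION`), the hand-written logits, and the helper names `softmax`, `predict_labels`, and `punctuate` are all assumptions made for illustration. In a real system the per-token scores would come from a fine-tuned model such as RoBERTa-bne; here they are hard-coded so the example runs without any dependencies.

```python
import math

# Hypothetical label inventory; the abstract does not list the paper's labels.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]
MARKS = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(token_logits):
    """For each token, pick the label with the highest softmax probability."""
    labels = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(LABELS[best])
    return labels


def punctuate(words, labels):
    """Attach predicted marks to words; simplified handling of Spanish '¿'."""
    out = []
    start = 0  # index in `out` where the current sentence begins
    for word, label in zip(words, labels):
        if len(out) == start:
            # First word of a sentence: capitalize its first letter.
            word = word[:1].upper() + word[1:]
        out.append(word + MARKS[label])
        if label == "QUESTION":
            # Assumption: the opening mark goes at the start of the sentence.
            out[start] = "¿" + out[start]
        if label in ("PERIOD", "QUESTION"):
            start = len(out)
    return " ".join(out)


# Made-up scores standing in for model output (one row per word).
words = ["cómo", "estás", "yo", "estoy", "bien", "gracias"]
token_logits = [
    [2.0, 0.1, 0.1, 0.3],  # cómo
    [0.2, 0.1, 0.4, 2.5],  # estás
    [3.0, 0.2, 0.1, 0.1],  # yo
    [1.5, 0.2, 0.1, 0.0],  # estoy
    [0.1, 2.2, 0.3, 0.0],  # bien
    [0.0, 0.3, 2.8, 0.1],  # gracias
]
labels = predict_labels(token_logits)
text = punctuate(words, labels)
print(text)
```

Placing the opening "¿" at the start of the sentence is a simplification; a question that begins mid-sentence would need a separate label or rule.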
Anthology ID:
2024.lrec-main.175
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
1953–1958
URL:
https://aclanthology.org/2024.lrec-main.175
Cite (ACL):
Mario Perez-Enriquez, Jose Manuel Masiello-Ruiz, Jose Luis Lopez-Cuadrado, Israel Gonzalez-Carrasco, Paloma Martinez-Fernandez, and Belen Ruiz-Mezcua. 2024. Automatic Punctuation Model for Spanish Live Transcriptions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1953–1958, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Automatic Punctuation Model for Spanish Live Transcriptions (Perez-Enriquez et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.175.pdf