SNuC: The Sheffield Numbers Spoken Language Corpus

Emma Barker; Jon Barker; Robert Gaizauskas; Ning Ma; Monica Lestari Paramita

Abstract

We present SNuC, the first published corpus of spoken alphanumeric identifiers of the sort typically used as serial and part numbers in the manufacturing sector. The dataset contains recordings and transcriptions of over 50 native British English speakers, speaking over 13,000 multi-character alphanumeric sequences and totalling almost 20 hours of recorded speech. We describe the requirements taken into account in designing the corpus and the methodology used to construct it. We present summary statistics describing the corpus contents, as well as a preliminary investigation into errors in spoken alphanumeric identifiers. We validate the corpus by showing how it can be used to adapt a deep neural network-based ASR system, resulting in improved recognition accuracy on the task of spoken alphanumeric identifier recognition. Finally, we discuss further potential uses for the corpus and for the tools developed to construct it.

Anthology ID: 2022.lrec-1.212
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1978–1984
URL: https://aclanthology.org/2022.lrec-1.212/
Cite (ACL): Emma Barker, Jon Barker, Robert Gaizauskas, Ning Ma, and Monica Lestari Paramita. 2022. SNuC: The Sheffield Numbers Spoken Language Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1978–1984, Marseille, France. European Language Resources Association.
Cite (Informal): SNuC: The Sheffield Numbers Spoken Language Corpus (Barker et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.212.pdf
