CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus

Li Nguyen, Christopher Bryant


Abstract
This paper introduces the Canberra Vietnamese-English Code-switching corpus (CanVEC), an original corpus of natural mixed speech that we semi-automatically annotated with language information, part of speech (POS) tags and Vietnamese translations. The corpus, which was built to inform a sociolinguistic study on language variation and code-switching, consists of 10 hours of recorded speech (87k tokens) between 45 Vietnamese-English bilinguals living in Canberra, Australia. We describe how we collected and annotated the corpus by pipelining several monolingual toolkits to considerably speed up the annotation process. We also describe how we evaluated the automatic annotations to ensure corpus reliability. We make the corpus available for research purposes.
Anthology ID:
2020.lrec-1.507
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4121–4129
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.507
Cite (ACL):
Li Nguyen and Christopher Bryant. 2020. CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4121–4129, Marseille, France. European Language Resources Association.
Cite (Informal):
CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus (Nguyen & Bryant, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.507.pdf