The Norwegian Dialect Corpus Treebank

Andre Kåsen, Kristin Hagen, Anders Nøklestad, Joel Priestly, Per Erik Solberg, Dag Trygve Truslew Haug


Abstract
This paper presents the NDC Treebank of spoken Norwegian dialects, transcribed in the Bokmål variety of Norwegian. It consists of dialect recordings made between 2006 and 2012 that have been digitised, segmented, transcribed, and subsequently annotated with morphological and syntactic analyses. The nature of the spoken data gives rise to various challenges in both segmentation and annotation. In our annotation principles we follow earlier efforts for Norwegian, in particular the LIA Treebank of spoken dialects transcribed in the Nynorsk variety of Norwegian, to ensure interoperability of the resources. We have developed a spoken-language parser on the basis of the annotated material and report its accuracy both on a test set across the dialects and in experiments that hold out single dialects.
Anthology ID:
2022.lrec-1.516
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4827–4832
URL:
https://aclanthology.org/2022.lrec-1.516
Cite (ACL):
Andre Kåsen, Kristin Hagen, Anders Nøklestad, Joel Priestly, Per Erik Solberg, and Dag Trygve Truslew Haug. 2022. The Norwegian Dialect Corpus Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4827–4832, Marseille, France. European Language Resources Association.
Cite (Informal):
The Norwegian Dialect Corpus Treebank (Kåsen et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.516.pdf