Dag Trygve Truslew Haug
The Norwegian Dialect Corpus Treebank
Andre Kåsen | Kristin Hagen | Anders Nøklestad | Joel Priestly | Per Erik Solberg | Dag Trygve Truslew Haug
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents the NDC Treebank of spoken Norwegian dialects, transcribed in the Bokmål variety of Norwegian. It consists of dialect recordings made between 2006 and 2012 that have been digitised, segmented, transcribed, and subsequently annotated with morphological and syntactic analyses. The nature of spoken data gives rise to various challenges in both segmentation and annotation. To ensure that the resources can be used together, we follow the annotation principles of earlier efforts for Norwegian, in particular the LIA Treebank of spoken dialects transcribed in the Nynorsk variety of Norwegian. We have developed a spoken-language parser on the basis of the annotated material and report its accuracy both on a test set spanning the dialects and when single dialects are held out.