Finn Årup Nielsen


2021

The Danish Gigaword Corpus
Leon Strømberg-Derczynski | Manuel Ciosici | Rebekah Baglini | Morten H. Christiansen | Jacob Aarup Dalsgaard | Riccardo Fusaroli | Peter Juel Henrichsen | Rasmus Hvingelby | Andreas Kirkedal | Alex Speed Kjeldsen | Claus Ladefoged | Finn Årup Nielsen | Jens Madsen | Malte Lau Petersen | Jonathan Hvithamar Rystrøm | Daniel Varab
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.

2019

Danish in Wikidata lexemes
Finn Årup Nielsen
Proceedings of the 10th Global Wordnet Conference

Wikidata introduced support for lexicographic data in 2018. Here we describe the lexicographic part of Wikidata as well as experiences with setting up lexemes for the Danish language. We note various possible annotations for lexemes and discuss the choices made.