Peter Juel Henrichsen

2021

The Danish Gigaword Corpus
Leon Strømberg-Derczynski | Manuel Ciosici | Rebekah Baglini | Morten H. Christiansen | Jacob Aarup Dalsgaard | Riccardo Fusaroli | Peter Juel Henrichsen | Rasmus Hvingelby | Andreas Kirkedal | Alex Speed Kjeldsen | Claus Ladefoged | Finn Årup Nielsen | Jens Madsen | Malte Lau Petersen | Jonathan Hvithamar Rystrøm | Daniel Varab
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.

2020

World Class Language Technology - Developing a Language Technology Strategy for Danish
Sabine Kirchmeier | Bolette Pedersen | Sanni Nimb | Philip Diderichsen | Peter Juel Henrichsen
Proceedings of the 12th Language Resources and Evaluation Conference

Although Denmark is one of the most digitized countries in Europe, no coordinated efforts have been made in recent years to support the Danish language with regard to language technology and artificial intelligence. In March 2019, however, the Danish government adopted a new, ambitious strategy for LT and artificial intelligence. In this paper, we describe the process behind the development of the language-related parts of the strategy: a Danish Language Technology Committee was constituted, and a comprehensive series of workshops was organized in which users, suppliers, developers, and researchers gave their valuable input based on their experiences. We describe how, based on this experience, the focus areas and recommendations for the LT strategy were established, and which steps are currently being taken to put the strategy into practice.

Smatgrisene at SemEval-2020 Task 12: Offense Detection by AI - with a Pinch of Real I
Peter Juel Henrichsen | Marianne Rathje
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper discusses how ML-based classifiers can be enhanced disproportionately by adding small amounts of qualitative linguistic knowledge. As an example, we present the Danish classifier Smatgrisene, our contribution to the recent OffensEval Challenge 2020. The classifier was trained on 3000 social media posts annotated for offensiveness, supplemented by rules extracted from the reference work on Danish offensive language (Rathje 2014b). Smatgrisene did surprisingly well in the competition in spite of its extremely simple design, showing an interesting trade-off between technological muscle and linguistic intelligence. Finally, we comment on the perspectives in combining qualitative and quantitative methods for NLP.

2019

Garnishing a phonetic dictionary for ASR intake
Iben Nyholm Debess | Sandra Saxov Lamhauge | Peter Juel Henrichsen
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We present a new method for preparing a lexical-phonetic database as a resource for acoustic model training. The research is an offshoot of the ongoing Project Ravnur (Speech Recognition for Faroese), but the method is language-independent. At NODALIDA 2019 we demonstrate the method (called SHARP) online, showing how a traditional lexical-phonetic dictionary (with a very rich phone inventory) is transformed into an ASR-friendly database (with reduced phonetics, preventing data sparseness). The mapping procedure is informed by a corpus of speech transcripts. We conclude with a discussion on the benefits of a well-thought-out BLARK design (Basic Language Resource Kit), making tools like SHARP possible.

2017

TALERUM - Learning Danish by Doing Danish
Peter Juel Henrichsen
Proceedings of the 21st Nordic Conference on Computational Linguistics

2015

Talebob - an Interactive Speech Trainer for Danish
Peter Juel Henrichsen
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Taking the Danish Speech Trainer from CALL to ICALL
Peter Juel Henrichsen
Proceedings of the fourth workshop on NLP for computer-assisted language learning

2012

Sense Meets Nonsense - a dual-layer Danish speech corpus for perception studies
Thomas Ulrich Christiansen | Peter Juel Henrichsen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present the newly established Danish speech corpus PiTu. The corpus consists of recordings of 28 native Danish talkers (14 female and 14 male) each reproducing (i) a series of nonsense syllables, and (ii) a set of authentic natural language sentences. The speech corpus is tailored for investigating the relationship between early stages of the speech perceptual process and later stages. We present the considerations involved in preparing the experimental set-up, producing the anechoic recordings, compiling the data, and exploring the materials in linguistic research. We report on a small pilot experiment demonstrating how PiTu and similar speech corpora can be used in studies of prosody as a function of semantic content. The experiment addresses the issue of whether the governing principles of Danish prosody assignment are mainly talker-specific or mainly content-typical (under the specific experimental conditions). The corpus is available in its entirety for download at http://amtoolbox.sourceforge.net/pitu/.

SMALLWorlds – Multilingual Content-Controlled Monologues
Peter Juel Henrichsen | Marcus Uneson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the speech corpus SMALLWorlds (Spoken Multi-lingual Accounts of Logically Limited Worlds), newly established and still growing. SMALLWorlds contains monologic descriptions of scenes or worlds which are simple enough to be formally describable. The descriptions are instances of content-controlled monologue: semantically "pre-specified" but still bearing most hallmarks of spontaneous speech (hesitations and filled pauses, relaxed syntax, repetitions, self-corrections, incomplete constituents, irrelevant or redundant information, etc.) as well as idiosyncratic speaker traits. In the paper, we discuss the pros and cons of data so elicited. Following that, we present a typical SMALLWorlds task: the description of a simple drawing with differently coloured circles, squares, and triangles, with no hints given as to which description strategy or language style to use. We conclude with an example of how SMALLWorlds may be used: unsupervised lexical learning from phonetic transcription. At the time of writing, SMALLWorlds consists of more than 250 recordings in a wide range of typologically diverse languages from many parts of the world, some unwritten and endangered.

2011

Fishing in a Speech Stream – Angling for a Lexicon
Peter Juel Henrichsen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2007

A Norwegian Letter-to-Sound Engine with Danish as a Catalyst
Peter Juel Henrichsen
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Synthetic regional Danish
Bodil Kyst | Peter Juel Henrichsen
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

DanPO–a transcription-based dictionary for Danish speech technology
Peter Rossen Skadhauge | Peter Juel Henrichsen
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

2002

GraSp: Grammar Learning from Unlabelled Speech Corpora
Peter Juel Henrichsen
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

1998

Peeking Into the Danish Living Room. Internet access to a large speech corpus
Peter Juel Henrichsen
Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998)