Dorte Haltrup Hansen

Also published as: Dorte H. Hansen, Dorte Haltrup Hansen


2022

pdf bib
The Subject Annotations of the Danish Parliament Corpus (2009-2017) - Evaluated with Automatic Multi-label Classification
Costanza Navarretta | Dorte Haltrup Hansen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper addresses the semi-automatic annotation of subjects, also called policy areas, in the Danish Parliament Corpus (2009-2017) v.2. Recently, the corpus has been made available through the CLARIN-DK repository, the Danish node of the European CLARIN infrastructure. The paper also contains an analysis of the subjects in the corpus, and a description of multi-label classification experiments act to verify the consistency of the subject annotation and the utility of the corpus for training classifiers on this type of data. The analysis of the corpus comprises an investigation of how often the parliament members addressed each subject and the relation between subjects and gender of the speaker. The classification experiments show that classifiers can determine the two co-occurring subjects of the speeches from the agenda titles with a performance similar to that of human annotators. Moreover, a multilayer perceptron achieved an F1-score of 0.68 on the same task when trained on bag of words vectors obtained from the speeches’ lemmas. This is an improvement of more than 0.6 with respect to the baseline, a majority classifier that accounts for the frequency of the classes. The result is promising given the high number of subject combinations (186) and the skewness of the data.

pdf bib
Immigration in the Manifestos and Parliament Speeches of Danish Left and Right Wing Parties between 2009 and 2020
Costanza Navarretta | Dorte Haltrup Hansen | Bart Jongejan
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

The paper presents a study of how seven Danish left and right wing parties addressed immigration in their 2011, 2015 and 2019 manifestos and in their speeches in the Danish Parliament from 2009 to 2020. The annotated manifestos are produced by the Comparative Manifesto Project, while the parliamentary speeches annotated with policy areas (subjects) have been recently released under CLARIN-DK. In the paper, we investigate how often the seven parties addressed immigration in the manifestos and parliamentary debates, and we analyse both datasets after having applied NLP tools to them. A sentiment analysis tool was run on the manifestos and its results were compared with the manifestos’ annotations, while topic modeling was applied to the parliamentary speeches in order to outline central themes in the immigration debates. Many of the resulting topic groups are related to cultural, religious and integration aspects which were heavily debated by politicians and media when discussing immigration policy during the past decade. Our analyses also show differences and similarities between parties and indicate how the 2015 immigrant crisis is reflected in the two types of data. Finally, we discuss advantages and limitations of our quantitative and tool-based analyses.

2020

pdf bib
Identifying Parties in Manifestos and Parliament Speeches
Costanza Navarretta | Dorte Haltrup Hansen
Proceedings of the Second ParlaCLARIN Workshop

This paper addresses differences in the word use of two left-winged and two right-winged Danish parties, and how these differences reflecting some of the basic stances of the parties can be used to automatically identify the party of politicians from their speeches. In the first study, the most frequent and characteristic lemmas in the manifestos of the political parties are analysed. The analysis shows that the most frequently occurring lemmas in the manifestos reflect either the ideology or the position of the parties towards specific subjects, confirming for Danish preceding studies of English and German manifestos. Successively, we scaled our analysis applying machine learning on different language models built on the transcribed speeches by members of the same parties in the Parliament (Hansards) in order to determine to what extent it is possible to predict the party of the politicians from the speeches. The speeches used are a subset of the Danish Parliament corpus 2009–2017. The best models resulted in a weighted F1-score of 0.57. These results are significantly better than the results obtained by the majority classifier (F1-score = 0.11) and by chance results (0.25) and show that building language models over the speeches used by politicians can be used to identify the politicians’ party even if they debate about the same subjects and thus often use the same terminology in many cases. In the future, we will include the subject of the speeches in the prediction experiments

2016

pdf bib
Facilitating Metadata Interoperability in CLARIN-DK
Lene Offersgaard | Dorte Haltrup Hansen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The issue for CLARIN archives at the metadata level is to facilitate the user’s possibility to describe their data, even with their own standard, and at the same time make these metadata meaningful for a variety of users with a variety of resource types, and ensure that the metadata are useful for search across all resources both at the national and at the European level. We see that different people from different research communities fill in the metadata in different ways even though the metadata was defined and documented. This has impacted when the metadata are harvested and displayed in different environments. A loss of information is at stake. In this paper we view the challenges of ensuring metadata interoperability through examples of propagation of metadata values from the CLARIN-DK archive to the VLO. We see that the CLARIN Community in many ways support interoperability, but argue that agreeing upon standards, making clear definitions of the semantics of the metadata and their content is inevitable for the interoperability to work successfully. The key points are clear and freely available definitions, accessible documentation and easily usable facilities and guidelines for the metadata creators.

2014

pdf bib
Using TEI, CMDI and ISOcat in CLARIN-DK
Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the challenges and issues encountered in the conversion of TEI header metadata into the CMDI format. The work is carried out in the Danish research infrastructure, CLARIN-DK, in order to enable the exchange of language resources nationally as well as internationally, in particular with other partners of CLARIN ERIC. The paper describes the task of converting an existing TEI specification applied to all the text resources deposited in DK-CLARIN. During the task we have tried to reuse and share CMDI profiles and components in the CLARIN Component Registry, as well as linking the CMDI components and elements to the relevant data categories in the ISOcat Data Category Registry. The conversion of the existing metadata into the CMDI format turned out not to be a trivial task and the experience and insights gained from this work have resulted in a proposal for a work flow for future use. We also present a core TEI header metadata set.

pdf bib
Encompassing a spectrum of LT users in the CLARIN-DK Infrastructure
Lina Henriksen | Dorte Haltrup Hansen | Bente Maegaard | Bolette Sandford Pedersen | Claus Povlsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

CLARIN-DK is a platform with language resources constituting the Danish part of the European infrastructure CLARIN ERIC. Unlike some other language based infrastructures CLARIN-DK is not solely a repository for upload and storage of data, but also a platform of web services permitting the user to process data in various ways. This involves considerable complications in relation to workflow requirements. The CLARIN-DK interface must guide the user to perform the necessary steps of a workflow; even when the user is inexperienced and perhaps has an unclear conception of the requested results. This paper describes a user driven approach to creating a user interface specification for CLARIN-DK. We indicate how different user profiles determined different crucial interface design options. We also describe some use cases established in order to give illustrative examples of how the platform may facilitate research.

2013

pdf bib
Let’sMT! as a Learning Platform for SMT
Hanne Fersøe | Dorte Haltrup Hansen | Lene Offersgaard | Susi Olsen | Claus Povlsen
Proceedings of Machine Translation Summit XIV: User track

2012

pdf bib
A Distributed Resource Repository for Cloud-Based Machine Translation
Jörg Tiedemann | Dorte Haltrup Hansen | Lene Offersgaard | Sussi Olsen | Matthias Zumpe
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present the architecture of a distributed resource repository developed for collecting training data for building customized statistical machine translation systems. The repository is designed for the cloud-based translation service integrated in the Let'sMT! platform which is about to be launched to the public. The system includes important features such as automatic import and alignment of textual documents in a variety of formats, a flexible database for meta-information using modern key-value stores and a grid-based backend for running off-line processes. The entire system is very modular and supports highly distributed setups to enable a maximum of flexibility and scalability. The system uses secure connections and includes an effective permission management to ensure data integrity. In this paper, we also take a closer look at the task of sentence alignment. The process of alignment is extremely important for the success of translation models trained on the platform. Alignment decisions significantly influence the quality of SMT engines.

pdf bib
Creation of an Open Shared Language Resource Repository in the Nordic and Baltic Countries
Andrejs Vasiļjevs | Markus Forsberg | Tatiana Gornostay | Dorte Haltrup Hansen | Kristín Jóhannsdóttir | Gunn Lyse | Krister Lindén | Lene Offersgaard | Sussi Olsen | Bolette Pedersen | Eiríkur Rögnvaldsson | Inguna Skadiņa | Koenraad De Smedt | Ville Oksanen | Roberts Rozis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

2010

pdf bib
Quality Indicators of LSP Texts — Selection and Measurements Measuring the Terminological Usefulness of Documents for an LSP Corpus
Jakob Halskov | Dorte Haltrup Hansen | Anna Braasch | Sussi Olsen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes and evaluates a prototype quality assurance system for LSP corpora. The system will be employed in compiling a corpus of 11 M tokens for various linguistic and terminological purposes. The system utilizes a number of linguistic features as quality indicators. These represent two dimensions of quality, namely readability/formality (e.g. word length and passive constructions) and density of specialized knowledge (e.g. out-of-vocabulary items). Threshold values for each indicator are induced from a reference corpus of general (fiction, magazines and newspapers) and specialized language (the domains of Health/Medicine and Environment/Climate). In order to test the efficiency of the indicators, a number of terminologically relevant, irrelevant and possibly relevant texts are manually selected from target web sites as candidate texts. By applying the indicators to these candidate texts, the system is able to filter out non-LSP and “poor” LSP texts with a precision of 100% and a recall of 55%. Thus, the experiment described in this paper constitutes fundamental work towards a formulation of ‘best practice’ for implementing quality assurance when selecting appropriate texts for an LSP corpus. The domain independence of the quality indicators still remains to be thoroughly tested on more than just two domains.

2004

pdf bib
Ontological resources and question answering
Roberto Basili | Dorte H. Hansen | Patrizia Paggio | Maria Teresa Pazienza | Fabio Massimo Zanzotto
Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004

pdf bib
“Human Language Technology Elements in a Knowledge Organisation System - The VID Project”
Costanza Navarretta | Bolette Sandford Pedersen | Dorte Haltrup Hansen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.