2023
The DA-ELEXIS Corpus - a Sense-Annotated Corpus for Danish with Parallel Annotations for Nine European Languages
Bolette Pedersen | Sanni Nimb | Sussi Olsen | Thomas Troelsgård | Ida Flörke | Jonas Jensen | Henrik Lorentzen
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
In this paper, we present the newly compiled DA-ELEXIS Corpus, which is one of the largest sense-annotated corpora available for Danish and the first to be annotated with the Danish wordnet, DanNet. The corpus is part of a European initiative, the ELEXIS project, and has corresponding parallel annotations in nine other European languages. As such, it functions as a cross-lingual evaluation benchmark for a series of low- and medium-resourced European languages. We focus here on the Danish annotation process, i.e. on the annotation scheme, including the annotation guidelines, the primary sense inventory constituted by DanNet, and the fall-back sense inventory, The Danish Dictionary (DDO). We analyse and discuss issues such as out-of-vocabulary (OOV) problems, problems with sense granularity and missing senses (in particular for verbs), and how to semantically tag multiword expressions (MWEs), which prove to occur very frequently in the Danish corpus. Finally, we calculate the inter-annotator agreement (IAA) and show how it improved during the annotation process. The openly available corpus contains 32,524 tokens, with sense annotations given for all content words: 7,322 nouns, 3,099 verbs, 2,626 adjectives, and 1,677 adverbs.
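The abstract does not say which IAA measure is used; a common choice for pairwise sense annotation is Cohen's kappa. The sketch below is a minimal, hypothetical illustration of how agreement between two annotators' sense labels could be computed; the function name and the label format are assumptions for illustration, not taken from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' sense labels for the same tokens.

    labels_a, labels_b: equal-length lists of sense IDs (e.g. DanNet synset IDs).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: share of tokens where both annotators chose the same sense.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each annotator's marginal sense distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[s] / n) * (freq_b[s] / n)
                   for s in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Toy usage: two annotators labelling five tokens; agreement on 4 of 5.
print(cohen_kappa(["s1", "s2", "s1", "s3", "s1"],
                  ["s1", "s2", "s2", "s3", "s1"]))  # 0.6875
```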
Reusing the Danish WordNet for a New Central Word Register for Danish - a Project Report
Bolette Pedersen | Sanni Nimb | Nathalie Sørensen | Sussi Olsen | Ida Flörke | Thomas Troelsgård
Proceedings of the 12th Global Wordnet Conference
In this paper we report on a new Danish lexical initiative, the Central Word Register for Danish (COR), which aims at providing an open-source, well-curated and large-coverage lexicon for AI purposes. The semantic part of the lexicon (COR-S) relies to a large extent on the lexical-semantic information provided in the Danish wordnet, DanNet. However, we have taken the opportunity to evaluate and curate the wordnet information while compiling the new resource. Some information types have been simplified and more systematically curated. This is the case for the hyponymy relations, the ontological typing, and the sense inventory, i.e. the treatment of polysemy, including systematic polysemy.
2022
Compiling a Suitable Level of Sense Granularity in a Lexicon for AI Purposes: The Open Source COR Lexicon
Bolette Pedersen | Nathalie Carmen Hau Sørensen | Sanni Nimb | Ida Flørke | Sussi Olsen | Thomas Troelsgård
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present the Central Word Register for Danish (COR), an open-source lexicon project for general AI purposes funded and initiated by the Danish Agency for Digitisation as part of an AI initiative launched by the Danish Government in 2020. We focus here on the lexical-semantic part of the project (COR-S) and describe how we – based on the existing fine-grained sense inventory from Den Danske Ordbog (DDO) – compile a sense-granularity level of the vocabulary more suitable for AI. A three-step methodology is applied: we establish a set of linguistic principles for defining core senses in COR-S; from there, we generate a hand-crafted gold standard of 6,000 lemmas depicting how to get from the fine-grained DDO senses to the COR inventory; finally, we experiment with a number of language models in order to automate the sense reduction of the rest of the lexicon. The models comprise a rule-based model that applies our linguistic principles as features, a word2vec model using cosine similarity to measure sense proximity, and a deep neural BERT model fine-tuned on our annotations. The rule-based approach shows the best results, in particular on adjectives; however, when focusing on the average polysemous vocabulary, the BERT model shows promising results too.
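As a rough illustration of the word2vec/cosine-similarity step, the sketch below scores pairs of fine-grained DDO senses by the cosine similarity of averaged word vectors over their definition tokens and proposes merge candidates above a threshold. The definition-based sense representation, the token-to-vector mapping, and the 0.8 threshold are all assumptions for illustration; the abstract does not specify the paper's actual setup.

```python
import numpy as np

def sense_vector(definition_tokens, w2v):
    """Represent a sense by the mean word vector of its definition's tokens."""
    vecs = [w2v[t] for t in definition_tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_candidates(senses, w2v, threshold=0.8):
    """Return pairs of sense IDs whose definition vectors are close enough
    that the fine-grained senses might collapse into one core sense."""
    vecs = {sid: sense_vector(toks, w2v) for sid, toks in senses.items()}
    ids = [sid for sid, v in vecs.items() if v is not None]
    return [(a, b)
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
            if cosine(vecs[a], vecs[b]) >= threshold]

# Toy usage with random vectors standing in for a pretrained Danish word2vec model.
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=50) for w in ["stor", "bygning", "hus", "bolig"]}
senses = {"hus_1": ["stor", "bygning"], "hus_2": ["bolig", "bygning"]}
print(merge_candidates(senses, w2v))
```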