2021
The Icelandic Word Web: A language technology-focused redesign of a lexicosemantic database
Hjalti Daníelsson | Jón Hilmar Jónsson | Þórður Arnar Árnason | Alec Shaw | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
The new Icelandic Word Web (IW) is a language technology-focused redesign of a lexicosemantic database of semantically related entries. The IW’s entities, relations, metadata and categorization scheme have all been implemented from scratch in two systems, OntoLex and SKOS. After certain adjustments were made to OntoLex and SKOS interoperability, it was also possible to implement specific IW features that, while potentially nonstandard, form an integral part of the Word Web’s lexicosemantic functionality. Also new in this implementation are access to a larger amount of linguistic data, a greater variety of search options, the possibility of automated processing, and the ability to conduct research through SPARQL without possessing a mastery of Icelandic.
2020
Parsing Icelandic Alþingi Transcripts: Parliamentary Speeches as a Genre
Kristján Rúnarsson | Einar Freyr Sigurðsson
Proceedings of the Second ParlaCLARIN Workshop
We introduce a corpus of transcripts from Alþingi, the Icelandic parliament. The corpus is syntactically parsed for phrase structure according to the annotation scheme of the Icelandic Parsed Historical Corpus (IcePaHC). This addition to IcePaHC makes it more diverse with respect to text types and we argue that having a syntactically parsed corpus facilitates research on different types of texts. We furthermore argue that the speech corpus can be treated somewhat like spoken language even though the transcripts differ in various ways from daily spoken language. We also compare this text type to other types and argue that this genre can shed light on their properties. Finally, we exhibit how this addition to IcePaHC has helped us in identifying and solving issues with our parsing scheme.
Language Technology Programme for Icelandic 2019-2023
Anna Nikulásdóttir | Jón Guðnason | Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industries. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview of the history of language technology in Iceland.
A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank
Þórunn Arnardóttir | Hinrik Hafsteinsson | Einar Freyr Sigurðsson | Kristín Bjarnadóttir | Anton Karl Ingason | Hildur Jónsdóttir | Steinþór Steingrímsson
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. We discuss the conversion, the methods used to deliver a fully automated UD corpus, and the complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.
2014
Rapid Deployment of Phrase Structure Parsing for Related Languages: A Case Study of Insular Scandinavian
Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Joel C. Wallenberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents ongoing work that aims to improve machine parsing of Faroese using a combination of Faroese and Icelandic training data. We show that even if we only have a relatively small parsed corpus of one language, namely 53,000 words of Faroese, we can obtain better results by adding information about phrase structure from a closely related language which has a similar syntax. Our experiment uses the Berkeley parser. We demonstrate that the addition of Icelandic data, without any other modification to the experimental setup, results in an F-measure improvement from 75.44% to 78.05% in Faroese and an improvement in part-of-speech tagging accuracy from 88.86% to 90.40%.
2012
The Icelandic Parsed Historical Corpus (IcePaHC)
Eiríkur Rögnvaldsson | Anton Karl Ingason | Einar Freyr Sigurðsson | Joel Wallenberg
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we chose to use a phrase structure Penn-style annotation scheme and briefly describe the syntactic annotation process. We also describe a spin-off project which is only in its beginning stages: a parsed historical corpus of Faroese. Finally, we advocate the importance of an open source policy as regards language resources.