2024
pdf
bib
abs
Cogs in a Machine, Doing What They’re Meant to Do – the AMI Submission to the WMT24 General Translation Task
Atli Jasonarson
|
Hinrik Hafsteinsson
|
Bjarki Ármannsson
|
Steinthór Steingrímsson
Proceedings of the Ninth Conference on Machine Translation
This paper presents the submission of the Arni Magnusson Institute’s team to the WMT24 General translation task. We work on the English→Icelandic translation direction. Our system comprises four translation models and a grammar correction model. For training our systems we carefully curate our datasets, aggressively filtering out sentence pairs that may detrimentally affect the quality of our systems output. Some of our data are collected from human translations and some are synthetically generated. A part of the synthetic data is generated using an LLM, and we find that it increases the translation capability of our system significantly.
pdf
bib
abs
Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names
Bjarki Ármannsson
|
Hinrik Hafsteinsson
|
Atli Jasonarson
|
Steinthor Steingrimsson
Proceedings of the Ninth Conference on Machine Translation
The submission of the Árni Magnússon Institute’s team to the WMT24 test suite subtask focuses on idiomatic expressions and proper names for the English→Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for neural translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readibility. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
2023
pdf
bib
abs
Evaluating a Universal Dependencies Conversion Pipeline for Icelandic
Þórunn Arnardóttir
|
Hinrik Hafsteinsson
|
Atli Jasonarson
|
Anton Ingason
|
Steinþór Steingrímsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of the tool has been carried out as of yet. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts. We obtain an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.
2021
pdf
bib
abs
Towards cross-lingual application of language-specific PoS tagging schemes
Hinrik Hafsteinsson
|
Anton Karl Ingason
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe the process of conversion between the PoS tagging schemes of two languages, the Icelandic MIM-GOLD tagging scheme and the Faroese Sosialurin tagging scheme. These tagging schemes are functionally similar but use separate ways to encode fine-grained morphological information on tokenised text. As Faroese and Icelandic are lexically and grammatically similar, having a systematic method to convert between these two tagging schemes would be beneficial in the field of language technology, specifically in research on transfer learning between the two languages. As a product of our work, we present a provisional version of Icelandic corpora, prepared in the Faroese PoS tagging scheme, ready for use in cross-lingual NLP applications.
pdf
bib
Shared Digital Resource Application within Insular Scandinavian
Hinrik Hafsteinsson
|
Anton Karl Ingason
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
2020
pdf
bib
abs
Developing a Faroese PoS-tagging solution using Icelandic methods
Hinrik Hafsteinsson
|
Anton Karl Ingason
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
We describe the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese, a North Germanic language with about 50,000 speakers. To achieve this, a state-of-the-art neural PoS tagger for Icelandic, ABLTagger, was trained on a 100,000 word PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic corpora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which contains morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS-tagging model for Faroese which achieves a 91.40% overall accuracy when evaluated with 10-fold cross validation, which is currently the highest reported accuracy for a dedicated Faroese PoS-tagger. The tagging model, morphological database, proposed revised PoS tagset for Faroese as well as a revised and standardised PoS tagged corpus are all presented as products of this project and are made available for use in further research in Faroese language technology
pdf
bib
abs
A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank
Þórunn Arnardóttir
|
Hinrik Hafsteinsson
|
Einar Freyr Sigurðsson
|
Kristín Bjarnadóttir
|
Anton Karl Ingason
|
Hildur Jónsdóttir
|
Steinþór Steingrímsson
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.