Krister Lindén

Also published as: Krister Linden


2024

pdf bib
Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification
Tommi Jauhiainen | Jussi Piitulainen | Erik Axelson | Ute Dieckmann | Mietta Lennes | Jyrki Niemi | Jack Rueter | Krister Lindén
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

In this paper, we use automatic language identification to investigate the usage of different languages in the plenary sessions of the Parliament of Finland. Finland has two national languages, Finnish and Swedish. The plenary sessions are published as transcriptions of speeches in Parliament, reflecting the language the speaker used. In addition to charting out language use, we demonstrate how language identification can be used to audit the quality of the dataset. On the one hand, we made slight improvements to our language identifier; on the other hand, we made a list of improvement suggestions for the next version of the dataset.

pdf bib
Improving Language Coverage on HeLI-OTS
Tommi Jauhiainen | Krister Lindén
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

In this paper, we add under-resourced languages into the language repertoire of an existing off-the-shelf language identifier, HeLI-OTS. Adding more languages to a language identifier often comes with the drawback of lessened accuracy for the languages already part of the repertoire. We aim to minimize this effect. As sources for training and development data in the new languages, we use the OpenLID and FLORES-200 datasets. They are openly available high-quality datasets that are especially well-suited for language identifier development. By carefully inspecting the effect of each added language and the quality of their training and development data, we managed to add support for 20 new under-resourced languages to HeLI-OTS without affecting the performance of any existing languages to a noticeable extent.

2023

pdf bib
A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages
Aleksi Sahala | Krister Lindén
Proceedings of the Ancient Language Processing Workshop

We presented a pipeline for POS-tagging and lemmatizing cuneiform languages and evaluated its performance on Sumerian, first millennium Babylonian, Neo-Assyrian and Urartian texts extracted from Oracc. The system achieves a POS-tagging accuracy between 95-98% and a lemmatization accuracy of 94-96% depending on the language or dialect. For OOV words only, the current version can predict correct POS-tags for 83-91%, and lemmata for 68-84% of the input words. Compared with the earlier version, the current one has about 10% higher accuracy in OOV lemmatization and POS-tagging due to better neural network performance. We also tested the system for lemmatizing and POS-tagging the PROIEL Ancient Greek and Latin treebanks, achieving results similar to those with the cuneiform languages.

2022

pdf bib
Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

This article describes the language identification approach used by the SUKI team in the Identification of Languages and Dialects of Italy and the French Cross-Domain Dialect Identification shared tasks organized as part of the VarDial workshop 2022. We describe some experiments and the preprocessing techniques we used for the training data in preparation for the shared task submissions, which are also discussed. Our Naive Bayes-based adaptive system reached the first position in Italian language identification and came second in the French variety identification task.

pdf bib
Optimizing Naive Bayes for Arabic Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

This article describes the language identification system used by the SUKI team in the 2022 Nuanced Arabic Dialect Identification (NADI) shared task. In addition to the system description, we give some details of the dialect identification experiments we conducted while preparing our submissions. In the end, we submitted only one official run. We used a Naive Bayes-based language identifier with character n-grams from one to four, of which we implemented a new version, which automatically optimizes its parameters. We also experimented with clustering the training data according to different topics. With the macro F1 score of 0.1963 on test set A and 0.1058 on test set B, we achieved the 18th position out of the 19 competing teams.

pdf bib
HeLI-OTS, Off-the-shelf Language Identifier for Text
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method. The HeLI-OTS language identifier is equipped with language models for 200 languages and licensed for academic as well as commercial use. We present the HeLI method and its use in our previous research. Then we compare the performance of the HeLI-OTS language identifier with that of fastText on two different data sets, showing that fastText favors the recall of common languages, whereas HeLI-OTS reaches both high recall and high precision for all languages. While introducing existing off-the-shelf language identification tools, we also give a picture of digital humanities-related research that uses such tools. The validity of the results of such research depends on the results given by the language identifier used, and especially for research focusing on the less common languages, the tendency to favor widely used languages might be very detrimental, which Heli-OTS is now able to remedy.

2021

pdf bib
Findings of the VarDial Evaluation Campaign 2021
Bharathi Raja Chakravarthi | Gaman Mihaela | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Ruba Priyadharshini | Christoph Purschke | Eswari Rajagopal | Yves Scherrer | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.

pdf bib
Naive Bayes-based Experiments in Romanian Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This article describes the experiments and systems developed by the SUKI team for the second edition of the Romanian Dialect Identification (RDI) shared task which was organized as part of the 2021 VarDial Evaluation Campaign. We submitted two runs to the shared task and our second submission was the overall best submission by a noticeable margin. Our best submission used a character n-gram based naive Bayes classifier with adaptive language models. We describe our experiments on the development set leading to both submissions.

2020

pdf bib
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf bib
Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
Tommi Jauhiainen | Heidi Jauhiainen | Niko Partanen | Krister Lindén
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.

pdf bib
Experiments in Language Variety Geolocation and Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th workshop on NLP for similar languages, varieties and dialects. The shared tasks we participated in were the second edition of the Romanian Dialect Identification (RDI) and the first edition of the Social Media Variety Geolocation (SMG). The submissions of our SUKI team used generative language models based on Naive Bayes and character n-grams.

pdf bib
Building Web Corpora for Minority Languages
Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén
Proceedings of the 12th Web as Corpus Workshop

Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.

pdf bib
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

pdf bib
Automated Phonological Transcription of Akkadian Cuneiform Text
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian was an East-Semitic language spoken in ancient Mesopotamia. The language is attested on hundreds of thousands of cuneiform clay tablets. Several Akkadian text corpora contain only the transliterated text. In this paper, we investigate automated phonological transcription of the transliterated corpora. The phonological transcription provides a linguistically appealing form to represent Akkadian, because the transcription is normalized according to the grammatical description of a given dialect and explicitly shows the Akkadian renderings for Sumerian logograms. Because cuneiform text does not mark the inflection for logograms, the inflected form needs to be inferred from the sentence context. To the best of our knowledge, this is the first documented attempt to automatically transcribe Akkadian. Using a context-aware neural network model, we are able to automatically transcribe syllabic tokens at near human performance with 96% recall @ 3, while the logogram transcription remains more challenging at 82% recall @ 3.

pdf bib
BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian
Aleksi Sahala | Miikka Silfverberg | Antti Arppe | Krister Lindén
Proceedings of the Twelfth Language Resources and Evaluation Conference

Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage up to 97.3% and recall up to 93.7% on lemmatization and POS-tagging task on token level from a transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.

pdf bib
Akkadian Treebank for early Neo-Assyrian Royal Inscriptions
Mikko Luukko | Aleksi Sahala | Sam Hardwick | Krister Lindén
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

pdf bib
Language and Dialect Identification of Cuneiform Texts
Tommi Jauhiainen | Heidi Jauhiainen | Tero Alstola | Krister Lindén
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide some baseline language identification results using the CLI dataset. To the best of our knowledge, the experiments detailed here represent the first time that automatic language identification methods have been used on cuneiform data.

pdf bib
Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks which were held as part of the third VarDial Evaluation Campaign. The DMT shared task included two separate tracks, one for the simplified Chinese script and one for the traditional Chinese script. We submitted three runs on both tracks of the DMT task as well as on the GDI task. We won the traditional Chinese track using Naive Bayes with language model adaptation, came second on GDI with an adaptive version of the HeLI 2.0 method, and third on the simplified Chinese track using again the adaptive Naive Bayes.

2018

pdf bib
Iterative Language Model Adaptation for Indo-Aryan Language Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

This paper presents the experiments and results obtained by the SUKI team in the Indo-Aryan Language Identification shared task of the VarDial 2018 Evaluation Campaign. The shared task was an open one, but we did not use any corpora other than what was distributed by the organizers. A total of eight teams provided results for this shared task. Our submission using a HeLI-method based language identifier with iterative language model adaptation obtained the best results in the shared task with a macro F1-score of 0.958.

pdf bib
HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

This paper presents the experiments and results obtained by the SUKI team in the Discriminating between Dutch and Flemish in Subtitles shared task of the VarDial 2018 Evaluation Campaign. Our best submission was ranked 8th, obtaining macro F1-score of 0.61. Our best results were produced by a language identifier implementing the HeLI method without any modifications. We describe, in addition to the best method we used, some of the experiments we did with unsupervised clustering.

pdf bib
HeLI-based Experiments in Swiss German Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the experiments and results by the SUKI team in the German Dialect Identification shared task of the VarDial 2018 Evaluation Campaign. Our submission using HeLI with adaptive language models obtained the best results in the shared task with a macro F1-score of 0.686, which is clearly higher than the other submitted results. Without some form of unsupervised adaptation on the test set, it might not be possible to reach as high an F1-score with the level of domain difference between the datasets of the shared task. We describe the methods used in detail, as well as some additional experiments carried out during the shared task.

2017

pdf bib
OCR and post-correction of historical Finnish texts
Senka Drobac | Pekka Kauppinen | Krister Lindén
Proceedings of the 21st Nordic Conference on Computational Linguistics

pdf bib
Evaluation of language identification methods using 285 languages
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen
Proceedings of the 21st Nordic Conference on Computational Linguistics

pdf bib
Evaluating HeLI with Non-Linear Mappings
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

In this paper we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the 4th edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop. Our SUKI team participated on the closed track together with 10 other teams. Our system reached the 7th position in the track. We describe the HeLI method and the non-linear mappings in mathematical notation. The HeLI method uses a probabilistic model with character n-grams and word-based backoff. We also describe our trials using the non-linear mappings instead of relative frequencies and we present statistics about the back-off function of the HeLI method.

2016

pdf bib
Data-Driven Spelling Correction using Weighted Finite-State Methods
Miikka Silfverberg | Pekka Kauppinen | Krister Lindén
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

pdf bib
HeLI, a Word-Based Backoff Method for Language Identification
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised of a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, with 17 teams providing results for the closed track of the test set A. Our system reached the 2nd position in 4 tracks (A closed and open, B1 open and B2 open) and in this paper we are focusing on the methods and data used for those tracks. We describe our word-based backoff method in mathematical notation. We also describe how we selected the corpus we used in the open tracks.

2015

pdf bib
Extracting Semantic Frames using hfst-pmatch
Sam Hardwick | Miikka Silfverberg | Krister Lindén
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf bib
Automated Lossless Hyper-Minimization for Morphological Analyzers
Senka Drobac | Miikka Silfverberg | Krister Lindén
Proceedings of the 12th International Conference on Finite-State Methods and Natural Language Processing 2015 (FSMNLP 2015 Düsseldorf)

pdf bib
Discriminating Similar Languages with Token-Based Backoff
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

2014

pdf bib
HFST-SweNER — A New NER Resource for Swedish
Dimitrios Kokkinakis | Jyrki Niemi | Sam Hardwick | Krister Lindén | Lars Borin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Named entity recognition (NER) is a knowledge-intensive information extraction task that is used for recognizing textual mentions of entities that belong to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging, difficult, yet essential preprocessing technology for many natural language processing applications, and particularly crucial for language understanding. NER has been actively explored in academia and in industry especially during the last years due to the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers).

pdf bib
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

pdf bib
CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
Koenraad De Smedt | Erhard Hinrichs | Detmar Meurers | Inguna Skadiņa | Bolette Pedersen | Costanza Navarretta | Núria Bel | Krister Lindén | Markéta Lopatková | Jan Hajič | Gisle Andersen | Przemyslaw Lenkiewicz
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

pdf bib
Heuristic Hyper-minimization of Finite State Lexicons
Senka Drobac | Krister Lindén | Tommi Pirinen | Miikka Silfverberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Flag diacritics, which are special multi-character symbols executed at runtime, enable optimising finite-state networks by combining identical sub-graphs of its transition graph. Traditionally, the feature has required linguists to devise the optimisations to the graph by hand alongside the morphological description. In this paper, we present a novel method for discovering flag positions in morphological lexicons automatically, based on the morpheme structure implicit in the language description. With this approach, we have gained significant decrease in the size of finite-state networks while maintaining reasonable application speed. The algorithm can be applied to any language description, where the biggest achievements are expected in large and complex morphologies. The most noticeable reduction in size we got with a morphological transducer for Greenlandic, whose original size is on average about 15 times larger than other morphologies. With the presented hyper-minimization method, the transducer is reduced to 10,1% of the original size, with lookup speed decreased only by 9,5%.

pdf bib
Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy
Miikka Silfverberg | Teemu Ruokolainen | Krister Lindén | Mikko Kurimo
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Accelerated Estimation of Conditional Random Fields using a Pseudo-Likelihood-inspired Perceptron Variant
Teemu Ruokolainen | Miikka Silfverberg | Mikko Kurimo | Krister Linden
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf bib
Nordic and Baltic Wordnets Aligned and Compared through “WordTies”
Bolette Sandford Pedersen | Lars Borin | Markus Forsberg | Neeme Kahusk | Krister Lindén | Jyrki Niemi | Niklas Nisbeth | Lars Nygaard | Heili Orav | Eirikur Rögnvaldsson | Mitchell Seaton | Kadri Vider | Kaarlo Voionmaa
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

pdf bib
Baltic and Nordic Parts of the European Linguistic Infrastructure
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Krister Lindén | Gyri Losnegaard | Sussi Olsen | Bolette Sandford Pedersen | Roberts Rozis | Koenraad De Smedt
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

pdf bib
Representing the Translation Relation in a Bilingual Wordnet
Jyrki Niemi | Krister Lindén
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes representing translations in the Finnish wordnet, FinnWordNet (FiWN), and constructing the FiWN database. FiWN was created by translating all the word senses of the Princeton WordNet (PWN) into Finnish and by joining the translations with the semantic and lexical relations of PWN extracted into a relational (database) format. The approach naturally resulted in a translation relation between PWN and FiWN. Unlike many other multilingual wordnets, the translation relation in FiWN is not primarily on the synset level, but on the level of an individual word sense, which allows more precise translation correspondences. This can easily be projected into a synset-level translation relation, used for linking with other wordnets, for example, via Core WordNet. Synset-level translations are also used as a default in the absence of word-sense translations. The FiWN data in the relational database can be converted to other formats. In the PWN database format, translations are attached to source-language words, allowing the implementation of a Web search interface also working as a bilingual dictionary. Another representation encodes the translation relation as a finite-state transducer.

pdf bib
Creation of an Open Shared Language Resource Repository in the Nordic and Baltic Countries
Andrejs Vasiļjevs | Markus Forsberg | Tatiana Gornostay | Dorte Haltrup Hansen | Kristín Jóhannsdóttir | Gunn Lyse | Krister Lindén | Lene Offersgaard | Sussi Olsen | Bolette Pedersen | Eiríkur Rögnvaldsson | Inguna Skadiņa | Koenraad De Smedt | Ville Oksanen | Roberts Rozis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

pdf bib
Specifying Treebanks, Outsourcing Parsebanks: FinnTreeBank 3
Atro Voutilainen | Kristiina Muhonen | Tanja Purtonen | Krister Lindén
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus-based treebank annotation is known to result in incomplete coverage of mid- and low-frequency linguistic constructions: the linguistic representation and corpus annotation quality are sometimes suboptimal. Large descriptive grammars cover also many mid- and low-frequency constructions. We argue for use of large descriptive grammars and their sample sentences as a basis for specifying higher-coverage grammatical representations. We present an sample case from an ongoing project (FIN-CLARIN FinnTreeBank) where an grammatical representation is documented as an annotator's manual alongside manual annotation of sample sentences extracted from a large descriptive grammar of Finnish. We outline the linguistic representation (morphology and dependency syntax) for Finnish, and show how the resulting `Grammar Definition Corpus' and the documentation is used as a task specification for an external subcontractor for building a parser engine for use in morphological and dependency syntactic analysis of large volumes of Finnish for parsebanking purposes. The resulting corpus, FinnTreeBank 3, is due for release in June 2012, and will contain tens of millions of words from publicly available corpora of Finnish with automatic morphological and dependency syntactic analysis, for use in research on the corpus linguistics and language engineering.

2011

pdf bib
META-NORD: Towards Sharing of Language Resources in Nordic and Baltic Countries
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Koenraad De Smedt | Krister Lindén | Eiríkur Rögnvaldsson
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm

pdf bib
Do wordnets also improve human performance on NLP tasks?
Kristiina Muhonen | Krister Lindén
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

pdf bib
Combining Statistical Models for POS Tagging using Finite-State Calculus
Miikka Silfverberg | Krister Lindén
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2009

pdf bib
Weighted Finite-State Morphological Analysis of Finnish Compounding with HFST-LEXC
Krister Lindén | Tommi Pirinen
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

pdf bib
Corpus-based Paradigm Selection for Morphological Entries
Krister Lindén | Jussi Tuovila
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

pdf bib
Conflict Resolution Using Weighted Rules in HFST-TWOLC
Miikka Silfverberg | Krister Lindén
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2004

pdf bib
Discovering Synonyms and Other Related Words
Krister Lindén | Jussi Piitulainen
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology

Search
Co-authors