Jenna Kanerva

2026

Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato | Veera Schroderus | Jenna Kanerva | Jenni Kauppi | Virpi Lummaa | Filip Ginter
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

We study how to better use digitized historical archives to answer sociological and historical questions that require more context than raw text mentions provide. Using Finnish World War II Karelian evacuee family interviews, we build on prior extraction of 350K mentions of leisure activities and organizational memberships (71K unique names) that are too diverse and unstructured to analyze directly. We introduce a categorization framework capturing key dimensions of participation: type of activity/organization, typical sociality, regularity, and the level of physical demand. After creating a gold-standard annotated set, we evaluate whether large language models can apply the schema at scale and find that an open-weight LLM, combined with simple multi-run voting, closely matches expert judgments. We then label all 350K entities to produce a structured resource for downstream analyses of social integration and related outcomes.

2025

pdf bib abs

A Deep Dive into Multi-Head Attention and Multi-Aspect Embedding
Maryam Teimouri | Jenna Kanerva | Filip Ginter
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Multi-vector embedding models play an increasingly important role in retrieval-augmented generation, yet their internal behaviour lacks comprehensive analysis. We conduct a systematic, head-level study of the 32-head Semantic Feature Representation (SFR) encoder with the FineWeb corpus containing 10 billion tokens. For a set of 4,000 web documents, we pair head-specific embeddings with GPT-4o topic annotations and analyse the results using t-SNE visualisations, heat maps, and a 32-way logistic probe. The analysis shows that (i) clear semantic separation between heads emerges only at an intermediate layer, (ii) some heads align with specific topics while others capture broader corpus features, and (iii) naive pooling of head outputs can blur these distinctions, leading to frequent topic mismatches. The study offers practical guidance on where to extract embeddings, which heads may be pruned, and how to aggregate them to support more transparent and controllable retrieval pipelines.

pdf bib abs

OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
Jenna Kanerva | Cassandra Ledins | Siiri Käpyaho | Filip Ginter
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.

2024

pdf bib abs

Improving Latin Dependency Parsing by Combining Treebanks and Predictions
Hanna-Mari Kristiina Kupari | Erik Henriksson | Veronika Laippala | Jenna Kanerva
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper introduces new models designed to improve the morpho-syntactic parsing of the five largest Latin treebanks in the Universal Dependencies (UD) framework. First, using two state-of-the-art parsers, Trankit and Stanza, along with our custom UD tagger, we train new models on the five treebanks both individually and by combining them into novel merged datasets. We also test the models on the CIRCSE test set. In an additional experiment, we evaluate whether this set can be accurately tagged using the novel LASLA corpus (https://github.com/CIRCSE/LASLA). Second, we aim to improve the results by combining the predictions of different models through an atomic morphological feature voting system. The results of our two main experiments demonstrate significant improvements, particularly for the smaller treebanks, with LAS scores increasing by 16.10 and 11.85%-points for UDante and Perseus, respectively (Gamba and Zeman, 2023a). Additionally, the voting system for morphological features (FEATS) brings improvements, especially for the smaller Latin treebanks: Perseus 3.15% and CIRCSE 2.47%-points. Tagging the CIRCSE set with our custom model using the LASLA model improves POS 6.71 and FEATS 11.04%-points respectively, compared to our best-performing UD PROIEL model. Our results show that larger datasets and ensemble predictions can significantly improve performance.

2023

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

2022

pdf bib abs

Towards Automatic Short Answer Assessment for Finnish as a Paraphrase Retrieval Task
Li-Hsin Chang | Jenna Kanerva | Filip Ginter
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

Automatic grouping of textual answers has the potential of allowing batch grading, but is challenging because the answers, especially longer essays, have many claims. To explore the feasibility of grouping together answers based on their semantic meaning, this paper investigates the grouping of short textual answers, proxies of single claims. This is approached as a paraphrase identification task, where neural and non-neural sentence embeddings and a paraphrase identification model are tested. These methods are evaluated on a dataset consisting of over 4000 short textual answers from various disciplines. The results map out the suitable question types for the paraphrase identification model and those for the neural and non-neural methods.

pdf bib abs

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

pdf bib abs

Out-of-Domain Evaluation of Finnish Dependency Parsing
Jenna Kanerva | Filip Ginter
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The prevailing practice in the academia is to evaluate the model performance on in-domain evaluation data typically set aside from the training corpus. However, in many real world applications the data on which the model is applied may very substantially differ from the characteristics of the training data. In this paper, we focus on Finnish out-of-domain parsing by introducing a novel UD Finnish-OOD out-of-domain treebank including five very distinct data sources (web documents, clinical, online discussions, tweets, and poetry), and a total of 19,382 syntactic words in 2,122 sentences released under the Universal Dependencies framework. Together with the new treebank, we present extensive out-of-domain parsing evaluation utilizing the available section-level information from three different Finnish UD treebanks (TDT, PUD, OOD). Compared to the previously existing treebanks, the new Finnish-OOD is shown include sections more challenging for the general parser, creating an interesting evaluation setting and yielding valuable information for those applying the parser outside of its training domain.

2021

pdf bib

Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases
Li-Hsin Chang | Sampo Pyysalo | Jenna Kanerva | Filip Ginter
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

pdf bib abs

WikiBERT Models: Deep Transfer Learning for Many Languages
Sampo Pyysalo | Jenna Kanerva | Antti Virtanen | Filip Ginter
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Deep neural language models such as BERT have enabled substantial recent advances in many natural language processing tasks. However, due to the effort and computational cost involved in their pre-training, such models are typically introduced only for a small number of high-resource languages such as English. While multilingual models covering large numbers of languages are available, recent work suggests monolingual training can produce better models, and our understanding of the tradeoffs between mono- and multilingual training is incomplete. In this paper, we introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data and introduce 42 new such models, most for languages up to now lacking dedicated deep neural language models. We assess the merits of these models using cloze tests and the state-of-the-art UDify parser on Universal Dependencies data, contrasting performance with results using the multilingual BERT (mBERT) model. We find that the newly introduced WikiBERT models outperform mBERT in cloze tests for nearly all languages, and that UDify using WikiBERT models outperforms the parser using mBERT on average, with the language-specific models showing substantially improved performance for some languages, yet limited improvement or a decrease in performance for others. All of the methods and models introduced in this work are available under open licenses from https://github.com/turkunlp/wikibert.

pdf bib abs

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility in high quality paraphrase selection in terms of both cost and quality.

2020

pdf bib abs

Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task
Jenna Kanerva | Filip Ginter | Sampo Pyysalo
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present the approach of the TurkuNLP group to the IWPT 2020 shared task on Multilingual Parsing into Enhanced Universal Dependencies. The task involves 28 treebanks in 17 different languages and requires parsers to generate graph structures extending on the basic dependency trees. Our approach combines language-specific BERT models, the UDify parser, neural sequence-to-sequence lemmatization and a graph transformation approach encoding the enhanced structure into a dependency tree. Our submission averaged 84.5% ELAS, ranking first in the shared task. We make all methods and resources developed for this study freely available under open licenses from https://turkunlp.org.

pdf bib abs

This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns.

2019

pdf bib abs

Neural Dependency Parsing of Biomedical Text: TurkuNLP entry in the CRAFT Structural Annotation Task
Thang Minh Ngo | Jenna Kanerva | Filip Ginter | Sampo Pyysalo
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

We present the approach taken by the TurkuNLP group in the CRAFT Structural Annotation task, a shared task on dependency parsing. Our approach builds primarily on the Turku neural parser, a native dependency parser that ranked among the best in the recent CoNLL tasks on parsing Universal Dependencies. To adapt the parser to the biomedical domain, we considered and evaluated a number of approaches, including the generation of custom word embeddings, combination with other in-domain resources, and the incorporation of information from named entity recognition. We achieved a labeled attachment score of 89.7%, the best result among task participants.

pdf bib abs

Template-free Data-to-Text Generation of Finnish Sports News
Jenna Kanerva | Samuel Rönnqvist | Riina Kekki | Tapio Salakoski | Filip Ginter
Proceedings of the 22nd Nordic Conference on Computational Linguistics

News articles such as sports game reports are often thought to closely follow the underlying game statistics, but in practice they contain a notable amount of background knowledge, interpretation, insight into the game, and quotes that are not present in the official statistics. This poses a challenge for automated data-to-text news generation with real-world news corpora as training data. We report on the development of a corpus of Finnish ice hockey news, edited to be suitable for training of end-to-end news generation methods, as well as demonstrate generation of text, which was judged by journalists to be relatively close to a viable product. The new dataset and system source code are available for research purposes.

pdf bib abs

Is Multilingual BERT Fluent in Language Generation?
Samuel Rönnqvist | Jenna Kanerva | Tapio Salakoski | Filip Ginter
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

The multilingual BERT model is trained on 104 languages and meant to serve as a universal language model and tool for encoding sentences. We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a natural language generation task testing for the ability to produce coherent text fitting a given context. We find that the currently available multilingual BERT model is clearly inferior to the monolingual counterparts, and cannot in many cases serve as a substitute for a well-trained monolingual model. We find that the English and German models perform well at generation, whereas the multilingual model is lacking, in particular, for Nordic languages. The code of the experiments in the paper is available at: https://github.com/TurkuNLP/bert-eval

2018

pdf bib abs

Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task
Jenna Kanerva | Filip Ginter | Niko Miekka | Akseli Leino | Tapio Salakoski
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper we describe the TurkuNLP entry at the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. Compared to the last year, this year the shared task includes two new main metrics to measure the morphological tagging and lemmatization accuracies in addition to syntactic trees. Basing our motivation into these new metrics, we developed an end-to-end parsing pipeline especially focusing on developing a novel and state-of-the-art component for lemmatization. Our system reached the highest aggregate ranking on three main metrics out of 26 teams by achieving 1st place on metric involving lemmatization, and 2nd on both morphological tagging and parsing.

pdf bib

Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions
Kira Droganova | Daniel Zeman | Jenna Kanerva | Filip Ginter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs

Mind the Gap: Data Enrichment in Dependency Parsing of Elliptical Constructions
Kira Droganova | Filip Ginter | Jenna Kanerva | Daniel Zeman
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

In this paper, we focus on parsing rare and non-trivial constructions, in particular ellipsis. We report on several experiments in enrichment of training data for this specific construction, evaluated on five languages: Czech, English, Finnish, Russian and Slovak. These data enrichment methods draw upon self-training and tri-training, combined with a stratified sampling method mimicking the structural complexity of the original treebank. In addition, using these same methods, we also demonstrate small improvements over the CoNLL-17 parsing shared task winning system for four of the five languages, not only restricted to the elliptical constructions.

pdf bib abs

We evaluate two cross-lingual techniques for adding enhanced dependencies to existing treebanks in Universal Dependencies. We apply a rule-based system developed for English and a data-driven system trained on Finnish to Swedish and Italian. We find that both systems are accurate enough to bootstrap enhanced dependencies in existing UD treebanks. In the case of Italian, results are even on par with those of a prototype language-specific system.

2017

pdf bib abs

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib abs

TurkuNLP: Delexicalized Pre-training of Word Embeddings for Dependency Parsing
Jenna Kanerva | Juhani Luotolahti | Filip Ginter
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the TurkuNLP entry in the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. The system is based on the UDPipe parser with our focus being in exploring various techniques to pre-train the word embeddings used by the parser in order to improve its performance especially on languages with small training sets. The system ranked 11th among the 33 participants overall, being 8th on the small treebanks, 10th on the large treebanks, 12th on the parallel test sets, and 26th on the surprise languages.

pdf bib

Dep_search: Efficient Search Tool for Large Dependency Parsebanks
Juhani Luotolahti | Jenna Kanerva | Filip Ginter
Proceedings of the 21st Nordic Conference on Computational Linguistics

pdf bib abs

Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks v2.0
Juhani Luotolahti | Jenna Kanerva | Filip Ginter
Proceedings of the Third Workshop on Discourse in Machine Translation

In this paper we present our system in the DiscoMT 2017 Shared Task on Crosslingual Pronoun Prediction. Our entry builds on our last year’s success, our system based on deep recurrent neural networks outperformed all the other systems with a clear margin. This year we investigate whether different pre-trained word embeddings can be used to improve the neural systems, and whether the recently published Gated Convolutions outperform the Gated Recurrent Units used last year.

pdf bib

Fully Delexicalized Contexts for Syntax-Based Word Embeddings
Jenna Kanerva | Sampo Pyysalo | Filip Ginter
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib

Assessing the Annotation Consistency of the Universal Dependencies Corpora
Marie-Catherine de Marneffe | Matias Grioni | Jenna Kanerva | Filip Ginter
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Jenna Kanerva

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

Co-authors

Venues