Veronika Laippala - ACL Anthology

Veronika Laippala

2026

Perplexity as a Metric for Dialectal Distance: A Computational Study of Greek Varieties
Stergios Chatzikyriakidis | Erofili Psaltaki | Dimitrios Papadakis | Erik Henriksson | Veronika Laippala
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper, we use LLM perplexity as a measure to assess Greek dialectal distance. We test seven models on Standard Modern Greek (SMG) and eight dialects, namely Heptanesian, Cypriot, Maniot, Pontic, Northern, Cretan, Tsakonian, and Griko. Using samples of 5k, 15k, and 25k tokens from the GRDD+ corpus for each variety, we find a consistent dialect ranking across models, with Heptanesian closest to SMG, and Griko most distant (perplexity ratio 3.6–14.5× depending on model). These results are largely in agreement with theoretical dialectological knowledge. For example, Tsakonian consistently appears distant in all measures, reflecting its status as the sole Doric descendant, while Heptanesian appears closer by all metrics, pointing to its status as one of the dialects used to shape the official variety. Perplexity correlates strongly with Bits Per Character (mean r = 0.94) and Normalized Compression Distance (mean r = 0.87, range 0.76–0.93), providing support for its use as a dialectometric tool. However, a number of important confounds are also found. First, tokenization effects compress Llama 2’s perplexity range. Second, genre artifacts seem to inflate the results for Cretan. Third, potential training data contamination likely reduces perplexity for Cypriot and Pontic. Lastly, we find that Greek-specific models like Meltemi and Krikri do not consistently outperform general models.

2025

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

Perspectives on Forests and Forestry in Finnish Online Discussions - A Topic Modeling Approach to Suomi24
Telma Peura | Attila Krizsán | Salla-Riikka Kuusalu | Veronika Laippala
Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)

This paper explores how forests and forest industry are perceived on the largest online discussion forum in Finland, Suomi24 (‘Finland24’). Using 30,636 posts published in 2014–2020, we investigate what kind of topics and perspectives towards forest management can be found. We use BERTopic as our topic modeling approach and evaluate the results of its different modular combinations. As the dataset is not labeled, we demonstrate the validity of our best model through illustrating some of the topics about forest use. The results show that a combination of UMAP and K-means leads to the best topic quality. Our exploratory qualitative analysis indicates that the posts reflect polarized discourses between the forest industry and forest conservation adherents.

Analyzing register variation in web texts through automatic segmentation
Erik Henriksson | Saara Hellström | Veronika Laippala
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

2024

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
Amanda Myntti | Liina Repo | Elian Freyermuth | Antti Kanner | Veronika Laippala | Erik Henriksson
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.

Improving Latin Dependency Parsing by Combining Treebanks and Predictions
Hanna-Mari Kristiina Kupari | Erik Henriksson | Veronika Laippala | Jenna Kanerva
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper introduces new models designed to improve the morpho-syntactic parsing of the five largest Latin treebanks in the Universal Dependencies (UD) framework. First, using two state-of-the-art parsers, Trankit and Stanza, along with our custom UD tagger, we train new models on the five treebanks both individually and by combining them into novel merged datasets. We also test the models on the CIRCSE test set. In an additional experiment, we evaluate whether this set can be accurately tagged using the novel LASLA corpus (https://github.com/CIRCSE/LASLA). Second, we aim to improve the results by combining the predictions of different models through an atomic morphological feature voting system. The results of our two main experiments demonstrate significant improvements, particularly for the smaller treebanks, with LAS scores increasing by 16.10 and 11.85%-points for UDante and Perseus, respectively (Gamba and Zeman, 2023a). Additionally, the voting system for morphological features (FEATS) brings improvements, especially for the smaller Latin treebanks: Perseus 3.15% and CIRCSE 2.47%-points. Tagging the CIRCSE set with our custom model using the LASLA model improves POS 6.71 and FEATS 11.04%-points respectively, compared to our best-performing UD PROIEL model. Our results show that larger datasets and ensemble predictions can significantly improve performance.

From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations
Erik Henriksson | Amanda Myntti | Saara Hellström | Selcen Erten-Johansson | Anni Eskelinen | Liina Repo | Veronika Laippala
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

In corpus linguistics, registers–language varieties suited to different contexts–have traditionally been defined by their situations of use, yet recent studies reveal significant situational variation within registers. Previous quantitative studies, however, have been limited to English, leaving this variation in other languages largely unexplored. To address this gap, we apply a quantitative situational analysis to a large multilingual web register corpus, using large language models (LLMs) to annotate texts in English, Finnish, French, Swedish, and Turkish for 23 situational parameters. Using clustering techniques, we identify six situational text types, such as “Advice”, “Opinion” and “Marketing”, each characterized by distinct situational features. We explore the relationship between these text types and traditional register categories, finding partial alignment, though no register maps perfectly onto a single cluster. These results support the quantitative approach to situational analysis and are consistent with earlier findings for English. Cross-linguistic comparisons show that language accounts for only a small part of situational variation within registers, suggesting registers are situationally similar across languages. This study demonstrates the utility of LLMs in multilingual register analysis and deepens our understanding of situational variation within registers.

Building Question-Answer Data Using Web Register Identification
Anni Eskelinen | Amanda Myntti | Erik Henriksson | Sampo Pyysalo | Veronika Laippala
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.

Automated Emotion Annotation of Finnish Parliamentary Speeches Using GPT-4
Otto Tarkka | Jaakko Koljonen | Markus Korhonen | Juuso Laine | Kristian Martiskainen | Kimmo Elo | Veronika Laippala
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

In this paper, we test the efficacy of using GPT-4 to annotate a dataset that is then used to train a BERT classifier for emotion analysis. Manual data annotation is often a laborious and expensive task and emotion annotation, specifically, has proved difficult even for expert annotators. We show that using GPT-4 can produce equally good results as doing data annotation manually while saving a lot of time and money. We train a BERT classifier on our automatically annotated dataset and get results that outperform a BERT classifier that is trained on machine translated data. Our paper shows how Large Language Models can be used to work with and analyse parliamentary corpora.

2023

Toxicity Detection in Finnish Using Machine Translation
Anni Eskelinen | Laura Silvala | Filip Ginter | Sampo Pyysalo | Veronika Laippala
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has mostly focused on English, leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets, a machine translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

2022

Explaining Classes through Stable Word Attributions
Samuel Rönnqvist | Aki-Juhani Kyröläinen | Amanda Myntti | Filip Ginter | Veronika Laippala
Findings of the Association for Computational Linguistics: ACL 2022

Input saliency methods have recently become a popular tool for explaining predictions of deep learning models in NLP. Nevertheless, there has been little work investigating methods for aggregating prediction-level explanations to the class level, nor has a framework for evaluating such class explanations been established. We explore explanations based on XLM-R and the Integrated Gradients input attribution method, and propose 1) the Stable Attribution Class Explanation method (SACX) to extract keyword lists of classes in text classification tasks, and 2) a framework for the systematic evaluation of the keyword lists. We find that explanations of individual predictions are prone to noise, but that stable explanations can be effectively identified through repeated training and explanation. We evaluate on web register data and show that the class explanations are linguistically meaningful and distinguishing of the classes.

Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification: whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remains of boilerplate and other elements not belonging to the main text of the Web page. The register labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.

2021

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo | Valtteri Skantsi | Samuel Rönnqvist | Saara Hellström | Miika Oinonen | Anna Salmela | Douglas Biber | Jesse Egbert | Sampo Pyysalo | Veronika Laippala
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.

Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification
Samuel Rönnqvist | Valtteri Skantsi | Miika Oinonen | Veronika Laippala
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

This article studies register classification of documents from the unrestricted web, such as news articles or opinion blogs, in a multilingual setting, exploring both the benefit of training on multiple languages and the capabilities for zero-shot cross-lingual transfer. While the wide range of linguistic variation found on the web poses challenges for register classification, recent studies have shown that good levels of cross-lingual transfer from the extensive English CORE corpus to other languages can be achieved. In this study, we show that training on multiple languages 1) benefits languages with limited amounts of register-annotated data, 2) on average achieves performance on par with monolingual models, and 3) greatly improves upon previous zero-shot results in Finnish, French and Swedish. The best results are achieved with the multilingual XLM-R model. As data, we use the CORE corpus series featuring register annotated data from the unrestricted web.

2020

A Broad-coverage Corpus for Finnish Named Entity Recognition
Jouni Luoma | Miika Oinonen | Maria Pyykönen | Veronika Laippala | Sampo Pyysalo
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies in total over 10,000 mentions. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges such as the identification of names in blog posts and transcribed speech are also identified. The newly introduced Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus.

From Web Crawl to Clean Register-Annotated Corpora
Veronika Laippala | Samuel Rönnqvist | Saara Hellström | Juhani Luotolahti | Liina Repo | Anna Salmela | Valtteri Skantsi | Sampo Pyysalo
Proceedings of the 12th Web as Corpus Workshop

The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts, and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are promising also in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are relatively similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.

2019

Toward Multilingual Identification of Online Registers
Veronika Laippala | Roosa Kyllönen | Jesse Egbert | Douglas Biber | Sampo Pyysalo
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.

2017

Creating register sub-corpora for the Finnish Internet Parsebank
Veronika Laippala | Juhani Luotolahti | Aki-Juhani Kyröläinen | Tapio Salakoski | Filip Ginter
Proceedings of the 21st Nordic Conference on Computational Linguistics

2015

Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality
Veronika Laippala | Jenna Kanerva | Anna Missilä | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Towards Universal Web Parsebanks
Juhani Luotolahti | Jenna Kanerva | Veronika Laippala | Sampo Pyysalo | Filip Ginter
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

Universal Dependencies for Finnish
Sampo Pyysalo | Jenna Kanerva | Anna Missilä | Veronika Laippala | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2013

Towards a Dependency-Based PropBank of General Finnish
Katri Haverinen | Veronika Laippala | Samuel Kohonen | Anna Missilä | Jenna Nyblom | Stina Ojala | Timo Viljanen | Tapio Salakoski | Filip Ginter
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Building a Large Automatically Parsed Corpus of Finnish
Filip Ginter | Jenna Nyblom | Veronika Laippala | Samuel Kohonen | Katri Haverinen | Simo Vihjanen | Tapio Salakoski
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2010

Dependency-Based PropBanking of Clinical Finnish
Katri Haverinen | Filip Ginter | Timo Viljanen | Veronika Laippala | Tapio Salakoski
Proceedings of the Fourth Linguistic Annotation Workshop

2009

Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers
Katri Haverinen | Filip Ginter | Veronika Laippala | Tapio Salakoski
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2007

On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA
Sampo Pyysalo | Filip Ginter | Veronika Laippala | Katri Haverinen | Juho Heimonen | Tapio Salakoski
Biological, translational, and clinical language processing

Co-authors

Katri Haverinen 5

Jenna Kanerva 5

Samuel Rönnqvist 5

Anni Eskelinen 4

Saara Hellström 4

Valtteri Skantsi 4

Juhani Luotolahti 3

Anna Missilä 3

Miika Oinonen 3

Nikolay Arefyev 2

Marta Bañón 2

Douglas Biber 2

Laurie Burchell 2

Mariia Fedorova 2

Liane Guillou 2

Jindřich Helcl 2

Samuel Kohonen 2

Ville Komulainen 2

Andrey Kutuzov 2

Aki-Juhani Kyröläinen 2

Bhavitvya Malik 2

Farrokh Mehryary 2

Vladislav Mikhailov 2

Niklas Muennighoff 2

Stephan Oepen 2

Dayyán O’Brien 2

Gema Ramírez-Sánchez 2

Pavel Stepachev 2

Jörg Tiedemann 2

Timo Viljanen 2

Jaume Zaragoza-Bernabeu 2

Ona de Gibert 2

Tosin Adewumi 1

Alham Fikri Aji 1

Matthew Blumberg 1

Tien-Tung Bui 1

Li-Hsin Chang 1

Stergios Chatzikyriakidis 1

Liang-Yu Chen 1

Vu Minh Chien 1

Arnav Varma Dantuluri 1

Asma Dhifallah 1

Aleksandr Drozd 1

Selcen Erten-Johansson 1

Elian Freyermuth 1

Felix Friedrich 1

Tommaso Furlanello 1

Larissa Goulart 1

Kshitij Gupta 1

Juho Heimonen 1

Jyrki Heinonen 1

Adalberto Barbosa Junior 1

Marzena Karpinska 1

Mateusz Klimaszewski 1

Jaakko Koljonen 1

Markus Korhonen 1

Henna Kortelainen 1

Attila Krizsán 1

Hanna-Mari Kupari 1

Hanna-Mari Kristiina Kupari 1

Wojciech Kusa 1

Salla-Riikka Kuusalu 1

Roosa Kyllönen 1

Joona Kytöniemi 1

Risto Luukkonen 1

Kristian Martiskainen 1

Virendra Mehta 1

Mikko Merioksa 1

Mayank Mishra 1

Diganta Misra 1

Nour Moustafa-Fahmy 1

Petter Mæhlum 1

Taishi Nakamura 1

Roberto Navigli 1

Dimitrios Papadakis 1

Aleksandra Piktus 1

Deise Prina Dutra 1

Erofili Psaltaki 1

Maria Pyykönen 1

Samuli Sairanen 1

Laura Silvala 1

Jason T. Stillerman 1

Osma Suominen 1

Lintang Sutawika 1

Nouamane Tazi 1

Simone Tedeschi 1

Simo Vihjanen 1

Tereza Vojtěchová 1

Prateek Yadav 1

Terry Yue Zhuo 1
