Jussi Karlgren

2025

We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs).Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.

pdf bib abs

Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets
Sander Bijl de Vroe | George Stampoulidis | Kai Hakala | Aku Rouhe | Mark van Heeswijk | Jussi Karlgren
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in the field of Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community’s benchmarks exist only in English, test sets are now being machine translated at scale into dozens of languages. This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human translated version. Our findings suggest that since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models compared to their chat-finetuned counterparts.

2023

pdf bib abs

High-dimensional vector spaces can accommodate constructional features quite conveniently
Jussi Karlgren
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)

Current language processing tools presuppose input in the form of a sequence of high-dimensional vectors with continuous values. Lexical items can be converted to such vectors with standard methodology and subsequent processing is assumed to handle structural features of the string. Constructional features do typically not fit in that processing pipeline: they are not as clearly sequential, they overlap with other items, and the fact that they are combinations of lexical items obscures their ontological status as observable linguistic items in their own right. Constructional grammar frameworks allow for a more general view on how to understand lexical items and their configurations in a common framework. This paper introduces an approach to accommodate that understanding in a vector symbolic architecture, a processing framework which allows for combinations of continuous vectors and discrete items, convenient for various downstream processing using e.g. neural processing or other tools which expect input in vector form.

2022

pdf bib abs

Challenging the Assumption of Structure-based embeddings in Few- and Zero-shot Knowledge Graph Completion
Filip Cornell | Chenda Zhang | Jussi Karlgren | Sarunas Girdzijauskas
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we report experiments on Few- and Zero-shot Knowledge Graph completion, where the objective is to add missing relational links between entities into an existing Knowledge Graph with few or no previous examples of the relation in question. While previous work has used pre-trained embeddings based on the structure of the graph as input for a neural network, nobody has, to the best of our knowledge, addressed the task by only using textual descriptive data associated with the entities and relations, much since current standard benchmark data sets lack such information. We therefore enrich the benchmark data sets for these tasks by collecting textual description data to provide a new resource for future research to bridge the gap between structural and textual Knowledge Graph completion. Our results show that we can improve the results for Knowledge Graph completion for both Few- and Zero-shot scenarios with up to a two-fold increase of all metrics in the Zero-shot setting. From a more general perspective, our experiments demonstrate the value of using textual resources to enrich more formal representations of human knowledge and in the utility of transfer learning from textual data and text collections to enrich and maintain knowledge resources.

pdf bib abs

Lexical variation in English language podcasts, editorial media, and social media
Jussi Karlgren
Northern European Journal of Language Technology, Volume 8

The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed difference between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by being in the present tense, and with a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to such variation that defines the data set itself and discriminates items within it. In general, training sets for language models are a non-trivial parameter which are likely to show effects both expected and unexpected when applied to data from other sources and the characteristics and provenance of data used to train a model should be listed on the label as a minimal form of downstream consumer protection.

2020

pdf bib abs

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research.

2019

pdf bib abs

Team Harry Friberg at SemEval-2019 Task 4: Identifying Hyperpartisan News through Editorially Defined Metatopics
Nazanin Afsarmanesh | Jussi Karlgren | Peter Sumbler | Nina Viereckel
Proceedings of the 13th International Workshop on Semantic Evaluation

This report describes the starting point for a simple rule based hypothesis testing excercise on identifying hyperpartisan news items carried out by the Harry Friberg team from Gavagai. We used manually crafted metatopics, topics which often appear in hyperpartisan texts as rant conduits, together with tonality analysis to identify general characteristics of hyperpartisan news items. While the precision of the resulting effort is less than stellar— our contribution ranked 37th of the 42 successfully submitted experiments with overly high recall (95%) and low precision (54%)—we believe we have a model which allows us to continue exploring the underlying features of what the subgenre of hyperpartisan news items is characterised by.

2016

pdf bib abs

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism
Per Almquist | Jussi Karlgren
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

pdf bib

Uncertainty Detection as Approximate Max-Margin Sequence Labelling
Oscar Täckström | Sumithra Velupillai | Martin Hassel | Gunnar Eriksson | Hercules Dalianis | Jussi Karlgren
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

pdf bib

Cross-Lingual Comparison between Distributionally Determined Word Similarity Networks
Olof Görnerup | Jussi Karlgren
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing

2008

pdf bib abs

Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
Jussi Karlgren | Hercules Dalianis | Bart Jongejan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We have performed a set of experiments made to investigate the utility of morphological analysis to improve retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling Ab. The objective of the experiments was to evaluate different lemmatisers and stemmers to determine which would be the most practical for the task at hand: highly interactive, relatively high precision web searches in commercial customer-oriented document collections. This paper gives an overview of some of the results for Finnish and German, and describes specifically one experiment designed to investigate the case distribution of nouns in a highly inflectional language (Finnish) and the topicality of the nouns in target texts. We find that topical nouns taken from queries are distributed differently over relevant and non-relevant documents depending on their grammatical case.