Jussi Karlgren


2023

High-dimensional vector spaces can accommodate constructional features quite conveniently
Jussi Karlgren
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)

Current language processing tools presuppose input in the form of a sequence of high-dimensional vectors with continuous values. Lexical items can be converted to such vectors with standard methodology, and subsequent processing is assumed to handle structural features of the string. Constructional features typically do not fit into that processing pipeline: they are not as clearly sequential, they overlap with other items, and the fact that they are combinations of lexical items obscures their ontological status as observable linguistic items in their own right. Constructional grammar frameworks allow for a more general view of how to understand lexical items and their configurations in a common framework. This paper introduces an approach to accommodate that understanding in a vector symbolic architecture, a processing framework which allows for combinations of continuous vectors and discrete items and is convenient for various downstream processing, e.g. neural processing or other tools which expect input in vector form.

2022

Lexical variation in English language podcasts, editorial media, and social media
Jussi Karlgren
Northern European Journal of Language Technology, Volume 8

The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed differences between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by being in the present tense, and with a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to such variation that defines the data set itself and discriminates items within it. In general, the training set for a language model is a non-trivial parameter which is likely to show effects, both expected and unexpected, when the model is applied to data from other sources. The characteristics and provenance of the data used to train a model should be listed on the label as a minimal form of downstream consumer protection.

Challenging the Assumption of Structure-based embeddings in Few- and Zero-shot Knowledge Graph Completion
Filip Cornell | Chenda Zhang | Jussi Karlgren | Sarunas Girdzijauskas
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we report experiments on Few- and Zero-shot Knowledge Graph completion, where the objective is to add missing relational links between entities into an existing Knowledge Graph with few or no previous examples of the relation in question. While previous work has used pre-trained embeddings based on the structure of the graph as input for a neural network, nobody has, to the best of our knowledge, addressed the task by only using textual descriptive data associated with the entities and relations, largely because current standard benchmark data sets lack such information. We therefore enrich the benchmark data sets for these tasks by collecting textual description data to provide a new resource for future research to bridge the gap between structural and textual Knowledge Graph completion. Our results show that we can improve the results for Knowledge Graph completion for both Few- and Zero-shot scenarios, with up to a two-fold increase in all metrics in the Zero-shot setting. From a more general perspective, our experiments demonstrate the value of using textual resources to enrich more formal representations of human knowledge and the utility of transfer learning from textual data and text collections to enrich and maintain knowledge resources.

2020

100,000 Podcasts: A Spoken English Document Corpus
Ann Clifton | Sravana Reddy | Yongze Yu | Aasish Pappu | Rezvaneh Rezapour | Hamed Bonab | Maria Eskevich | Gareth Jones | Jussi Karlgren | Ben Carterette | Rosie Jones
Proceedings of the 28th International Conference on Computational Linguistics

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus open up new avenues for research.

2019

Team Harry Friberg at SemEval-2019 Task 4: Identifying Hyperpartisan News through Editorially Defined Metatopics
Nazanin Afsarmanesh | Jussi Karlgren | Peter Sumbler | Nina Viereckel
Proceedings of the 13th International Workshop on Semantic Evaluation

This report describes the starting point for a simple rule-based hypothesis testing exercise on identifying hyperpartisan news items carried out by the Harry Friberg team from Gavagai. We used manually crafted metatopics, topics which often appear in hyperpartisan texts as rant conduits, together with tonality analysis to identify general characteristics of hyperpartisan news items. The precision of the resulting effort is less than stellar: our contribution ranked 37th of the 42 successfully submitted experiments, with overly high recall (95%) and low precision (54%). Even so, we believe we have a model which allows us to continue exploring the underlying features that characterise the subgenre of hyperpartisan news items.

2016

The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

2015

Inferring the location of authors from words in their texts
Max Berggren | Jussi Karlgren | Robert Östling | Mikael Parkvall
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2013

New Measures to Investigate Term Typology by Distributional Data
Jussi Karlgren
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2011

Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism
Per Almquist | Jussi Karlgren
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

Cross-Lingual Comparison between Distributionally Determined Word Similarity Networks
Olof Görnerup | Jussi Karlgren
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing

Uncertainty Detection as Approximate Max-Margin Sequence Labelling
Oscar Täckström | Sumithra Velupillai | Martin Hassel | Gunnar Eriksson | Hercules Dalianis | Jussi Karlgren
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

2008

Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
Jussi Karlgren | Hercules Dalianis | Bart Jongejan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We have performed a set of experiments to investigate the utility of morphological analysis to improve retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling Ab. The objective of the experiments was to evaluate different lemmatisers and stemmers to determine which would be the most practical for the task at hand: highly interactive, relatively high precision web searches in commercial customer-oriented document collections. This paper gives an overview of some of the results for Finnish and German, and describes specifically one experiment designed to investigate the case distribution of nouns in a highly inflectional language (Finnish) and the topicality of the nouns in target texts. We find that topical nouns taken from queries are distributed differently over relevant and non-relevant documents depending on their grammatical case.

2007

SICS: Valence annotation based on seeds in word space
Magnus Sahlgren | Jussi Karlgren | Gunnar Eriksson
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Creating bilingual lexica using reference wordlists for alignment of monolingual semantic vector spaces
Jon Holmlund | Magnus Sahlgren | Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

Compound terms and their constituent elements in information retrieval
Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

Multilingual interactive experiments with Flickr
Paul D. Clough | Julio Gonzales | Jussi Karlgren
Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources

1998

Assembling a Balanced Corpus from the Internet
Johan Dewe | Jussi Karlgren | Ivan Bretan
Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998)

1994

Clustering Sentences – Making Sense of Synonymous Sentences
Jussi Karlgren | Björn Gambäck | Christer Samuelsson
Proceedings of the 9th Nordic Conference of Computational Linguistics (NODALIDA 1993)

Dilemma - An Instant Lexicographer
Hans Karlgren | Jussi Karlgren | Magnus Nordstrom | Paul Pettersson | Bengt Wahrolen
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

Recognizing Text Genres With Simple Metrics Using Discriminant Analysis
Jussi Karlgren | Douglass Cutting
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993

A Speech to Speech Translation System Built From Standard Components
Manny Rayner | Hiyan Alshawi | Ivan Bretan | David Carter | Vassilios Digalakis | Bjorn Gamback | Jaan Kaja | Jussi Karlgren | Bertil Lyberg | Steve Pulman | Patti Price | Christer Samuelsson
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993