Arne Jönsson

Also published as: Arne Jonsson

2024

pdf bib abs
Controllable Sentence Simplification in Swedish Using Control Prefixes and Mined Paraphrases
Julius Monsen | Arne Jonsson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Making information accessible to diverse target audiences, including individuals with dyslexia and cognitive disabilities, is crucial. Automatic Text Simplification (ATS) systems aim to facilitate readability and comprehension by reducing linguistic complexity. However, they often lack customizability to specific user needs, and training data for smaller languages can be scarce. This paper addresses ATS in a Swedish context, using methods that provide more control over the simplification. A dataset of Swedish paraphrases is mined from large amounts of text and used to train ATS models utilizing prefix-tuning with control prefixes. We also introduce a novel data-driven method for selecting complexity attributes for controlling the simplification and compare it with previous approaches. Evaluation of the trained models using SARI and BLEU demonstrates significant improvements over the baseline — a fine-tuned Swedish BART model — and compared to previous Swedish ATS results. These findings highlight the effectiveness of employing paraphrase data in conjunction with controllable generation mechanisms for simplification. Additionally, the set of explored attributes yields similar results compared to previously used attributes, indicating their ability to capture important simplification aspects.

2023

pdf bib
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning
David Alfter | Elena Volodina | Thomas François | Arne Jönsson | Evelina Rennes
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

pdf bib abs
Context-aware Swedish Lexical Simplification
Emil Graichen | Arne Jonsson
Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability

We present results from the development and evaluation of context-aware Lexical simplification (LS) systems for the Swedish language. Three versions of LS models, LäsBERT, LäsBERT-baseline, and LäsGPT, were created and evaluated on a newly constructed Swedish LS evaluation dataset. The LS systems demonstrated promising potential in aiding audiences with reading difficulties by providing context-aware word replacements. While there were areas for improvement, particularly in complex word identification, the systems showed agreement with human annotators on word replacements.

pdf bib abs
Who said what? Speaker Identification from Anonymous Minutes of Meetings
Daniel Holmer | Lars Ahrenberg | Julius Monsen | Arne Jönsson | Mikael Apel | Marianna Grimaldi
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We study the performance of machine learning techniques to the problem of identifying speakers at meetings from anonymous minutes issued afterwards. The data comes from board meetings of Sveriges Riksbank (Sweden’s Central Bank). The data is split in two ways, one where each reported contribution to the discussion is treated as a data point, and another where all contributions from a single speaker have been aggregated. Using interpretable models we find that lexical features and topic models generated from speeches held by the board members outside of board meetings are good predictors of speaker identity. Combining topic models with other features gives prediction accuracies close to 80% on aggregated data, though there is still a sizeable gap in performance compared to a not easily interpreted BERT-based transformer model that we offer as a benchmark.

2022

pdf bib abs
Classifying Implant-Bearing Patients via their Medical Histories: a Pre-Study on Swedish EMRs with Semi-Supervised GanBERT
Benjamin Danielsson | Marina Santini | Peter Lundberg | Yosef Al-Abasse | Arne Jonsson | Emma Eneling | Magnus Stridsman
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we compare the performance of two BERT-based text classifiers whose task is to classify patients (more precisely, their medical histories) as having or not having implant(s) in their body. One classifier is a fully-supervised BERT classifier. The other one is a semi-supervised GAN-BERT classifier. Both models are compared against a fully-supervised SVM classifier. Since fully-supervised classification is expensive in terms of data annotation, with the experiments presented in this paper, we investigate whether we can achieve a competitive performance with a semi-supervised classifier based only on a small amount of annotated data. Results are promising and show that the semi-supervised classifier has a competitive performance with the fully-supervised classifier.

pdf bib abs
Evaluating Pre-Trained Language Models for Focused Terminology Extraction from Swedish Medical Records
Oskar Jerdhaf | Marina Santini | Peter Lundberg | Tomas Bjerner | Yosef Al-Abasse | Arne Jonsson | Thomas Vakili
Proceedings of the Workshop on Terminology in the 21st century: many faces, many places

In the experiments briefly presented in this abstract, we compare the performance of a generalist Swedish pre-trained language model with a domain-specific Swedish pre-trained model on the downstream task of focussed terminology extraction of implant terms, which are terms that indicate the presence of implants in the body of patients. The fine-tuning is identical for both models. For the search strategy we rely on KD-Tree that we feed with two different lists of term seeds, one with noise and one without noise. Results shows that the use of a domain-specific pre-trained language model has a positive impact on focussed terminology extraction only when using term seeds without noise.

pdf bib abs
The Swedish Simplification Toolkit: – Designed with Target Audiences in Mind
Evelina Rennes | Marina Santini | Arne Jonsson
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference

In this paper, we present the current version of The Swedish Simplification Toolkit. The toolkit includes computational and empirical tools that have been developed along the years to explore a still neglected area of NLP, namely the simplification of “standard” texts to meet the needs of target audiences. Target audiences, such as people affected by dyslexia, aphasia, autism, but also children and second language learners, require different types of text simplification and adaptation. For example, while individual with aphasia have difficulties in reading compounds (such as arbetsmarknadsdepartement, eng. ministry of employment), second language learners struggle with cultural-specific vocabulary (e.g. konflikträdd, eng. afraid of conflicts). The toolkit allows user to selectively decide the types of simplification that meet the specific needs of the target audience they belong to. The Swedish Simplification Toolkit is one of the first attempts to overcome the one-fits-all approach that is still dominant in Automatic Text Simplification, and proposes a set of computational methods that, used individually or in combination, may help individuals reduce reading (and writing) difficulties.

2021

pdf bib abs
Synonym Replacement based on a Study of Basic-level Nouns in Swedish Texts of Different Complexity
Evelina Rennes | Arne Jönsson
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Basic-level terms have been described as the most important to human categorisation. They are the earliest emerging words in children’s language acquisition, and seem to be more frequently occurring in language in general. In this article, we explored the use of basic-level nouns in texts of different complexity, and hypothesise that hypernyms with characteristics of basic-level words could be useful for the task of lexical simplification. We conducted two corpus studies using four different corpora, two corpora of standard Swedish and two corpora of simple Swedish, and explored whether corpora of simple texts contain a higher proportion of basic-level nouns than corpora of standard Swedish. Based on insights from the corpus studies, we developed a novel algorithm for choosing the best synonym by rewarding high relative frequencies and monolexemity, and restricting the climb in the word hierarchy not to suggest synonyms of a too high level of inclusiveness.

2020

pdf bib abs
Visualizing Facets of Text Complexity across Registers
Marina Santini | Arne Jonsson | Evelina Rennes
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

In this paper, we propose visualizing results of a corpus-based study on text complexity using radar charts. We argue that the added value of this type of visualisation is the polygonal shape that provides an intuitive grasp of text complexity similarities across the registers of a corpus. The results that we visualize come from a study where we explored whether it is possible to automatically single out different facets of text complexity across the registers of a Swedish corpus. To this end, we used factor analysis as applied in Biber’s Multi-Dimensional Analysis framework. The visualization of text complexity facets with radar charts indicates that there is correspondence between linguistic similarity and similarity of shape across registers.

2019

pdf bib abs
Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language
Marina Santini | Benjamin Danielsson | Arne Jönsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We explore the effectiveness of four feature representations – bag-of-words, word embeddings, principal components and autoencoders – for the binary categorization of the easy-to-read variety vs standard language. Standard language refers to the ordinary language variety used by a population as a whole or by a community, while the “easy-to-read” variety is a simpler (or a simplified) version of the standard language. We test the efficiency of these feature representations on three corpora, which differ in size, class balance, unit of analysis, language and topic. We rely on supervised and unsupervised machine learning algorithms. Results show that bag-of-words is a robust and straightforward feature representation for this task and performs well in many experimental settings. Its performance is equivalent or equal to the performance achieved with principal components and autoencorders, whose preprocessing is however more time-consuming. Word embeddings are less accurate than the other feature representations for this classification task.

Data driven approaches to readability analysis for languages other than English has been plagued by a scarcity of suitable corpora. Often, relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, making only binary approaches, such as classification, applicable. We propose a Bayesian, latent variable, approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts with encouraging results. We also assess the model by showing that it performs readability classification on par with a state of the art classifier while at the same being transparent enough to allow more sophisticated interpretations.

pdf bib abs
Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes
Sarah Albertsson | Evelina Rennes | Arne Jönsson
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that including syntactic context to the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.

2015

pdf bib
A Tool for Automatic Simplification of Swedish Texts
Evelina Rennes | Arne Jönsson
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

pdf bib abs
The Impact of Cohesion Errors in Extraction Based Summaries
Evelina Rennes | Arne Jönsson
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present results from an eye tracking study of automatic text summarization. Automatic text summarization is a growing field due to the modern world’s Internet based society, but to automatically create perfect summaries is challenging. One problem is that extraction based summaries often have cohesion errors. By the usage of an eye tracking camera, we have studied the nature of four different types of cohesion errors occurring in extraction based summaries. A total of 23 participants read and rated four different texts and marked the most difficult areas of each text. Statistical analysis of the data revealed that absent cohesion or context and broken anaphoric reference (pronouns) caused some disturbance in reading, but that the impact is restricted to the effort to read rather than the comprehension of the text. However, erroneous anaphoric references (pronouns) were not always detected by the participants which poses a problem for automatic text summarizers. The study also revealed other potential disturbing factors.

pdf bib
Classifying easy-to-read texts without parsing
Johan Falkenjack | Arne Jönsson
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

2013

pdf bib
Features Indicating Readability in Swedish Text
Johan Falkenjack | Katarina Heimann Mühlenbock | Arne Jönsson
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

pdf bib
Iterative Development and Evaluation of a Social Conversational Agent
Annika Silvervarg | Arne Jönsson
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib
A More Cohesive Summarizer
Christian Smith | Henrik Danielsson | Arne Jönsson
Proceedings of COLING 2012: Posters

pdf bib abs
A good space: Lexical predictors in word space evaluation
Christian Smith | Henrik Danielsson | Arne Jönsson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Vector space models benefit from using an outside corpus to train the model. It is, however, unclear what constitutes a good training corpus. We have investigated the effect on summary quality when using various language resources to train a vector space based extraction summarizer. This is done by evaluating the performance of the summarizer utilizing vector spaces built from corpora from different genres, partitioned from the Swedish SUC-corpus. The corpora are also characterized using a variety of lexical measures commonly used in readability studies. The performance of the summarizer is measured by comparing automatically produced summaries to human created gold standard summaries using the ROUGE F-score. Our results show that the genre of the training corpus does not have a significant effect on summary quality. However, evaluating the variance in the F-score between the genres based on lexical measures as independent variables in a linear regression model, shows that vector spaces created from texts with high syntactic complexity, high word variation, short sentences and few long words produce better summaries.

pdf bib abs
This also affects the context - Errors in extraction based summaries
Thomas Kaspersson | Christian Smith | Henrik Danielsson | Arne Jönsson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Although previous studies have shown that errors occur in texts summarized by extraction based summarizers, no study has investigated how common different types of errors are and how that changes with degree of summarization. We have conducted studies of errors in extraction based single document summaries using 30 texts, summarized to 5 different degrees and tagged for errors by human judges. The results show that the most common errors are absent cohesion or context and various types of broken or missing anaphoric references. The amount of errors is dependent on the degree of summarization where some error types have a linear relation to the degree of summarization and others have U-shaped or cut-off linear relations. These results show that the degree of summarization has to be taken into account to minimize the amount of errors by extraction based summarizers.

2011

pdf bib
Automatic summarization as means of simplifying texts, an evaluation for Swedish
Christian Smith | Arne Jönsson
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

pdf bib
Enhancing extraction based summarization with outside word space
Christian Smith | Arne Jönsson
Proceedings of 5th International Joint Conference on Natural Language Processing

2008

pdf bib abs
Using Random Indexing to improve Singular Value Decomposition for Latent Semantic Analysis
Linus Sellberg | Arne Jönsson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present results from using Random indexing for Latent Semantic Analysis to handle Singular Value Decomposition tractability issues. In the paper we compare Latent Semantic Analysis, Random Indexing and Latent Semantic Analysis on Random Indexing reduced matrices. Our results show that Latent Semantic Analysis on Random Indexing reduced matrices provide better results on Precision and Recall than Random Indexing only. Furthermore, computation time for Singular Value Decomposition on a Random indexing reduced matrix is almost halved compared to Latent Semantic Analysis.