Robert Östling - ACL Anthology

Robert Östling

2025

LLM-based post-editing as reference-free GEC evaluation
Robert Östling | Murathan Kurfalı | Andrew Caines
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Evaluation of Grammatical Error Correction (GEC) systems is becoming increasingly challenging as the quality of such systems increases and traditional automatic metrics fail to adequately capture such nuances as fluency versus minimal edits, alternative valid corrections compared to the ‘ground truth’, and the difference between corrections that are useful in a language learning scenario versus those preferred by native readers. Previous work has suggested using human post-editing of GEC system outputs, but this is very labor-intensive. We investigate the use of Large Language Models (LLMs) as post-editors of English and Swedish texts, and perform a meta-analysis of a range of different evaluation setups using a set of recent GEC systems. We find that for the two languages studied in our work, automatic evaluation based on post-editing agrees well with both human post-editing and direct human rating of GEC systems. Furthermore, we find that a simple n-gram overlap metric is sufficient to measure post-editing distance, and that including human references when prompting the LLMs generally does not improve agreement with human ratings. The resulting evaluation metric is reference-free and requires no language-specific training or additional resources beyond an LLM capable of handling the given language.

Conflicting Needles in a Haystack: How LLMs behave when faced with contradictory information
Murathan Kurfalı | Robert Östling
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated an impressive ability to retrieve and summarize complex information, but their reliability in conflicting contexts remains poorly understood. We introduce an adversarial extension of the Needle-in-a-Haystack framework in which three mutually exclusive “needles” are embedded within long documents. By systematically manipulating factors such as position, repetition, layout, and domain relevance, we evaluate how LLMs handle contradictions. We find that models almost always fail to signal uncertainty and instead confidently select a single answer, exhibiting strong and consistent biases toward repetition, recency, and particular surface forms. We further analyze whether these patterns persist across model families and sizes, and we evaluate both probability-based and generation-based retrieval. Our framework highlights critical limitations in the robustness of current LLMs, including commercial systems, to contradiction. These limitations reveal potential shortcomings in RAG systems’ ability to handle noisy or manipulated inputs and expose risks for deployment in high-stakes applications.

Prompting the Past: Exploring Zero-Shot Learning for Named Entity Recognition in Historical Texts Using Prompt-Answering LLMs
Crina Tudor | Beata Megyesi | Robert Östling
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

This paper investigates the application of prompt-answering Large Language Models (LLMs) for the task of Named Entity Recognition (NER) in historical texts. Historical NER presents unique challenges due to language change through time, spelling variation, limited availability of digitized data (and, in particular, labeled data), and errors introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) processes. Leveraging the zero-shot capabilities of prompt-answering LLMs, we address these challenges by prompting the model to extract entities such as persons, locations, organizations, and dates from historical documents. We then conduct an extensive error analysis of the model output in order to identify and address potential weaknesses in the entity recognition process. The results show that, while such models display an ability to extract named entities, their overall performance is lackluster. Our analysis reveals that model performance is significantly affected by hallucinations in the model output, as well as by challenges imposed by the evaluation of NER output.

The MultiGEC-2025 Shared Task on Multilingual Grammatical Error Correction at NLP4CALL
Arianna Masciolini | Andrew Caines | Orphée De Clercq | Joni Kruijsbergen | Murathan Kurfalı | Ricardo Muñoz Sánchez | Elena Volodina | Robert Östling
Proceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning

A Cross-Lingual Perspective on Neural Machine Translation Difficulty
Esther Ploeger | Johannes Bjerva | Jörg Tiedemann | Robert Östling
Proceedings of the Tenth Conference on Machine Translation

Intuitively, machine translation (MT) between closely related languages, such as Swedish and Danish, is easier than MT between more distant pairs, such as Finnish and Danish. Yet, the notions of ‘closely related’ languages and ‘easier’ translation have so far remained underspecified. Moreover, in the context of neural MT, this assumption was almost exclusively evaluated in scenarios where English was either the source or target language, leaving a broader cross-lingual view unexplored. In this work, we present a controlled study of language similarity and neural MT difficulty for 56 European translation directions. We test a range of language similarity metrics, some of which are reasonable predictors of MT difficulty. At the text level, we reassess previously introduced indicators of MT difficulty, and find that they are not well-suited to our domain, or neural MT more generally. Ultimately, we hope that this work inspires further cross-lingual investigations of neural MT difficulty.

2024

Evaluation of Really Good Grammatical Error Correction
Robert Östling | Katarina Gillholm | Murathan Kurfalı | Marie Mattson | Mats Wirén
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Traditional evaluation methods for Grammatical Error Correction (GEC) fail to capture the full range of system capabilities and objectives. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only about 0.1% of its training data. We also find that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.

2023

Language Embeddings Sometimes Contain Typological Generalizations
Robert Östling | Murathan Kurfalı
Computational Linguistics, Volume 49, Issue 4 - December 2023

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (1) multiple sets of language representations, (2) multilingual word embeddings, (3) projected and predicted syntactic and morphological features, (4) software to provide linguistically sound evaluations of language representations.

A distantly supervised Grammatical Error Detection/Correction system for Swedish
Murathan Kurfalı | Robert Östling
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

2021

Probing Multilingual Language Models for Discourse
Murathan Kurfalı | Robert Östling
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Pre-trained multilingual language models have become an important building block in multilingual Natural Language Processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level knowledge across languages. This is done with a systematic evaluation on a broader set of discourse-level tasks than has previously been assembled. We find that the XLM-RoBERTa family of models consistently shows the best performance, by simultaneously being good monolingual models and degrading relatively little in a zero-shot setting. Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations, while language dissimilarity at most has a modest effect. We hope that our test suite, covering 5 tasks with a total of 22 languages in 10 distinct families, will serve as a useful evaluation platform for multilingual performance at and beyond the sentence level.

Let’s be explicit about that: Distant supervision for implicit discourse relation classification via connective prediction
Murathan Kurfalı | Robert Östling
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

In implicit discourse relation classification, we want to predict the relation between adjacent sentences in the absence of any overt discourse connectives. This is challenging even for humans, leading to a shortage of annotated data, a fact that makes the task even more difficult for supervised machine learning approaches. In the current study, we perform implicit discourse relation classification without relying on any labeled implicit relation. We sidestep the lack of data through explicitation of implicit relations to reduce the task to two sub-problems: language modeling and explicit discourse relation classification, a much easier problem. Our experimental results show that this method can even marginally outperform the state-of-the-art, in spite of being much simpler than alternative models of comparable performance. Moreover, we show that the achieved performance is robust across domains as suggested by the zero-shot experiments on a completely different domain. This indicates that recent advances in language modeling have made language models sufficiently good at capturing inter-sentence relations without the help of explicit discourse markers.

2020

A Sentiment-annotated Dataset of English Causal Connectives
Marta Andersson | Murathan Kurfalı | Robert Östling
Proceedings of the 14th Linguistic Annotation Workshop

This paper investigates the semantic prosody of three causal connectives: due to, owing to and because of in seven varieties of the English language. While research in the domain of English causality exists, we are not aware of studies that would cover the domain of causal connectives in English. Our claim is that connectives such as because of link two arguments, (at least) one of which will include a phrase that contributes to the interpretation of the relation as positive or negative, and hence define the prosody of the connective used. As our results demonstrate, the majority of the prosodies identified are negative for all three connectives; the proportions are stable across the varieties of English studied, and contrary to our expectations, we find no significant differences between the functions of the connectives and discourse preferences. Further, we investigate whether automatizing the sentiment annotation procedure via a simple language-model based classifier is possible. The initial results highlight the complexity of the task and the need for complicated systems, probably aided by other related datasets, to achieve reasonable performance.

A Multi-word Expression Dataset for Swedish
Murathan Kurfalı | Robert Östling | Johan Sjons | Mats Wirén
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new set of 96 Swedish multi-word expressions annotated with degree of (non-)compositionality. In contrast to most previous compositionality datasets we also consider syntactically complex constructions and publish a formal specification of each expression. This allows evaluation of computational models beyond word bigrams, which have so far been the norm. Finally, we use the annotations to evaluate a system for automatic compositionality estimation based on distributional semantics. Our analysis of the disagreements between human annotators and the distributional model reveals interesting questions related to the perception of compositionality, and should be informative to future work in the area.

Disambiguation of Potentially Idiomatic Expressions with Contextual Embeddings
Murathan Kurfalı | Robert Östling
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

The majority of multiword expressions can be interpreted figuratively or literally in different contexts, which poses challenges in a number of downstream tasks. Most previous work deals with this ambiguity following the observation that MWEs with different usages occur in distinctly different contexts. Following this insight, we explore the usefulness of contextual embeddings by means of both supervised and unsupervised classification. The results show that in the supervised setting, the state-of-the-art can be substantially improved for all expressions in the experiments. The unsupervised classification, similarly, yields very impressive results, comparing favorably to the supervised classifier for the majority of the expressions. We also show that multilingual contextual embeddings can be employed for this task without leading to any significant loss in performance; hence, the proposed methodology has the potential to be extended to a number of languages.

2019

What Do Language Representations Really Represent?
Johannes Bjerva | Robert Östling | Maria Han Veiga | Jörg Tiedemann | Isabelle Augenstein
Computational Linguistics, Volume 45, Issue 2 - June 2019

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, whereas genetic relationships, a convenient benchmark used for evaluation in previous work, appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

Noisy Parallel Corpus Filtering through Projected Word Embeddings
Murathan Kurfalı | Robert Östling
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spite of its simplicity, we approach the strong baseline system in the downstream machine translation evaluation.

Zero-shot transfer for implicit discourse relation classification
Murathan Kurfalı | Robert Östling
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

Automatically classifying the relation between sentences in a discourse is a challenging task, in particular when there is no overt expression of the relation. It is made even more challenging by the fact that annotated training data exists only for a small number of languages, such as English and Chinese. We present a new system using zero-shot transfer learning for implicit discourse relation classification, where the only resource used for the target language is unannotated parallel text. This system is evaluated on the discourse-annotated TED-MDB parallel corpus, where it obtains good results for all seven languages using only English training data.

2018

Identifying Speakers and Addressees in Dialogues Extracted from Literary Fiction
Adam Ek | Mats Wirén | Robert Östling | Kristina N. Björkenstam | Gintarė Grigonytė | Sofia Gustafson Capková
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Continuous multilinguality with language vectors
Robert Östling | Jörg Tiedemann
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Most existing models for multilingual natural language processing (NLP) treat language as a discrete category, and make predictions for either one language or the other. In contrast, we propose using continuous vector representations of language. We show that these can be learned efficiently with a character-based neural language model, and used to improve inference about language varieties not seen during training. In experiments with 1303 Bible translations into 990 different languages, we empirically explore the capacity of multilingual language models, and also show that the language vectors capture genetic relationships between languages.

SU-RUG at the CoNLL-SIGMORPHON 2017 shared task: Morphological Inflection with Attentional Sequence-to-Sequence Models
Robert Östling | Johannes Bjerva
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

ResSim at SemEval-2017 Task 1: Multilingual Word Representations for Semantic Textual Similarity
Johannes Bjerva | Robert Östling
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Shared Task 1 at SemEval-2017 deals with assessing the semantic similarity between sentences, either in the same or in different languages. In our system submission, we employ multilingual word representations, in which similar words in different languages are close to one another. Using such representations is advantageous, since the increasing amount of available parallel data allows for the application of such methods to many of the languages in the world. Hence, semantic similarity can be inferred even for languages for which no annotated data exists. Our system is trained and evaluated on all language pairs included in the shared task (English, Spanish, Arabic, and Turkish). Although development results are promising, our system does not yield high performance on the shared task test sets.

Cross-lingual Learning of Semantic Textual Similarity with Multilingual Word Representations
Johannes Bjerva | Robert Östling
Proceedings of the 21st Nordic Conference on Computational Linguistics

Iconic Locations in Swedish Sign Language: Mapping Form to Meaning with Lexical Databases
Carl Börstell | Robert Östling
Proceedings of the 21st Nordic Conference on Computational Linguistics

Universal Dependencies for Swedish Sign Language
Robert Östling | Carl Börstell | Moa Gärdenfors | Mats Wirén
Proceedings of the 21st Nordic Conference on Computational Linguistics

The Helsinki Neural Machine Translation System
Robert Östling | Yves Scherrer | Jörg Tiedemann | Gongbo Tang | Tommi Nieminen
Proceedings of the Second Conference on Machine Translation

Neural Networks and Spelling Features for Native Language Identification
Johannes Bjerva | Gintarė Grigonytė | Robert Östling | Barbara Plank
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We present the RUG-SU team’s submission at the Native Language Identification Shared Task 2017. We combine several approaches into an ensemble, based on spelling error features, a simple neural network using word representations, a deep residual network using word and character features, and a system based on a recurrent neural network. Our best system is an ensemble of neural networks, reaching an F1 score of 0.8323. Although our system is not the highest ranking one, we do outperform the baseline by far.

Transparent text quality assessment with convolutional neural networks
Robert Östling | Gintarė Grigonytė
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We present a very simple model for text quality assessment based on a deep convolutional neural network, where the only supervision required is one corpus of user-generated text of varying quality, and one contrasting text corpus of consistently high quality. Our model is able to provide local quality assessments in different parts of a text, which allows visual feedback about where potentially problematic parts of the text are located, as well as a way to evaluate which textual features are captured by our model. We evaluate our method on two corpora: a large corpus of manually graded student essays and a longitudinal corpus of language learner written production, and find that the text quality metric learned by our model is a fairly strong predictor of both essay grade and learner proficiency level.

2016

A Bayesian model for joint word alignment and part-of-speech transfer
Robert Östling
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Current methods for word alignment require considerable amounts of parallel text to deliver accurate results, a requirement which is met only for a small minority of the world’s approximately 7,000 languages. We show that by jointly performing word alignment and annotation transfer in a novel Bayesian model, alignment accuracy can be improved for language pairs where annotations are available for only one of the languages—a finding which could facilitate the study and processing of a vast number of low-resource languages. We also present an evaluation where our method is used to perform single-source and multi-source part-of-speech transfer with 22 translations of the same text in four different languages. This allows us to quantify the considerable variation in accuracy depending on the specific source text(s) used, even with different translations into the same language.

Modelling the informativeness and timing of non-verbal cues in parent-child interaction
Kristina Nilsson Björkenstam | Mats Wirén | Robert Östling
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

Morphological reinflection with convolutional neural networks
Robert Östling
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Phrase-Based SMT for Finnish with More Data, Better Models and Alternative Alignment and Translation Tools
Jörg Tiedemann | Fabienne Cap | Jenna Kanerva | Filip Ginter | Sara Stymne | Robert Östling | Marion Weller-Di Marco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

How Many Languages Can a Language Model Model?
Robert Östling
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

One of the purposes of the VarDial workshop series is to encourage research into NLP methods that treat human languages as a continuum, by designing models that exploit the similarities between languages and variants. In my work, I am using a continuous vector representation of languages that allows modeling and exploring the language continuum in a very direct way. The basic tool for this is a character-based recurrent neural network language model conditioned on language vectors whose values are learned during training. By feeding the model Bible translations in a thousand languages, not only does the learned vector space capture language similarity, but by interpolating between the learned vectors it is possible to generate text in unattested intermediate forms between the training languages.

2015

Word Order Typology through Multilingual Word Alignment
Robert Östling
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Inferring the location of authors from words in their texts
Max Berggren | Jussi Karlgren | Robert Östling | Mikael Parkvall
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Enriching the Swedish Sign Language Corpus with Part of Speech Tags Using Joint Bayesian Word Alignment and Annotation Transfer
Robert Östling | Carl Börstell | Lars Wallin
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

Bayesian Word Alignment for Massively Parallel Texts
Robert Östling
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

Automated Essay Scoring for Swedish
Robert Östling | Andre Smolentzov | Björn Tyrefors Hinnerich | Erik Höglin
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic
Hrafn Loftsson | Robert Östling
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
