Lukas Edman - ACL Anthology

2025

Positional Overload: Positional Debiasing and Context Window Extension for Large Language Models using Set Encoding
Lukas Kinder | Lukas Edman | Alexander Fraser | Tobias Käfer
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) typically track the order of tokens using positional encoding, which causes the following problems: positional bias, where the model is influenced by an ordering within the prompt, and a fixed context window, as models struggle to generalize to positions beyond those encountered during training. To address these limitations, we developed a novel method called set encoding. This method allows multiple pieces of text to be encoded in the same position, thereby eliminating positional bias entirely. Another promising use case for set encoding is to increase the size of the input an LLM can handle. Our experiments demonstrate that set encoding allows an LLM to solve tasks with far more tokens than without set encoding. To our knowledge, set encoding is the first technique to effectively extend an LLM’s context window without requiring any additional training.

Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
Lukas Edman | Alexander Fraser
Proceedings of the First BabyLM Workshop

We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model’s ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model’s morphological generalization capabilities. Our submission beats the baseline in the strict-small track.

EXECUTE: A Multilingual Benchmark for LLM Token Understanding
Lukas Edman | Helmut Schmid | Alexander Fraser
Findings of the Association for Computational Linguistics: ACL 2025

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.

Findings of the WMT 2025 Shared Task LLMs with Limited Resources for Slavic Languages: MT and QA
Shu Okabe | Daryna Dementieva | Marion Di Marco | Lukas Edman | Katharina Haemmerl | Marko Měškank | Anita Hendrichowa | Alexander Fraser
Proceedings of the Tenth Conference on Machine Translation

We present the findings of the WMT 2025 Shared Task LLMs with Limited Resources for Slavic Languages. This shared task focuses on training LLMs using limited data and compute resources for three Slavic languages: Upper Sorbian (hsb), Lower Sorbian (dsb), and Ukrainian (uk), with the objective to develop and improve LLMs for these languages. We consider two tasks which are to be evaluated jointly: Machine Translation (MT) and Multiple-Choice Question Answering (QA). In total, three teams participated in this shared task, with submissions from all three teams for the Sorbian languages and one submission for Ukrainian. All submissions led to an improvement compared to the baseline Qwen2.5-3B model through varying fine-tuning strategies. We note, however, that training purely on MT degrades original QA capabilities. We also report further analyses on the submissions, including MT evaluation using advanced neural metrics for Ukrainian, as well as manual annotation and comparison to the current Sorbian machine translator.

2024

Are BabyLMs Second Language Learners?
Lukas Edman | Lisa Bylinina | Faeze Ghorbanpour | Alexander Fraser
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

This paper describes a linguistically motivated approach to the 2024 edition of the BabyLM Challenge. Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.

CUTE: Measuring LLMs’ Understanding of Their Tokens
Lukas Edman | Helmut Schmid | Alexander Fraser
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation
Lukas Edman | Gabriele Sarti | Antonio Toral | Gertjan van Noord | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 12

Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.

2023

Too Much Information: Keeping Training Simple for BabyLMs
Lukas Edman | Lisa Bylinina
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Konstantin Chernyshev | Ekaterina Garanina | Duygu Bayram | Qiankun Zheng | Lukas Edman
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Misogyny and sexism are growing problems in social media. Advances have been made in online sexism detection but the systems are often uninterpretable. SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at increasing explainability of the sexism detection, and our team participated in all the proposed subtasks. Our system is based on further domain-adaptive pre-training. Building on the Transformer-based models with the domain adaptation, we compare fine-tuning with multi-task learning and show that each subtask requires a different system configuration. In our experiments, multi-task learning performs on par with standard fine-tuning for sexism detection and noticeably better for coarse-grained sexism classification, while fine-tuning is preferable for fine-grained classification.

2022

Subword-Delimited Downsampling for Better Character-Level Translation
Lukas Edman | Antonio Toral | Gertjan van Noord
Findings of the Association for Computational Linguistics: EMNLP 2022

Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.

RUG-1-Pegasussers at SemEval-2022 Task 3: Data Generation Methods to Improve Recognizing Appropriate Taxonomic Word Relations
Frank van den Berg | Gijs Danoe | Esther Ploeger | Wessel Poelman | Lukas Edman | Tommaso Caselli
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes our system created for the SemEval 2022 Task 3: Presupposed Taxonomies - Evaluating Neural-network Semantics. This task is focused on correctly recognizing taxonomic word relations in English, French and Italian. We developed various data generation techniques that expand the originally provided train set and show that all methods increase the performance of models trained on these expanded datasets. Our final system outperformed the baseline system from the task organizers by achieving an average macro F1 score of 79.6 on all languages, compared to the baseline’s 67.4.

2021

The Importance of Context in Very Low Resource Language Modeling
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

This paper investigates very low resource language model pretraining, when less than 100 thousand sentences are available. We find that, in very low-resource scenarios, statistical n-gram language models outperform state-of-the-art neural models. Our experiments show that this is mainly due to the focus of the former on a local context. As such, we introduce three methods to improve a neural model’s performance in the low-resource setting, finding that limiting the model’s self-attention is the most effective one, improving on downstream tasks such as NLI and POS tagging by up to 5% for the languages we test on: English, Hindi, and Turkish.

Unsupervised Translation of German–Lower Sorbian: Exploring Training and Novel Transfer Methods on a Low-Resource Language
Lukas Edman | Ahmet Üstün | Antonio Toral | Gertjan van Noord
Proceedings of the Sixth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2021 Unsupervised Machine Translation task for German–Lower Sorbian (DE–DSB): a high-resource language to a low-resource one. Our system uses a transformer encoder-decoder architecture in which we make three changes to the standard training procedure. First, our training focuses on two languages at a time, contrasting with a wealth of research on multilingual systems. Second, we introduce a novel method for initializing the vocabulary of an unseen language, achieving improvements of 3.2 BLEU for DE->DSB and 4.0 BLEU for DSB->DE. Lastly, we experiment with the order in which offline and online back-translation are used to train an unsupervised system, finding that using online back-translation first works better for DE->DSB by 2.76 BLEU. Our submissions ranked first (tied with another team) for DSB->DE and third for DE->DSB.

2020

Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.

Machine Translation for English–Inuktitut with Segmentation, Data Acquisition and Pre-Training
Christian Roest | Lukas Edman | Gosse Minnema | Kevin Kelly | Jennifer Spenader | Antonio Toral
Proceedings of the Fifth Conference on Machine Translation

Translating to and from low-resource polysynthetic languages presents numerous challenges for NMT. We present the results of our systems for the English–Inuktitut language pair for the WMT 2020 translation tasks. We investigated the importance of correct morphological segmentation, whether or not adding data from a related language (Greenlandic) helps, and whether using contextual word embeddings improves translation. While each method showed some promise, the results are mixed.

Data Selection for Unsupervised Translation of German–Upper Sorbian
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the Fifth Conference on Machine Translation

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.

2019

Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
Antonio Toral | Lukas Edman | Galiya Yeshmagambetova | Jennifer Spenader
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the systems submitted by the University of Groningen to the English–Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsupervised and rule-based), given the agglutinative nature of Kazakh, (ii) data from two additional languages (Turkish and Russian), given the scarcity of English–Kazakh data, and (iii) synthetic data, both for the source and for the target language. Our best submissions ranked second for Kazakh→English and third for English→Kazakh in terms of the BLEU automatic evaluation metric.
