David Samuel - ACL Anthology

David Samuel

2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov | Tita Enstad | David Samuel | Hans Christian Farsethås | Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Findings of the Association for Computational Linguistics: ACL 2025

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets – of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-created prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pretrained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tend to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Small Languages, Big Models: A Study of Continual Training on Languages of Norway
David Samuel | Vladislav Mikhailov | Erik Velldal | Lilja Øvrelid | Lucas Georges Gabriel Charpentier | Andrey Kutuzov | Stephan Oepen
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.

2024

It’s Difficult to Be Neutral – Human and LLM-based Sentiment Annotation of Patient Comments
Petter Mæhlum | David Samuel | Rebecka Maria Norman | Elma Jelin | Øyvind Andresen Bjertnæs | Lilja Øvrelid | Erik Velldal
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.

GPT or BERT: why not both?
Lucas Georges Gabriel Charpentier | David Samuel
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack – GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.

More room for language: Investigating the effect of retrieval on language models
David Samuel | Lucas Charpentier | Sondre Wold
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Retrieval-augmented language models pose a promising alternative to standard language modeling. During pretraining, these models search in a corpus of documents for contextually relevant information that could aid the language modeling objective. We introduce an ‘ideal retrieval’ methodology to study these models in a fully controllable setting. We conduct an extensive evaluation to examine how retrieval augmentation affects the behavior of the underlying language model. Among other things, we observe that these models: (i) save substantially less world knowledge in their weights, (ii) are better at understanding local context and inter-word dependencies, but (iii) are worse at comprehending global context.

2023

Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
David Samuel
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Not all layers are equally as important: Every Layer Counts BERT
Lucas Georges Gabriel Charpentier | David Samuel
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel | Andrey Kutuzov | Lilja Øvrelid | Erik Velldal
Findings of the Association for Computational Linguistics: EACL 2023

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

Tokenization with Factorized Subword Encoding
David Samuel | Lilja Øvrelid
Findings of the Association for Computational Linguistics: ACL 2023

In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.

BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer
Lucas Charpentier | Sondre Wold | David Samuel | Egil Rønningstad
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Retrieval-based language models are increasingly employed in question-answering tasks. These models search in a corpus of documents for relevant information instead of having all factual knowledge stored in its parameters, thereby enhancing efficiency, transparency, and adaptability. We develop the first Norwegian retrieval-based model by adapting the REALM framework and evaluate it on various tasks. After training, we also separate the language model, which we call the reader, from the retriever components, and show that this can be fine-tuned on a range of downstream tasks. Results show that retrieval augmented language modeling improves the reader’s performance on extractive question-answering, suggesting that this type of training improves language models’ general ability to use context and that this does not happen at the expense of other abilities such as part-of-speech tagging, dependency parsing, named entity recognition, and lemmatization. Code, trained models, and data are made publicly available.

NoCoLA: The Norwegian Corpus of Linguistic Acceptability
Matias Jentoft | David Samuel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA-class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA-zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.

NorBench – A Benchmark for Norwegian Language Models
David Samuel | Andrey Kutuzov | Samia Touileb | Erik Velldal | Lilja Øvrelid | Egil Rønningstad | Elina Sigdel | Anna Palatkina
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

2022

Direct parsing to sentiment graphs
David Samuel | Jeremy Barnes | Robin Kurtz | Stephan Oepen | Lilja Øvrelid | Erik Velldal
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper demonstrates how a graph-based semantic parser can be applied to the task of structured sentiment analysis, directly predicting sentiment graphs from text. We advance the state of the art on 4 out of 5 standard benchmark sets. We release the source code, models and predictions.

EventGraph: Event Extraction as Semantic Graph Parsing
Huiling You | David Samuel | Samia Touileb | Lilja Øvrelid
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

Event extraction involves the detection and extraction of both the event triggers and the corresponding arguments. Existing systems often decompose event extraction into multiple subtasks, without considering their possible interactions. In this paper, we propose EventGraph, a joint framework for event extraction, which encodes events as graphs. We represent event triggers and arguments as nodes in a semantic graph. Event extraction therefore becomes a graph parsing problem, which provides the following advantages: 1) performing event detection and argument extraction jointly; 2) detecting and extracting multiple events from a piece of text; 3) capturing the complicated interaction between event arguments and triggers. Experimental results on ACE2005 show that our model is competitive to state-of-the-art systems and has substantially improved the results on argument extraction. Additionally, we create two new datasets from ACE2005 where we keep the entire text spans for event arguments, instead of just the head word(s). Our code and models will be released as open-source.

EventGraph at CASE 2021 Task 1: A General Graph-based Approach to Protest Event Extraction
Huiling You | David Samuel | Samia Touileb | Lilja Øvrelid
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

This paper presents our submission to the 2022 edition of the CASE 2021 shared task 1, subtask 4. The EventGraph system adapts an end-to-end, graph-based semantic parser to the task of Protest Event Extraction and more specifically subtask 4 on event trigger and argument extraction. We experiment with various graphs, encoding the events as either “labeled-edge” or “node-centric” graphs. We show that the “node-centric” approach yields best results overall, performing well across the three languages of the task, namely English, Spanish, and Portuguese. EventGraph is ranked 3rd for English and Portuguese, and 4th for Spanish.

2021

ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5
David Samuel | Milan Straka
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.

2020

ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN
David Samuel | Milan Straka
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

We present PERIN, a novel permutation-invariant approach to sentence-to-graph semantic parsing. PERIN is a versatile, cross-framework and language independent architecture for universal modeling of semantic structures. Our system participated in the CoNLL 2020 shared task, Cross-Framework Meaning Representation Parsing (MRP 2020), where it was evaluated on five different frameworks (AMR, DRG, EDS, PTG and UCCA) across four languages. PERIN was one of the winners of the shared task. The source code and pretrained models are available at http://www.github.com/ufal/perin.

Co-authors

Stephan Oepen 5

Petter Mæhlum 4

Mariia Fedorova 3

Samia Touileb 3

Nikolay Arefyev 2

Marta Bañón 2

Laurie Burchell 2

Hans Christian Farsethås 2

Liane Guillou 2

Jindřich Helcl 2

Erik Henriksson 2

Veronika Laippala 2

Bhavitvya Malik 2

Farrokh Mehryary 2

Amanda Myntti 2

Dayyán O’Brien 2

Sampo Pyysalo 2

Gema Ramírez-Sánchez 2

Egil Rønningstad 2

Pavel Stepachev 2

Jörg Tiedemann 2

Jaume Zaragoza-Bernabeu 2

Ona de Gibert 2

Jeremy Barnes 1

Magnus Breder Birkenes 1

Øyvind Andresen Bjertnæs 1

Rolv-Arild Braaten 1

Svein Arne Brygfjeld 1

Javier De La Rosa 1

Jonas Sebulon Frydenberg 1

Jon Atle Gulla 1

Victoria Handford 1

Matias Jentoft 1

Mateusz Klimaszewski 1

Ville Komulainen 1

Joona Kytöniemi 1

Victoria Ovedie Chruickshank Langø 1

Marthe Løken Midtgaard 1

Aslak Sira Myhre 1

Rebecka Maria Norman 1

Anna Palatkina 1

Yves Scherrer 1

Tereza Vojtěchová 1

Freddy Wetjen 1

Solveig Helene Willoch 1

Wilfred Østgulen 1

Venues