David Samuel


2024

More room for language: Investigating the effect of retrieval on language models
David Samuel | Lucas Charpentier | Sondre Wold
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Retrieval-augmented language models offer a promising alternative to standard language modeling. During pretraining, these models search a corpus of documents for contextually relevant information that could aid the language modeling objective. We introduce an ‘ideal retrieval’ methodology to study these models in a fully controllable setting. We conduct an extensive evaluation to examine how retrieval augmentation affects the behavior of the underlying language model. Among other things, we observe that these models: (i) store substantially less world knowledge in their weights, (ii) are better at understanding local context and inter-word dependencies, but (iii) are worse at comprehending global context.
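
To make the general mechanism concrete, here is a minimal sketch of retrieval-augmented language modeling: before a training chunk is scored, a relevant passage is fetched from a retrieval corpus and prepended to the context. The TF-IDF retriever, toy corpus, and concatenation format below are illustrative assumptions; the paper's ‘ideal retrieval’ setup is considerably more controlled.

    # Toy illustration of retrieval augmentation: fetch the most relevant document
    # and prepend it to the chunk that the language model will condition on.
    # The retriever and the "[SEP]" concatenation format are assumptions, not the
    # paper's actual setup.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    retrieval_corpus = [
        "Oslo is the capital of Norway.",
        "The British National Corpus contains about 100 million words.",
        "Transformers process tokens with self-attention.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(retrieval_corpus)

    def retrieve(query: str) -> str:
        """Return the corpus document most similar to the query chunk."""
        query_vector = vectorizer.transform([query])
        scores = cosine_similarity(query_vector, doc_vectors)[0]
        return retrieval_corpus[scores.argmax()]

    def augment(chunk: str) -> str:
        """Prepend the retrieved passage so the LM can condition on it."""
        return retrieve(chunk) + " [SEP] " + chunk

    print(augment("The capital of Norway has around 700,000 inhabitants."))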

It’s Difficult to Be Neutral – Human and LLM-based Sentiment Annotation of Patient Comments
Petter Mæhlum | David Samuel | Rebecka Maria Norman | Elma Jelin | Øyvind Andresen Bjertnæs | Lilja Øvrelid | Erik Velldal
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

Sentiment analysis is an important tool for aggregating patient voices in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, and comparing their performance to that of human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.
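
As a rough illustration of the LLM-as-annotator setup, the sketch below builds a zero-shot prompt for a single comment and maps the model's completion back to a label. The prompt wording, label set, and the stub generator are assumptions for illustration only; the paper experiments with several prompt and in-context configurations for Norwegian LLMs.

    # Minimal sketch of zero-shot sentiment annotation with an LLM. The prompt,
    # the label set, and the `generate` stub are illustrative assumptions.
    from typing import Callable

    LABELS = ("positive", "negative")

    def build_prompt(comment: str) -> str:
        return (
            "Classify the sentiment of the following patient comment as "
            "positive or negative.\n"
            f"Comment: {comment}\n"
            "Sentiment:"
        )

    def annotate(comment: str, generate: Callable[[str], str]) -> str:
        """Ask the model for a label; fall back to 'unknown' if unparsable."""
        completion = generate(build_prompt(comment)).strip().lower()
        for label in LABELS:
            if completion.startswith(label):
                return label
        return "unknown"

    # Dummy generator standing in for a real LLM call.
    print(annotate("The staff were friendly and helpful.", lambda prompt: "positive"))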

2023

Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
David Samuel
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Not all layers are equally as important: Every Layer Counts BERT
Lucas Georges Gabriel Charpentier | David Samuel
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel | Andrey Kutuzov | Lilja Øvrelid | Erik Velldal
Findings of the Association for Computational Linguistics: EACL 2023

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that corpora of this type have great potential as language modeling benchmarks. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

Tokenization with Factorized Subword Encoding
David Samuel | Lilja Øvrelid
Findings of the Association for Computational Linguistics: ACL 2023

In recent years, language models have become increasingly large and complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords into discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
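
The sketch below illustrates only the underlying idea of the Factorizer, namely representing one large subword index as a triplet of small discrete codes. It uses fixed base-256 digits as a stand-in for the codes that the actual method learns with a VQ-VAE.

    # Toy index factorization: split one large vocabulary index into a triplet of
    # small codes and reconstruct it. The real Factorizer learns the codes with a
    # VQ-VAE; the fixed base-256 digits here are a simplification.
    BASE = 256  # size of each code book in this toy version (assumption)

    def factorize(subword_id: int) -> tuple[int, int, int]:
        """Split one vocabulary index into a triplet of codes in [0, BASE)."""
        assert 0 <= subword_id < BASE ** 3
        return (subword_id // BASE ** 2, (subword_id // BASE) % BASE, subword_id % BASE)

    def defactorize(triplet: tuple[int, int, int]) -> int:
        a, b, c = triplet
        return a * BASE ** 2 + b * BASE + c

    subword_id = 1_000_003
    triplet = factorize(subword_id)
    assert defactorize(triplet) == subword_id
    print(triplet)  # (15, 66, 67)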

BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer
Lucas Charpentier | Sondre Wold | David Samuel | Egil Rønningstad
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Retrieval-based language models are increasingly employed in question-answering tasks. These models search a corpus of documents for relevant information instead of storing all factual knowledge in their parameters, thereby enhancing efficiency, transparency, and adaptability. We develop the first Norwegian retrieval-based model by adapting the REALM framework and evaluate it on various tasks. After training, we also separate the language model, which we call the reader, from the retriever components, and show that the reader can be fine-tuned on a range of downstream tasks. Results show that retrieval-augmented language modeling improves the reader’s performance on extractive question answering, suggesting that this type of training improves language models’ general ability to use context, and that this does not come at the expense of other abilities such as part-of-speech tagging, dependency parsing, named entity recognition, and lemmatization. Code, trained models, and data are made publicly available.

NoCoLA: The Norwegian Corpus of Linguistic Acceptability
Matias Jentoft | David Samuel
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA-class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA-zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.
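
As an illustration of the zero-shot setting, the sketch below scores a minimal pair with a causal language model and prefers the sentence with the higher log-likelihood. The model name is a placeholder, and the exact NoCoLA-zero scoring protocol (for example, its handling of masked language models) may differ from this sketch.

    # Zero-shot acceptability preference via sentence log-likelihood. The model
    # name is a placeholder; substitute a Norwegian LM. This is an illustration,
    # not the exact NoCoLA-zero protocol.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"  # placeholder checkpoint (assumption)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

    @torch.no_grad()
    def log_likelihood(sentence: str) -> float:
        inputs = tokenizer(sentence, return_tensors="pt")
        # The loss is the mean negative log-likelihood per predicted token.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        return -loss.item() * (inputs["input_ids"].shape[1] - 1)

    def prefers_acceptable(acceptable: str, unacceptable: str) -> bool:
        return log_likelihood(acceptable) > log_likelihood(unacceptable)

    print(prefers_acceptable("The dogs bark loudly.", "The dogs barks loudly."))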

NorBench – A Benchmark for Norwegian Language Models
David Samuel | Andrey Kutuzov | Samia Touileb | Erik Velldal | Lilja Øvrelid | Egil Rønningstad | Elina Sigdel | Anna Palatkina
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

2022

Direct parsing to sentiment graphs
David Samuel | Jeremy Barnes | Robin Kurtz | Stephan Oepen | Lilja Øvrelid | Erik Velldal
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper demonstrates how a graph-based semantic parser can be applied to the task of structured sentiment analysis, directly predicting sentiment graphs from text. We advance the state of the art on 4 out of 5 standard benchmark sets. We release the source code, models and predictions.

EventGraph: Event Extraction as Semantic Graph Parsing
Huiling You | David Samuel | Samia Touileb | Lilja Øvrelid
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

Event extraction involves the detection and extraction of both the event triggers and the corresponding arguments. Existing systems often decompose event extraction into multiple subtasks, without considering their possible interactions. In this paper, we propose EventGraph, a joint framework for event extraction, which encodes events as graphs. We represent event triggers and arguments as nodes in a semantic graph. Event extraction therefore becomes a graph parsing problem, which provides the following advantages: 1) performing event detection and argument extraction jointly; 2) detecting and extracting multiple events from a piece of text; and 3) capturing the complicated interactions between event arguments and triggers. Experimental results on ACE2005 show that our model is competitive with state-of-the-art systems and substantially improves results on argument extraction. Additionally, we create two new datasets from ACE2005 where we keep the entire text spans for event arguments, instead of just the head word(s). Our code and models will be released as open source.
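
A minimal data-structure sketch of the graph encoding: the trigger and its arguments become nodes, and role-labeled edges connect the trigger to each argument. The field names and the example event below are illustrative and do not reproduce the paper's exact graph flavor.

    # Illustrative encoding of one event as a graph: trigger and arguments are
    # nodes, and (role, node) pairs act as labeled edges from the trigger.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        text: str              # surface span, e.g. "attacked"
        label: str             # event type or entity label
        span: tuple[int, int]  # character offsets in the source text

    @dataclass
    class Event:
        trigger: Node
        arguments: list[tuple[str, Node]] = field(default_factory=list)  # (role, node)

    sentence = "Rebels attacked the convoy near the border."
    event = Event(
        trigger=Node("attacked", "Conflict:Attack", (7, 15)),
        arguments=[
            ("Attacker", Node("Rebels", "ORG", (0, 6))),
            ("Target", Node("the convoy", "VEH", (16, 26))),
        ],
    )
    print(event.trigger.label, [role for role, _ in event.arguments])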

EventGraph at CASE 2021 Task 1: A General Graph-based Approach to Protest Event Extraction
Huiling You | David Samuel | Samia Touileb | Lilja Øvrelid
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

This paper presents our submission to the 2022 edition of the CASE 2021 shared task 1, subtask 4. The EventGraph system adapts an end-to-end, graph-based semantic parser to the task of Protest Event Extraction, and more specifically to subtask 4 on event trigger and argument extraction. We experiment with various graph representations, encoding the events as either “labeled-edge” or “node-centric” graphs. We show that the “node-centric” approach yields the best results overall, performing well across the three languages of the task, namely English, Spanish, and Portuguese. EventGraph is ranked 3rd for English and Portuguese, and 4th for Spanish.

2021

ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5
David Samuel | Milan Straka
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
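
For orientation, the sketch below shows plain byte-level sequence-to-sequence inference with ByT5, the backbone of the system. It loads the generic pretrained checkpoint rather than the fine-tuned normalization models (which are released at https://huggingface.co/ufal), and the input format is an assumption.

    # Byte-level seq2seq inference with ByT5. This uses the generic pretrained
    # checkpoint and a simple sentence-in/sentence-out format; the actual system
    # is further pre-trained on synthetic data and fine-tuned for normalization.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    MODEL_NAME = "google/byt5-small"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

    def normalize(noisy_sentence: str) -> str:
        """Map a noisy social-media sentence to a generated output string."""
        inputs = tokenizer(noisy_sentence, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=64)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(normalize("new pix comming tomoroe"))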

2020

ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN
David Samuel | Milan Straka
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

We present PERIN, a novel permutation-invariant approach to sentence-to-graph semantic parsing. PERIN is a versatile, cross-framework, and language-independent architecture for universal modeling of semantic structures. Our system participated in the CoNLL 2020 shared task, Cross-Framework Meaning Representation Parsing (MRP 2020), where it was evaluated on five different frameworks (AMR, DRG, EDS, PTG and UCCA) across four languages. PERIN was one of the winners of the shared task. The source code and pretrained models are available at http://www.github.com/ufal/perin.
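
Permutation invariance here means that the training loss should not depend on the order in which graph nodes are predicted; a standard way to achieve this is to match predicted nodes to gold nodes with a bipartite (Hungarian) assignment before computing the loss. The sketch below shows only that matching step over an illustrative cost matrix, not PERIN's actual scoring function.

    # Hungarian matching between predicted and gold nodes, so that the loss is
    # invariant to the prediction order. The cost matrix is illustrative.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # cost[i, j] = cost of assigning predicted node i to gold node j
    cost = np.array([
        [0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.8, 0.3],
    ])

    pred_idx, gold_idx = linear_sum_assignment(cost)
    matching_cost = cost[pred_idx, gold_idx].sum()
    print(list(zip(pred_idx.tolist(), gold_idx.tolist())), matching_cost)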