Workshop on Resources and Representations for Under-Resourced Languages and Domains (2023)


Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Nikolai Ilinykh | Felix Morger | Dana Dannélls | Simon Dobnik | Beáta Megyesi | Joakim Nivre

Ableist Language Teching over Sign Language Research
Carl Börstell

The progress made in computer-assisted linguistics has led to huge advances in natural language processing (NLP) research. This research often benefits linguistics in a broader sense, e.g., by digitizing pre-existing data and analyzing ever larger quantities of linguistic data in audio or visual form, such as sign language video data using computer vision methods. A large portion of research conducted on sign languages today is based in computer science and engineering, but much of this research is unfortunately conducted without any input from experts on the linguistics of sign languages or deaf communities. This is obvious from some of the language used in the published research, which regularly contains ableist labels. In this paper, I illustrate this by demonstrating the distribution of words in titles of research papers indexed by Google Scholar. By doing so, we see that the number of tech papers is increasing while the number of linguistics papers is (relatively) decreasing, and that ableist language is more frequent in tech papers. By extension, this suggests that much of the tech-related work on sign languages – heavily under-researched and under-resourced languages – is conducted without collaboration and consultation with deaf communities and experts, against ethical recommendations.

The DA-ELEXIS Corpus - a Sense-Annotated Corpus for Danish with Parallel Annotations for Nine European Languages
Bolette Pedersen | Sanni Nimb | Sussi Olsen | Thomas Troelsgård | Ida Flörke | Jonas Jensen | Henrik Lorentzen

In this paper, we present the newly compiled DA-ELEXIS Corpus, which is one of the largest sense-annotated corpora available for Danish, and the first one to be annotated with the Danish wordnet, DanNet. The corpus is part of a European initiative, the ELEXIS project, and has corresponding parallel annotations in nine other European languages. As such, it functions as a cross-lingual evaluative benchmark for a series of low- and medium-resourced European languages. We focus here on the Danish annotation process, i.e. on the annotation scheme, including the annotation guidelines, the primary sense inventory constituted by DanNet, and the fall-back sense inventory, The Danish Dictionary (DDO). We analyse and discuss issues such as out-of-vocabulary (OOV) problems, problems with sense granularity and missing senses (in particular for verbs), and how to semantically tag multiword expressions (MWE), which prove to occur very frequently in the Danish corpus. Finally, we calculate the inter-annotator agreement (IAA) and show how IAA has improved during the annotation process. The openly available corpus contains 32,524 tokens, of which sense annotations are given for all content words, amounting to 7,322 nouns, 3,099 verbs, 2,626 adjectives, and 1,677 adverbs.
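
As an illustration of how inter-annotator agreement on sense labels can be computed, the sketch below implements Cohen's kappa for two annotators. The abstract does not say which agreement measure the authors use, so the choice of kappa, the function name, and the sense labels are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: kappa is used here only as one common IAA measure;
    the measure used for the corpus itself is not stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sense labels for four tokens of the same lemma.
annotator_1 = ["sense_1", "sense_1", "sense_2", "sense_2"]
annotator_2 = ["sense_1", "sense_2", "sense_2", "sense_2"]
print(cohen_kappa(annotator_1, annotator_2))
```

Computing the score per annotation round would make an improvement over time visible, which is the kind of trend the abstract reports.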

Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter

In this paper, we present an approach for translating word embeddings from a majority language into four minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied to endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the four endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages, the only requirement being a dictionary between the endangered language and a majority language.

What Causes Unemployment? Unsupervised Causality Mining from Swedish Governmental Reports
Luise Dürlich | Joakim Nivre | Sara Stymne

Extracting statements about causality from text documents is a challenging task in the absence of annotated training data. We create a search system for causal statements about user-specified concepts by combining pattern matching of causal connectives with semantic similarity ranking, using a language model fine-tuned for semantic textual similarity. Preliminary experiments on a small test set from Swedish governmental reports show promising results in comparison to two simple baselines.
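
The two-stage idea described above (filter by causal connectives, then rank by similarity to the query concept) can be sketched as follows. This is a toy illustration, not the authors' system: the English connective list is invented for the example, and a bag-of-words cosine stands in for the fine-tuned language model used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical connective list; the paper works on Swedish text.
CAUSAL = re.compile(r"\b(because of|due to|leads to|causes|results in)\b", re.IGNORECASE)


def bow(text):
    """Lower-cased bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, sentences):
    """Keep sentences containing a causal connective, ranked by similarity to the query."""
    q = bow(query)
    hits = [s for s in sentences if CAUSAL.search(s)]
    # Stand-in similarity: swap in sentence embeddings for a realistic system.
    return sorted(hits, key=lambda s: cosine(q, bow(s)), reverse=True)


sentences = [
    "Low demand leads to higher unemployment.",
    "Unemployment rose last year.",
    "Heavy rain causes flooding.",
]
for hit in search("unemployment", sentences):
    print(hit)
```

The second sentence mentions the query concept but is dropped because it contains no causal connective, which is the point of combining the two stages.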

Are There Any Limits to English-Swedish Language Transfer? A Fine-grained Analysis Using Natural Language Inference
Felix Morger

The developments of deep learning in natural language processing (NLP) in recent years have resulted in an unprecedented amount of computational power and data required to train state-of-the-art NLP models. This makes lower-resource languages, such as Swedish, increasingly reliant on language transfer effects from English, since they do not have enough data to train separate monolingual models. In this study, we investigate whether there is any potential loss in English-Swedish language transfer by evaluating two types of language transfer on the GLUE/SweDiagnostics datasets and comparing across different linguistic phenomena. The results show that for an approach using machine translation for training there is no considerable loss in overall performance, nor for any particular linguistic phenomenon, while relying on pre-training of a multilingual model results in considerable loss in performance. This raises questions about the role of machine translation and the use of natural language inference (NLI) as well as parallel corpora for measuring English-Swedish language transfer.

Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis
Larisa Kolesnichenko | Erik Velldal | Lilja Øvrelid

This paper explores the use of masked language modeling (MLM) for data augmentation (DA), targeting structured sentiment analysis (SSA) for Norwegian based on a dataset of annotated reviews. Considering the limited resources for the Norwegian language and the complexity of the annotation task, the aim is to investigate whether this approach to data augmentation can help boost performance. We report on experiments with substituting words both inside and outside of sentiment annotations. We also present an error analysis, discuss some of the potential pitfalls of using MLM-based DA for SSA, and suggest directions for future work.

A Large Norwegian Dataset for Weak Supervision ASR
Per Erik Solberg | Pierre Beauguitte | Per Egil Kummervold | Freddy Wetjen

With the advent of weakly supervised ASR systems like Whisper, it is possible to train ASR systems on non-verbatim transcriptions. This paper describes an effort to create a large Norwegian dataset for weakly supervised ASR from parliamentary recordings. Audio from Stortinget, the Norwegian parliament, is segmented and transcribed with an existing ASR system. An algorithm retrieves transcripts of these segments from Stortinget’s official proceedings using the Levenshtein edit distance between the ASR output and the proceedings text. In that way, a dataset of more than 5000 hours of transcribed speech is produced with limited human effort. Since parliamentary data is public domain, the dataset can be shared freely without any restrictions.
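
The retrieval step described above can be illustrated with a small sketch: slide a window over the proceedings text and keep the span with the lowest word-level Levenshtein distance to the ASR output. This is a simplified, brute-force illustration under assumed names (`best_window`, `slack`) and invented English example text; the abstract does not give the details of the authors' algorithm.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_window(asr_words, proceedings_words, slack=2):
    """Return (distance, start, end) of the proceedings span closest to the ASR output.

    `slack` (hypothetical parameter) lets the span be a few words shorter
    or longer than the ASR segment. Returns None if no span fits.
    """
    n = len(asr_words)
    best = None
    for start in range(len(proceedings_words)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(proceedings_words):
                break
            d = levenshtein(asr_words, proceedings_words[start:end])
            if best is None or d < best[0]:
                best = (d, start, end)
    return best


asr = "we must raise the budget".split()
proceedings = "honourable president we must increase the budget this year".split()
distance, start, end = best_window(asr, proceedings)
print(distance, " ".join(proceedings[start:end]))
```

Comparing at the word level tolerates the rewording typical of edited proceedings; a production system would need a faster search than this exhaustive scan to handle thousands of hours of audio.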

Lexical Semantics with Vector Symbolic Architectures
Adam Roussel

Conventional approaches to the construction of word vectors typically require very large amounts of unstructured text and powerful computing hardware, and the vectors themselves are also difficult if not impossible to inspect or interpret on their own. In this paper, we introduce a method for building word vectors using the framework of vector symbolic architectures in order to encode the semantic information in wordnets, such as the Open English WordNet or the Open Multilingual Wordnet. Such vectors perform surprisingly well on common word similarity benchmarks, and yet they are transparent, interpretable, and the information contained within them has a clear provenance.
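
To make the vector symbolic architecture idea concrete, the sketch below uses one common VSA variant: random bipolar vectors, binding by element-wise multiplication, and bundling by majority vote. The relation names and the toy entry for "dog" are invented for illustration, and the abstract does not specify which VSA operations or wordnet relations the author actually uses.

```python
import random

DIM = 2048
rng = random.Random(0)  # fixed seed so the example is reproducible


def rand_vec():
    """Random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Binding: element-wise product; its own inverse for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(vecs):
    """Bundling: element-wise majority vote (ties, possible for even counts, go to -1)."""
    return [1 if sum(col) > 0 else -1 for col in zip(*vecs)]


def sim(a, b):
    """Cosine similarity, which reduces to dot / DIM for bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM


# Hypothetical relations and concepts standing in for wordnet content.
roles = {name: rand_vec() for name in ("hypernym", "meronym", "domain")}
fillers = {name: rand_vec() for name in ("animal", "tail", "pets", "vehicle")}

# A word vector is a bundle of role-filler bindings.
dog = bundle([
    bind(roles["hypernym"], fillers["animal"]),
    bind(roles["meronym"], fillers["tail"]),
    bind(roles["domain"], fillers["pets"]),
])

# Unbinding with a role recovers a noisy copy of its filler.
probe = bind(dog, roles["hypernym"])
for name, vec in fillers.items():
    print(name, round(sim(probe, vec), 2))
```

Because each stored relation can be queried back out of the word vector in this way, the representation stays inspectable, which is the transparency property the abstract emphasises.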

Linked Open Data compliant Representation of the Interlinking of Nordic Wordnets and Sign Language Data
Thierry Declerck | Sussi Olsen

We present ongoing work dealing with a Linked Open Data (LOD) compliant representation of Sign Language (SL) data, with the goal of supporting the cross-lingual linking of SL data, including links to spoken language data. As the European EASIER research project has already investigated the use of Open Multilingual Wordnet (OMW) datasets for cross-linking German and Greek SL data, we propose a unified RDF-based representation of OMW and SL data. In this context, we experimented with the transformation into RDF of a rich dataset, which links Danish Sign Language data and the wordnet for Danish, DanNet. We extend this work to other Nordic languages, aiming at supporting cross-lingual comparisons of Nordic Sign Languages. This unified formal representation offers a semantic repository of information on SL data that could be accessed for supporting the creation of datasets for training or evaluating NLP applications that involve SLs.

Part-of-Speech tagging Spanish Sign Language data and its applications in Sign Language machine translation
Euan McGill | Luis Chiruzzo | Santiago Egea Gómez | Horacio Saggion

This paper examines the use of manually part-of-speech tagged sign language gloss data in the Text2Gloss and Gloss2Text translation tasks, as well as running an LSTM-based sequence labelling model on the same glosses for automatic part-of-speech tagging. We find that a combination of tag-enhanced glosses and pretraining the neural model positively impacts performance in the translation tasks. The results of the tagging task are limited, but provide a methodological framework for further research into tagging sign language gloss data.

A Diagnostic Dataset for Sentiment and Negation Modeling for Norwegian
Petter Mæhlum | Erik Velldal | Lilja Øvrelid

Negation constitutes a challenging phenomenon for many natural language processing tasks, such as sentiment analysis (SA). In this paper we investigate the relationship between negation and sentiment in the context of Norwegian professional reviews. The first part of this paper includes a corpus study which investigates how negation is tied to sentiment in this domain, based on existing annotations. In the second part, we introduce NoReC-NegSynt, a synthetically augmented test set for negation and sentiment, to allow for a more detailed analysis of the role of negation in current neural SA models. This diagnostic test set, containing both clausal and non-clausal negation, allows for analyzing and comparing models’ abilities to treat several different types of negation. We also present a case-study, applying several neural SA models to the diagnostic data.

Building Okinawan Lexicon Resource for Language Reclamation/Revitalization and Natural Language Processing Tasks such as Universal Dependencies Treebanking
So Miyagawa | Kanji Kato | Miho Zlazli | Salvatore Carlino | Seira Machida

The Open Multilingual Online Lexicon of Okinawan (OMOLO) project aims to create an accessible, user-friendly digital lexicon for the endangered Okinawan language using digital humanities tools and methodologies. The multilingual web application, available in Japanese, English, Portuguese, and Spanish, will benefit language learners, researchers, and the Okinawan community in Japan and diaspora countries such as the U.S., Brazil, and Peru. The project also lays the foundation for an Okinawan UD Treebank, which will support computational analysis and the development of language technology tools such as parsers, machine translation systems, and speech recognition software. The OMOLO project demonstrates the potential of computational linguistics in preserving and revitalizing endangered languages and can serve as a blueprint for similar initiatives.

Bridging the Resource Gap: Exploring the Efficacy of English and Multilingual LLMs for Swedish
Oskar Holmström | Jenny Kunz | Marco Kuhlmann

Large language models (LLMs) have substantially improved natural language processing (NLP) performance, but training these models from scratch is resource-intensive and challenging for smaller languages. With this paper, we want to initiate a discussion on the necessity of language-specific pre-training of LLMs. We propose how the “one model-many models” conceptual framework for task transfer can be applied to language transfer and explore this approach by evaluating the performance of non-Swedish monolingual and multilingual models on tasks in Swedish. Our findings demonstrate that LLMs exposed to limited Swedish during training can be highly capable and transfer competencies from English off-the-shelf, including emergent abilities such as mathematical reasoning, while at the same time showing distinct culturally adapted behaviour. Our results suggest that there are resourceful alternatives to language-specific pre-training when creating useful LLMs for small languages.

Phonotactics as an Aid in Low Resource Loan Word Detection and Morphological Analysis in Sakha
Petter Mæhlum | Sardana Ivanova

Obtaining information about loan words and irregular morphological patterns can be difficult for low-resource languages. Using Sakha as an example, we show that it is possible to exploit known phonemic regularities such as vowel harmony and consonant distributions to identify loan words and irregular patterns, which can be helpful in rule-based downstream tasks such as parsing and POS-tagging. We evaluate phonemically inspired methods for loanword detection, combined with bi-gram vowel transition probabilities to inspect irregularities in the morphology of loanwords. We show that both these techniques can be useful for the detection of such patterns. Finally, we inspect the plural suffix -ЛАр [-LAr] to observe some of the variation in morphology between native and foreign words.
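
The bigram vowel transition idea can be sketched as below: estimate vowel-to-vowel transition probabilities from words assumed to be native, then score new words, with low scores flagging loanword candidates. The vowel inventory and the word list are toy inventions in Latin script, not Sakha data, and the smoothing and scoring choices are assumptions rather than the authors' exact method.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # toy inventory, NOT the Sakha vowel system


def vowel_bigrams(word):
    """Pairs of consecutive vowels in a word, skipping consonants."""
    vs = [c for c in word if c in VOWELS]
    return list(zip(vs, vs[1:]))


def train(words):
    """Count vowel bigrams and how often each vowel starts a bigram."""
    counts = Counter(bg for w in words for bg in vowel_bigrams(w))
    firsts = Counter()
    for (first, _), c in counts.items():
        firsts[first] += c
    return counts, firsts


def avg_log_prob(word, counts, firsts):
    """Mean log transition probability with add-one smoothing over the vowel inventory."""
    bigrams = vowel_bigrams(word)
    if not bigrams:
        return 0.0  # fewer than two vowels: nothing to score
    total = 0.0
    for first, second in bigrams:
        p = (counts[(first, second)] + 1) / (firsts[first] + len(VOWELS))
        total += math.log(p)
    return total / len(bigrams)


# Invented "native" words that respect a toy front/back harmony pattern.
native = ["balar", "kolun", "kelin", "sitim", "tabar", "kiler"]
counts, firsts = train(native)
for candidate in ("kalan", "kalin"):
    print(candidate, round(avg_log_prob(candidate, counts, firsts), 3))
```

A word whose vowel sequence breaks the patterns seen in the training list receives a lower score, so a threshold on this score gives a simple loanword detector that needs no annotated data.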

Vector Norms as an Approximation of Syntactic Complexity
Adam Ek | Nikolai Ilinykh

Internal representations in transformer models can encode useful linguistic knowledge about syntax. Such knowledge could help optimise the data annotation process. However, identifying and extracting such representations from big language models is challenging. In this paper we evaluate two multilingual transformers for the presence of knowledge about the syntactic complexity of sentences and examine different vector norms. We provide a fine-grained evaluation of different norms in different layers and for different languages. Our results suggest that no single part of the models is the primary source of knowledge about syntactic complexity. However, some norms show a higher degree of sensitivity to syntactic complexity, depending on the language and model used.
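
For readers unfamiliar with vector norms, the sketch below computes three standard ones (L1, L2 and L-infinity) for a sentence vector obtained by mean-pooling token vectors. Which norms the authors examine, and how they derive sentence-level vectors from each layer, is not stated in the abstract, so both choices here are assumptions, and the token vectors are invented numbers.

```python
import math


def vector_norms(vec):
    """L1, L2 and L-infinity norms of a vector given as a list of floats."""
    abs_vals = [abs(x) for x in vec]
    return {
        "l1": sum(abs_vals),
        "l2": math.sqrt(sum(x * x for x in vec)),
        "linf": max(abs_vals),
    }


def mean_pool(token_vectors):
    """Average token vectors into one sentence vector (assumed pooling strategy)."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


# Invented 2-dimensional hidden states for a two-token sentence.
hidden_states = [[1.0, 2.0], [3.0, 4.0]]
print(vector_norms(mean_pool(hidden_states)))
```

Each norm reduces a layer's representation to a single number per sentence, which can then be compared against a syntactic complexity score for that sentence.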

Low-Resource Techniques for Analysing the Rhetorical Structure of Swedish Historical Petitions
Ellinor Lindqvist | Eva Pettersson | Joakim Nivre

Natural language processing techniques can be valuable for improving and facilitating historical research. This is also true for the analysis of petitions, a source which has been relatively little used in historical research. However, limited data resources pose challenges for mainstream natural language processing approaches based on machine learning. In this paper, we explore methods for automatically segmenting petitions according to their rhetorical structure. We find that the use of rules, word embeddings, and especially keywords can give promising results for this task.