Tollef Jørgensen


2024

This paper describes a system submitted to the supervised track (Track A) at SemEval-24: Semantic Textual Relatedness for African and Asian Languages. Challenged with datasets of varying sizes, some as small as 800 samples, we observe that the PEAR system, using smaller pre-trained masked language models to process sentence pairs (Pair Encoding), results in models that efficiently adapt to the task.In addition to the simplistic modeling approach, we experiment with hyperparameter optimization and data expansion from the provided training sets using multilingual bi-encoders, sampling a dynamic number of nearest neighbors (Augmented Re-sampling). The final models are lightweight, allowing fast experimentation and integration of new languages.

2022

We present the Norwegian Anaphora Resolution Corpus (NARC), the first publicly available corpus annotated with anaphoric relations between noun phrases for Norwegian. The paper describes the annotated data for 326 documents in Norwegian Bokmål, together with inter-annotator agreement and discussions of relevant statistics. We also present preliminary modelling results which are comparable to existing corpora for other languages, and discuss relevant problems in relation to both modelling and the annotations themselves.