Training and Evaluating Norwegian Sentence Embedding Models

Bernt Ivar Utstøl Nødland


Abstract
We train and evaluate Norwegian sentence embedding models using the contrastive learning methodology SimCSE. We start from pre-trained Norwegian encoder models and train both unsupervised and supervised models. The models are evaluated on machine-translated versions of semantic textual similarity datasets, as well as on binary classification tasks. We show that we can train good Norwegian sentence embedding models that clearly outperform both the pre-trained encoder models and the multilingual mBERT on the task of sentence similarity.
Anthology ID:
2023.nodalida-1.23
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
228–237
URL:
https://aclanthology.org/2023.nodalida-1.23
Cite (ACL):
Bernt Ivar Utstøl Nødland. 2023. Training and Evaluating Norwegian Sentence Embedding Models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 228–237, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Training and Evaluating Norwegian Sentence Embedding Models (Nødland, NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.23.pdf