Jonathan Hvithamar Rystrøm
2025
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs
Jonathan Hvithamar Rystrøm | Hannah Rose Kirk | Scott Hale
Proceedings of the Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models
Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Values Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare three families of models: Google’s Gemma models (2B-27B parameters), AI2’s OLMo models (7B-32B parameters), and successive iterations of OpenAI’s turbo-series. Across the families of models, we find no consistent relationship between language capabilities and cultural alignment. While the Gemma models show a positive correlation between language capability and cultural alignment across all languages, the OpenAI and OLMo models are inconsistent. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.
2021
The Danish Gigaword Corpus
Leon Strømberg-Derczynski | Manuel Ciosici | Rebekah Baglini | Morten H. Christiansen | Jacob Aarup Dalsgaard | Riccardo Fusaroli | Peter Juel Henrichsen | Rasmus Hvingelby | Andreas Kirkedal | Alex Speed Kjeldsen | Claus Ladefoged | Finn Årup Nielsen | Jens Madsen | Malte Lau Petersen | Jonathan Hvithamar Rystrøm | Daniel Varab
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.