Shorouq Zahra


2025

SweSAT-1.0: The Swedish University Entrance Exam as a Benchmark for Large Language Models
Murathan Kurfalı | Shorouq Zahra | Evangelia Gogoulou | Luise Dürlich | Fredrik Carlsson | Joakim Nivre
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

This paper introduces SweSAT-1.0, a new benchmark dataset created from the Swedish university entrance exam (Högskoleprovet) to assess large language models in Swedish. The current version of the benchmark includes 867 questions across six different tasks, including reading comprehension, mathematical problem solving, and logical reasoning. We find that some widely used open-source and commercial models excel in verbal tasks, but we also see that all models, even the commercial ones, struggle with reasoning tasks in Swedish. We hope that SweSAT-1.0 will facilitate research on large language models for Swedish by enriching the breadth of available tasks, offering a challenging evaluation benchmark that is free from any translation biases.

2024

Using LLMs to Build a Database of Climate Extreme Impacts
Ni Li | Shorouq Zahra | Mariana Brito | Clare Flynn | Olof Görnerup | Koffi Worou | Murathan Kurfali | Chanjuan Meng | Wim Thiery | Jakob Zscheischler | Gabriele Messori | Joakim Nivre
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

To better understand how extreme climate events impact society, we need to increase the availability of accurate and comprehensive information about these impacts. We propose a method for building large-scale databases of climate extreme impacts from online textual sources, using LLMs for information extraction in combination with more traditional NLP techniques to improve accuracy and consistency. We evaluate the method against a small benchmark database created by human experts and find that extraction accuracy varies for different types of information. We compare three different LLMs and find that, while the commercial GPT-4 model gives the best performance overall, the open-source models Mistral and Mixtral are competitive for some types of information.