Zahra Kolagar

2025

Investigating Methods for Mapping Learning Objectives to Bloom’s Revised Taxonomy in Course Descriptions for Higher Education
Zahra Kolagar | Frank Zalkow | Alessandra Zarcone
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Aligning Learning Objectives (LOs) in course descriptions with educational frameworks such as Bloom’s revised taxonomy is an important step in maintaining educational quality, yet it remains a challenging and often manual task. With the growing availability of large language models (LLMs), a natural question arises: can these models meaningfully automate LO classification, or are non-LLM methods still sufficient? In this work, we systematically compare LLM- and non-LLM-based methods for mapping LOs to Bloom’s taxonomy levels, using expert annotations as the gold standard. LLM-based methods consistently outperform non-LLM methods and offer more balanced distributions across taxonomy levels. Moreover, contrary to common concerns, we do not observe significant biases (e.g. verbosity or positional) or notable sensitivity to prompt structure in LLM outputs. Our results suggest that a more consistent and precise formulation of LOs, along with improved methods, could support both automated and expert-driven efforts to better align LOs with taxonomy levels.

2024

HumSum: A Personalized Lecture Summarization Tool for Humanities Students Using LLMs
Zahra Kolagar | Alessandra Zarcone
Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024)

Generative AI systems aim to create customizable content for their users, with a subsequent surge in demand for adaptable tools that can create personalized experiences. This paper presents HumSum, a web-based tool tailored for humanities students to effectively summarize their lecture transcripts and to personalize the summaries to their specific needs. We first conducted a survey driven by different potential scenarios to collect user preferences to guide the implementation of this tool. Utilizing Streamlit, we crafted the user interface, while Langchain’s Map Reduce function facilitated the summarization process for extensive lectures using OpenAI’s GPT-4 model. HumSum is an intuitive tool serving various summarization needs, infusing personalization into the tool’s functionality without necessitating the collection of personal user data.

Aligning Uncertainty: Leveraging LLMs to Analyze Uncertainty Transfer in Text Summarization
Zahra Kolagar | Alessandra Zarcone
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)

Automatically generated summaries can be evaluated along different dimensions, one being how faithfully the uncertainty from the source text is conveyed in the summary. We present a study on uncertainty alignment in automatic summarization, starting from a two-tier lexical and semantic categorization of linguistic expressions of uncertainty, which we used to annotate source texts and automatically generate summaries. We collected a diverse dataset including news articles and personal blogs and generated summaries using GPT-4. Source texts and summaries were annotated based on our two-tier taxonomy using a markup language. The automatic annotation was refined and validated by subsequent iterations based on expert input. We propose a method to evaluate the fidelity of uncertainty transfer in text summarization. The method capitalizes on a small amount of expert annotations and on the capabilities of large language models (LLMs) to evaluate how the uncertainty of the source text aligns with the uncertainty expressions in the summary.

2023

EduQuick: A Dataset Toward Evaluating Summarization of Informal Educational Content for Social Media
Zahra Kolagar | Sebastian Steindl | Alessandra Zarcone
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

This study explores the capacity of large language models (LLMs) to efficiently generate summaries of informal educational content tailored for platforms like TikTok. It also investigates how both humans and LLMs assess the quality of these summaries, based on a series of experiments, exploring the potential replacement of human evaluation with LLMs. Furthermore, the study delves into how experienced content creators perceive the utility of automatic summaries for TikTok videos. We employ strategic prompt selection techniques to guide LLMs in producing engaging summaries based on the characteristics of viral TikTok content, including hashtags, captivating hooks, storytelling, and user engagement. The study leverages OpenAI’s GPT-4 model to generate TikTok content summaries, aiming to align them with the essential features identified. By employing this model and incorporating human evaluation and expert assessment, this research endeavors to shed light on the intricate dynamics of modern content creation, where AI and human ingenuity converge. Ultimately, it seeks to enhance strategies for disseminating and evaluating educational information effectively in the realm of social media.

2022

GiCCS: A German in-Context Conversational Similarity Benchmark
Shima Asaadi | Zahra Kolagar | Alina Liebel | Alessandra Zarcone
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

The semantic textual similarity (STS) task is commonly used to evaluate the semantic representations that language models (LMs) learn from texts, under the assumption that good-quality representations will yield accurate similarity estimates. When it comes to estimating the similarity of two utterances in a dialogue, however, the conversational context plays a particularly important role. We argue for the need for benchmarks specifically created using conversational data in order to evaluate conversational LMs in the STS task. We introduce GiCCS, a first conversational STS evaluation benchmark for German. We collected the similarity annotations for GiCCS using best-worst scaling and presenting the target items in context, in order to obtain highly reliable context-dependent similarity scores. We present benchmarking experiments for evaluating LMs on capturing the similarity of utterances. Results suggest that pretraining LMs on conversational data and providing conversational context can be useful for capturing similarity of utterances in dialogues. GiCCS will be publicly available to encourage benchmarking of conversational LMs.

2020

PATE: A Corpus of Temporal Expressions for the In-car Voice Assistant Domain
Alessandra Zarcone | Touhidul Alam | Zahra Kolagar
Proceedings of the Twelfth Language Resources and Evaluation Conference

The recognition and automatic annotation of temporal expressions (e.g. “Add an event for tomorrow evening at eight to my calendar”) is a key module for AI voice assistants, in order to allow them to interact with apps (for example, a calendar app). However, in the NLP literature, research on temporal expressions has focused mostly on data from the news, from the clinical domain, and from social media. The voice assistant domain is very different from the typical domains that have been the focus of work on temporal expression identification, thus requiring a dedicated data collection. We present a crowdsourcing method for eliciting natural-language commands containing temporal expressions for an AI voice assistant, by using pictures and scenario descriptions. We annotated the elicited commands (480) as well as the commands in the Snips dataset following the TimeML/TIMEX3 annotation guidelines, reaching a total of 1188 annotated commands. The commands can later be used to train the NLU components of an AI voice assistant.
