Nimet Beyza Bozdag
2026
From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
Parisa Rabbani | Nimet Beyza Bozdag | Dilek Hakkani-Tur
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction changes when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models, like GPT-4o-mini, reveal sycophantic tendencies under social framing, others, like Llama-8B-Instruct, become overly critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment and underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
2023
Arizonans at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis with XLM-T
Nimet Beyza Bozdag | Tugay Bilgis | Steven Bethard
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper presents the systems and approaches of the Arizonans team for SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. We fine-tune XLM-T, a multilingual RoBERTa model pretrained on about 200M tweets. Our final model ranked 9th out of 45 overall, 13th on seen languages, and 8th on unseen languages.
Gallagher at SemEval-2023 Task 5: Tackling Clickbait with Seq2Seq Models
Tugay Bilgis | Nimet Beyza Bozdag | Steven Bethard
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper presents the systems and approaches of the Gallagher team for SemEval-2023 Task 5: Clickbait Spoiling. We propose a method to classify the type of spoiler (phrase, passage, multi) and a question-answering method to generate spoilers that satisfy the curiosity caused by clickbait posts. We experiment with the state-of-the-art Seq2Seq model T5. To identify the spoiler types, we use a fine-tuned T5 classifier (Subtask 1). A mixture of T5 and Flan-T5 is used to generate the spoilers for clickbait posts (Subtask 2). Our system officially ranks first in generating phrase-type spoilers in Subtask 2 and achieves the highest precision score for passage-type spoilers in Subtask 1.