Sean von Bayern

2024

Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization
Alvin Po-Chun Chen | Ray Groshan | Sean von Bayern
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions on lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system’s ability to significantly improve model performance by optimizing prompts and to evaluate the input dataset.

“Keep up the good work!”: Using Constraints in Zero Shot Prompting to Generate Supportive Teacher Responses
E. Margaret Perkoff | Angela Maria Ramirez | Sean von Bayern | Marilyn Walker | James Martin
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Educational dialogue systems have been used to support students and teachers for decades. Such systems rely on explicit pedagogically motivated dialogue rules. With the ease of integrating large language models (LLMs) into dialogue systems, applications have emerged that use model responses directly, without human-written rules, raising concerns about their use in classroom settings. Here, we explore how to constrain LLM outputs to generate appropriate and supportive teacher-like responses. We present results comparing the effectiveness of different constraint variations in a zero-shot prompting setting on a large mathematics classroom corpus. Generated outputs are evaluated with human annotation for Fluency, Relevance, Helpfulness, and Adherence to the provided constraints. Including all constraints in the prompt led to the highest values for Fluency and Helpfulness, and the second highest value for Relevance. The annotation results also demonstrate that the prompts that result in the highest adherence to constraints do not necessarily indicate higher perceived scores for Fluency, Relevance, or Helpfulness. In a direct comparison, all of the non-baseline LLM responses were ranked higher than the actual teacher responses in the corpus over 50% of the time.

Co-authors

Marilyn Walker 1

Venues

SemEval 1
SIGDIAL 1
