Ioannis Panagiotopoulos
2025
RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation
Ioannis Panagiotopoulos | George Filandrianos | Maria Lymperaiou | Giorgos Stamou
Proceedings of the 31st International Conference on Computational Linguistics
Riddle-solving requires advanced reasoning skills, pushing Large Language Models (LLMs) to engage in abstract thinking and creative problem-solving, often revealing limitations in their cognitive abilities. In this paper, we examine the riddle-solving capabilities of LLMs using a multiple-choice format, exploring how different prompting techniques impact performance on riddles that demand diverse reasoning skills. To enhance results, we introduce RISCORE (RIddle Solving with COntext REconstruction), a novel, fully automated prompting method that generates contextually reconstructed sentence-based puzzles and uses them alongside the original examples to create few-shot exemplars. Our experiments demonstrate that RISCORE significantly improves the performance of language models in both vertical and lateral thinking tasks, surpassing traditional exemplar selection strategies across a variety of few-shot settings.
2024
AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles
Ioannis Panagiotopoulos | George Filandrianos | Maria Lymperaiou | Giorgos Stamou
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In this paper, we outline our submission for the SemEval-2024 Task 9 competition: ‘BRAINTEASER: A Novel Task Defying Common Sense’. We participate in both sub-tasks: Sub-task A (Sentence Puzzle) and Sub-task B (Word Puzzle). We evaluate a range of pre-trained transformer-based language models of different sizes through fine-tuning. We then analyze their scores and responses to help future researchers understand and use these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy score of 81.7% in the Sentence Puzzle and 85.4% in the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30%, respectively.