FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Jaewoo Ahn; Junseo Kim; Heeseung Yun; Jaehyeon Son; Dongmin Park; Jaewoong Cho; Gunhee Kim

doi:10.18653/v1/2025.emnlp-main.1192

Abstract

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

Anthology ID: 2025.emnlp-main.1192
Original: 2025.emnlp-main.1192v1
Version 2: 2025.emnlp-main.1192v2
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 23354–23384
URL: https://aclanthology.org/2025.emnlp-main.1192/
DOI: 10.18653/v1/2025.emnlp-main.1192
Cite (ACL): Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, and Gunhee Kim. 2025. FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23354–23384, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games (Ahn et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1192.pdf
Checklist: 2025.emnlp-main.1192.checklist.pdf
