@inproceedings{tang-etal-2024-towards,
title = "Towards Benchmarking Situational Awareness of Large Language Models:Comprehensive Benchmark, Evaluation and Analysis",
author = "Tang, Guo and
Chu, Zheng and
Zheng, Wenxiang and
Liu, Ming and
Qin, Bing",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.464",
pages = "7904--7928",
abstract = "Situational awareness refers to the capacity to perceive and comprehend the present context and anticipate forthcoming events, which plays a critical role in aiding decision-making, anticipating potential issues, and adapting to dynamic circumstances. Nevertheless, the situational awareness capabilities of large language models have not yet been comprehensively assessed. To address this, we propose SA-Bench, a comprehensive benchmark that covers three tiers of situational awareness capabilities, covering environment perception, situation comprehension and future projection. SA-Bench provides a comprehensive evaluation to explore the situational awareness capabilities of LLMs. We conduct extensive experiments on advanced LLMs, including GPT-4, LLaMA3, Qwen1.5, among others. Our experimental results indicate that even SOTA LLMs still exhibit substantial capability gaps compared to humans. In addition, we thoroughly analysis and examine the challenges encountered by LLMs across various tasks, as well as emphasize the deficiencies they confront. We hope SA-Bench will foster research within the field of situational awareness.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tang-etal-2024-towards">
<titleInfo>
<title>Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis</title>
</titleInfo>
<name type="personal">
<namePart type="given">Guo</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zheng</namePart>
<namePart type="family">Chu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wenxiang</namePart>
<namePart type="family">Zheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ming</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bing</namePart>
<namePart type="family">Qin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EMNLP 2024</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yaser</namePart>
<namePart type="family">Al-Onaizan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohit</namePart>
<namePart type="family">Bansal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Situational awareness refers to the capacity to perceive and comprehend the present context and anticipate forthcoming events, which plays a critical role in aiding decision-making, anticipating potential issues, and adapting to dynamic circumstances. Nevertheless, the situational awareness capabilities of large language models have not yet been comprehensively assessed. To address this, we propose SA-Bench, a comprehensive benchmark that covers three tiers of situational awareness capabilities: environment perception, situation comprehension, and future projection. SA-Bench provides a comprehensive evaluation to explore the situational awareness capabilities of LLMs. We conduct extensive experiments on advanced LLMs, including GPT-4, LLaMA3, and Qwen1.5, among others. Our experimental results indicate that even SOTA LLMs still exhibit substantial capability gaps compared to humans. In addition, we thoroughly analyze and examine the challenges encountered by LLMs across various tasks and emphasize the deficiencies they confront. We hope SA-Bench will foster research within the field of situational awareness.</abstract>
<identifier type="citekey">tang-etal-2024-towards</identifier>
<location>
<url>https://aclanthology.org/2024.findings-emnlp.464</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>7904</start>
<end>7928</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis
%A Tang, Guo
%A Chu, Zheng
%A Zheng, Wenxiang
%A Liu, Ming
%A Qin, Bing
%Y Al-Onaizan, Yaser
%Y Bansal, Mohit
%Y Chen, Yun-Nung
%S Findings of the Association for Computational Linguistics: EMNLP 2024
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F tang-etal-2024-towards
%X Situational awareness refers to the capacity to perceive and comprehend the present context and anticipate forthcoming events, which plays a critical role in aiding decision-making, anticipating potential issues, and adapting to dynamic circumstances. Nevertheless, the situational awareness capabilities of large language models have not yet been comprehensively assessed. To address this, we propose SA-Bench, a comprehensive benchmark that covers three tiers of situational awareness capabilities: environment perception, situation comprehension, and future projection. SA-Bench provides a comprehensive evaluation to explore the situational awareness capabilities of LLMs. We conduct extensive experiments on advanced LLMs, including GPT-4, LLaMA3, and Qwen1.5, among others. Our experimental results indicate that even SOTA LLMs still exhibit substantial capability gaps compared to humans. In addition, we thoroughly analyze and examine the challenges encountered by LLMs across various tasks and emphasize the deficiencies they confront. We hope SA-Bench will foster research within the field of situational awareness.
%U https://aclanthology.org/2024.findings-emnlp.464
%P 7904-7928
Markdown (Informal)
[Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis](https://aclanthology.org/2024.findings-emnlp.464) (Tang et al., Findings 2024)
ACL
Guo Tang, Zheng Chu, Wenxiang Zheng, Ming Liu, and Bing Qin. 2024. Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7904–7928, Miami, Florida, USA. Association for Computational Linguistics.