2024
pdf
bib
abs
Towards Benchmarking Situational Awareness of Large Language Models:Comprehensive Benchmark, Evaluation and Analysis
Guo Tang
|
Zheng Chu
|
Wenxiang Zheng
|
Ming Liu
|
Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2024
Situational awareness refers to the capacity to perceive and comprehend the present context and anticipate forthcoming events, which plays a critical role in aiding decision-making, anticipating potential issues, and adapting to dynamic circumstances. Nevertheless, the situational awareness capabilities of large language models have not yet been comprehensively assessed. To address this, we propose SA-Bench, a comprehensive benchmark that covers three tiers of situational awareness capabilities, covering environment perception, situation comprehension and future projection. SA-Bench provides a comprehensive evaluation to explore the situational awareness capabilities of LLMs. We conduct extensive experiments on advanced LLMs, including GPT-4, LLaMA3, Qwen1.5, among others. Our experimental results indicate that even SOTA LLMs still exhibit substantial capability gaps compared to humans. In addition, we thoroughly analysis and examine the challenges encountered by LLMs across various tasks, as well as emphasize the deficiencies they confront. We hope SA-Bench will foster research within the field of situational awareness.
pdf
bib
abs
An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation
Kun Zhu
|
Xiaocheng Feng
|
Xiyuan Du
|
Yuxuan Gu
|
Weijiang Yu
|
Haotian Wang
|
Qianglong Chen
|
Zheng Chu
|
Jingchang Chen
|
Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-augmented generation integrates the capabilities of large language models with relevant information retrieved from an extensive corpus, yet encounters challenges when confronted with real-world noisy data. One recent solution is to train a filter module to find relevant content but only achieve suboptimal noise compression. In this paper, we propose to introduce the information bottleneck theory into retrieval-augmented generation. Our approach involves the filtration of noise by simultaneously maximizing the mutual information between compression and ground output, while minimizing the mutual information between compression and retrieved passage. In addition, we derive the formula of information bottleneck to facilitate its application in novel comprehensive evaluations, the selection of supervised fine-tuning data, and the construction of reinforcement learning rewards. Experimental results demonstrate that our approach achieves significant improvements across various question answering datasets, not only in terms of the correctness of answer generation but also in the conciseness with 2.5% compression rate.
pdf
bib
abs
Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future
Zheng Chu
|
Jingchang Chen
|
Qianglong Chen
|
Weijiang Yu
|
Tao He
|
Haotian Wang
|
Weihua Peng
|
Ming Liu
|
Bing Qin
|
Ting Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence.Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLM’s reasoning capabilities, which attracts widespread attention from both academics and industry.In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives.Moreover, we delve into the current frontiers and delineate the challenges and future directions, thereby shedding light on future research.Furthermore, we engage in a discussion about open questions.We hope this paper serves as an introduction for beginners and fosters future research.Resources have been made publicly available at https://github.com/zchuz/CoT-Reasoning-Survey
pdf
bib
abs
TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Zheng Chu
|
Jingchang Chen
|
Qianglong Chen
|
Weijiang Yu
|
Haotian Wang
|
Ming Liu
|
Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world.Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark.To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena.TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models.We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings.Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning.Besides, LLMs exhibit capability discrepancies across different reasoning categories.Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges.We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning.Code and data are available at https://github.com/zchuz/TimeBench.
pdf
bib
abs
BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
Zheng Chu
|
Jingchang Chen
|
Qianglong Chen
|
Haotian Wang
|
Kun Zhu
|
Xiyuan Du
|
Weijiang Yu
|
Ming Liu
|
Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated strong reasoning capabilities.Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks.Retrieval-augmented reasoning represents a promising approach.However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge.To address this, we propose Beam Aggregation Reasoning (BeamAggR), a reasoning framework for knowledge-intensive multi-hop QA.BeamAggR explores and prioritizes promising answers at each hop of question.Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning.For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates.For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory.Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%.Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
2023
pdf
bib
abs
MTGER: Multi-view Temporal Graph Enhanced Temporal Reasoning over Time-Involved Document
Zheng Chu
|
Zekun Wang
|
Jiafeng Liang
|
Ming Liu
|
Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2023
The facts and time in the document are intricately intertwined, making temporal reasoning over documents challenging. Previous work models time implicitly, making it difficult to handle such complex relationships. To address this issue, we propose MTGER, a novel Multi-view Temporal Graph Enhanced Reasoning framework for temporal reasoning over time-involved documents. Concretely, MTGER explicitly models the temporal relationships among facts by multi-view temporal graphs. On the one hand, the heterogeneous temporal graphs explicitly model the temporal and discourse relationships among facts; on the other hand, the multi-view mechanism captures both time-focused and fact-focused information, allowing the two views to complement each other through adaptive fusion. To further improve the implicit reasoning capability of the model, we design a self-supervised time-comparing objective. Extensive experimental results demonstrate the effectiveness of our method on the TimeQA and SituatedQA datasets. Furthermore, MTGER gives more consistent answers under question perturbations.
2022
pdf
bib
abs
HIT at SemEval-2022 Task 2: Pre-trained Language Model for Idioms Detection
Zheng Chu
|
Ziqing Yang
|
Yiming Cui
|
Zhigang Chen
|
Ming Liu
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
The same multi-word expressions may have different meanings in different sentences. They can be mainly divided into two categories, which are literal meaning and idiomatic meaning. Non-contextual-based methods perform poorly on this problem, and we need contextual embedding to understand the idiomatic meaning of multi-word expressions correctly. We use a pre-trained language model, which can provide a context-aware sentence embedding, to detect whether multi-word expression in the sentence is idiomatic usage.