We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs’ prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has been challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of reasoning processes. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning of LLMs. Our methodology reveals significant gaps in LLMs’ learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs’ formal reasoning capabilities. We make our code, data, and results publicly available (https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.
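As a rough illustration of what an atomic, rule-level test case looks like, the following is a minimal sketch of generating propositional-logic probes and scoring a model's answers. The rule names, premise templates, and the `query_llm` placeholder are assumptions for illustration only; they do not reproduce LogicAsker's released skill set or evaluation code.

```python
# Minimal sketch: generating atomic propositional-logic test cases and
# scoring an LLM's answers. The inference rules and templates below are
# illustrative; they do not reproduce LogicAsker's actual skill set.

# Each atomic skill pairs premises with the conclusion they entail (or not).
ATOMIC_SKILLS = {
    "modus_ponens": {"premises": ["If it rains, the grass is wet.", "It rains."],
                     "conclusion": "The grass is wet.", "entailed": True},
    "affirming_consequent": {"premises": ["If it rains, the grass is wet.", "The grass is wet."],
                             "conclusion": "It rains.", "entailed": False},
}

def build_prompt(case):
    """Turn one atomic case into a yes/no reasoning question."""
    premises = " ".join(case["premises"])
    return (f"Premises: {premises}\n"
            f"Question: does it logically follow that \"{case['conclusion']}\"? "
            f"Answer yes or no.")

def failure_rate(cases, query_llm):
    """Fraction of atomic cases the model gets wrong.
    `query_llm` is a placeholder for any text-completion call."""
    failures = 0
    for case in cases.values():
        answer = query_llm(build_prompt(case)).strip().lower()
        expected = "yes" if case["entailed"] else "no"
        failures += 0 if answer.startswith(expected) else 1
    return failures / len(cases)

# Example with a trivially wrong "model" that always answers yes:
print(failure_rate(ATOMIC_SKILLS, lambda prompt: "yes"))  # -> 0.5
```

Cases that a model fails under such a probe can then be turned into in-context demonstrations or fine-tuning examples, which is the spirit of the improvement step described above.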
Multimodal Large Language Models (MLLMs) demonstrate a strong understanding of the real world and can even handle complex tasks. However, they still fail on some straightforward visual question-answering (VQA) problems. This paper dives deeper into this issue, revealing that models tend to err when answering easy questions (e.g., Yes/No questions) about an image, even though they can correctly describe it. We refer to this discrepancy in model behavior between difficult and simple questions as model laziness. To systematically investigate model laziness, we manually construct LazyBench, a benchmark that includes Yes/No, multiple-choice, and short-answer questions as well as image description tasks, all concerning the same subjects in the images. Based on LazyBench, we observe that laziness is widespread in current advanced MLLMs (e.g., GPT-4o, Gemini-1.5-pro, Claude 3, LLaVA-1.5, LLaVA-1.6, and Qwen-VL). We also analyze the failure cases of LLaVA-1.5-13B on the VQA-v2 benchmark and find that about half of these failures are due to the model’s laziness, which further highlights the importance of ensuring that the model fully utilizes its capability. To this end, we conduct a preliminary exploration of how to mitigate laziness and find that chain-of-thought prompting can effectively avoid this issue. The data can be accessed at https://github.com/Akutagawa1998/LazyBench.
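To make the notion of laziness concrete, here is a minimal sketch of how one might flag lazy failures by pairing an easy question with a description task about the same subject. The record format, field names, and judging logic are assumptions for illustration, not the released LazyBench evaluation code.

```python
# Minimal sketch: a case is counted as "lazy" when the model describes the
# image correctly but still misses the easy question about the same subject.
# Field names and judging logic are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EvalRecord:
    image_id: str
    easy_question_correct: bool   # e.g., Yes/No or multiple-choice question
    description_correct: bool     # image description covering the same subject

def laziness_rate(records):
    """Fraction of easy-question failures explained by laziness."""
    failures = [r for r in records if not r.easy_question_correct]
    if not failures:
        return 0.0
    lazy = [r for r in failures if r.description_correct]
    return len(lazy) / len(failures)

records = [
    EvalRecord("img_001", easy_question_correct=False, description_correct=True),
    EvalRecord("img_002", easy_question_correct=False, description_correct=False),
    EvalRecord("img_003", easy_question_correct=True,  description_correct=True),
]
print(laziness_rate(records))  # -> 0.5: half of the failures look like laziness
```

In the same spirit, the chain-of-thought mitigation amounts to asking the model to reason about (e.g., describe) the image first and only then commit to an answer for the easy question.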
Recently, ChatGPT has demonstrated remarkable performance in various downstream tasks such as open-domain question answering, machine translation, and code generation. As a general-purpose task solver, an intriguing inquiry arises: does ChatGPT itself know what it does not know, without any access to its internal states? In response to this query, we present an initial evaluation of ChatGPT for black-box calibration. We design three types of proxy confidence, from three perspectives, to assess its performance. Experiments are conducted on five datasets spanning four tasks, and the results show that ChatGPT has a degree of capability for black-box calibration. Specifically, proxy confidence displays a significantly positive Pearson correlation (95.16%) with accuracy on the TruthfulQA dataset, while revealing a negative correlation on the ModAr dataset. We delve deeper into ChatGPT’s black-box calibration ability by examining failure cases in the ModAr dataset. Our analysis reveals that ChatGPT’s tendency to exhibit overconfidence may stem from its reliance on semantic priors. Furthermore, we investigate why ChatGPT performs relatively well on TruthfulQA. The findings suggest that ChatGPT might implicitly acquire calibration skills during the reinforcement learning process, rather than relying solely on simplistic heuristics.
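The calibration measure reported above is a Pearson correlation between per-example proxy confidence and correctness. The following is a minimal sketch of that computation; the confidence values and correctness labels are made-up placeholders, not results from the paper, and the choice of proxy signal is only an assumption.

```python
# Minimal sketch: correlating a black-box proxy confidence with correctness.
# The scores below are fabricated placeholders, not results from the paper.

from scipy.stats import pearsonr

# Proxy confidence per example (e.g., a self-reported confidence score or
# answer consistency across repeated samples -- assumed proxies).
proxy_confidence = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
# 1 if the model's answer was judged correct, 0 otherwise.
correct = [1, 1, 1, 0, 1, 0]

r, p_value = pearsonr(proxy_confidence, correct)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

A strongly positive r means the proxy confidence tracks accuracy (good black-box calibration), while a negative r, as reported for ModAr, means the model is most confident precisely where it is wrong.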