Identifying Where Large Language Models Struggle in Answering Complex Questions

Xanh Ho; Florian Boudin; Saku Sugawara; Khoa Duong; Akiko Aizawa

Abstract

We design experiments to identify where Large Language Models (LLMs) struggle when answering complex questions. Our focus is on two key stages, mirroring the human QA process: 1) question decomposition, where the model breaks down a complex question into sub-questions, and 2) sub-problem solving, where it addresses each sub-question to obtain the final response. We preprocess and expand three multi-hop datasets to create experimental datasets featuring explicit and implicit multi-hop questions, crowdsourced and templated questions, and varying numbers of hops. Our results show that larger models (Llama 3.1 70B and o1) excel at decomposing explicit multi-hop questions but struggle with implicit ones, while smaller models (e.g., Llama 3.1 8B) have difficulty with both. In the sub-problem solving stage, all models perform well on simple questions with context. Furthermore, we found no correlation between accuracy in the question decomposition stage and final QA performance (direct response), highlighting a key difference between human and LLM reasoning.

Anthology ID: 2026.gem-main.11
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 112–123
URL: https://aclanthology.org/2026.gem-main.11/
Cite (ACL): Xanh Ho, Florian Boudin, Saku Sugawara, Khoa Duong, and Akiko Aizawa. 2026. Identifying Where Large Language Models Struggle in Answering Complex Questions. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 112–123, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Identifying Where Large Language Models Struggle in Answering Complex Questions (Ho et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.11.pdf