Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test?

Paola Merlo


Abstract
How do we evaluate Large Language Models (LLMs) and determine the aspects and limits of their intelligent behaviour? It is currently conjectured that shortcomings of LLMs in multi-linguality and reasoning are due to a lack of ability to generalize. It has been argued that humans, instead, are better at generalization because they tend to extract rules from complex data. We propose a method to evaluate the ability of LLMs to perform rule-based generalization. When exposed to tests of analytic intelligence, such as the visual RAVEN IQ test, human problem-solvers identify the relevant objects in the picture and their relevant attributes, and reason on the basis of rules applied to these objects and attributes. Based on the induced rules, they are able to provide a generalization and a solution to the test. An analogous language task, called Blackbird Language Matrices (BLM), has recently been proposed for LLMs. In this paper, we argue that this task can be used to investigate what linguistic reasoning LLMs develop, by asking them to solve some simple variants of the BLM task. We find that current state-of-the-art generative models, such as ChatGPT, can handle the task in the sense that they easily understand the instructions and provide step-by-step reasoning showing that they can solve two of the main cognitive hurdles: correspondence finding (object and attribute identification) and item novelty. However, overall they cannot find the correct answer, even with considerable help. In particular, they never identify the structure of the problem, exhibiting, we hypothesize, a lack of goal and subgoal management abilities, an ability that has been argued to measure differential abilities in humans. We argue that this finding supports the usefulness of the task as a method to test the limits and specific properties of generalization ability in Large Language Models, providing an intrinsic evaluation method inspired by tests of human intelligence.
Anthology ID:
2023.findings-emnlp.546
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8119–8152
URL:
https://aclanthology.org/2023.findings-emnlp.546
DOI:
10.18653/v1/2023.findings-emnlp.546
Cite (ACL):
Paola Merlo. 2023. Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test?. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8119–8152, Singapore. Association for Computational Linguistics.
Cite (Informal):
Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test? (Merlo, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.546.pdf