Large Language Models Can Not Perform Well in Understanding and Manipulating Natural Language at Both Character and Word Levels?

Yidan Zhang, Zhenan He


Abstract
Despite their promising performance across various tasks, recent studies reveal that large language models (LLMs) still exhibit significant deficiencies in handling several word-level and character-level tasks, e.g., word unscrambling and sentence editing, indicating an urgent need for substantial improvements in basic language understanding and manipulation. To address these challenges, it is crucial to develop large-scale benchmarks that can comprehensively assess the performance of LLMs on basic language tasks. In this paper, we introduce a bilingual benchmark, CWUM, to investigate the capabilities and limitations of LLMs in understanding and manipulating natural language at both the character and word levels. CWUM consists of 15 simple text editing tasks, e.g., letter counting, word reversing, and Chinese character inserting. We conduct extensive experiments on eight advanced LLMs, including base models and their instruction-tuned (chat) variants. The experimental results highlight significant failures of existing LLMs on CWUM tasks that humans solve with 100% accuracy. On the English tasks of CWUM, the average accuracy of GPT-4, LLaMA-3-70B, and Qwen-72B is 66.64%, 39.32%, and 33.16%, respectively, far below human performance. Instruction tuning the base model does not lead to a distinct performance improvement, as the average accuracy of LLaMA-3-70B-Instruct on English tasks is only 1.44% higher than that of the base LLaMA-3-70B. Finally, we show that supervised fine-tuning (SFT) can enhance model performance on CWUM without compromising the model's general capabilities.
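To make the task types concrete, the sketch below shows what reference answers for two of the character- and word-level tasks named in the abstract (letter counting and word reversing) might look like. The function names, prompts, and answer formats are illustrative assumptions, not the actual CWUM data format; only the task categories come from the abstract.

```python
# Hypothetical illustration of CWUM-style character/word-level tasks.
# The exact prompts and answer formats of the benchmark are assumptions here.

def count_letter(word: str, letter: str) -> int:
    """Letter counting: how many times does `letter` occur in `word`?"""
    return word.lower().count(letter.lower())

def reverse_words(sentence: str) -> str:
    """Word reversing: return the words of the sentence in reverse order."""
    return " ".join(reversed(sentence.split()))

if __name__ == "__main__":
    # Reference answers a human can produce with 100% accuracy,
    # yet the paper reports that strong LLMs frequently get such items wrong.
    assert count_letter("strawberry", "r") == 3
    assert reverse_words("the quick brown fox") == "fox brown quick the"
    print("reference answers computed")
```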
Anthology ID:
2024.findings-emnlp.691
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11826–11842
URL:
https://aclanthology.org/2024.findings-emnlp.691
Cite (ACL):
Yidan Zhang and Zhenan He. 2024. Large Language Models Can Not Perform Well in Understanding and Manipulating Natural Language at Both Character and Word Levels?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11826–11842, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Large Language Models Can Not Perform Well in Understanding and Manipulating Natural Language at Both Character and Word Levels? (Zhang & He, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.691.pdf