Evaluating Chinese Large Language Models on Discipline Knowledge Acquisition via Memorization and Robustness Assessment

Chuang Liu, Renren Jin, Mark Steedman, Deyi Xiong


Abstract
Chinese LLMs demonstrate impressive performance on NLP tasks, particularly on discipline knowledge benchmarks, with some results approaching those of GPT-4. Previous research has suggested that these advances may partly stem from data contamination or leakage, prompting efforts to develop new detection methods and to address evaluation issues in LLM benchmarks. However, a comprehensive assessment of how Chinese LLMs have evolved has been lacking. To address this gap, this paper presents a thorough investigation of Chinese LLMs on discipline knowledge evaluation, examining the progress of various LLMs, including a group of related models and others. Specifically, we conduct six assessments ranging from knowledge memorization to comprehension and robustness, covering tasks such as predicting incomplete questions and options, identifying behaviors induced by fine-tuning on contaminated data, and answering rephrased questions. Experimental findings indicate a positive correlation between the release time of LLMs and their memorization capabilities, yet the models struggle with variations of the original question-option pairs. Our findings further suggest that question descriptions have a greater impact on LLMs' performance than the options do.
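The option-prediction assessment mentioned in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical illustration (not the authors' released code): it masks one option of a multiple-choice item and checks whether the model reproduces it verbatim, which would indicate memorization of the question-options pair. The `generate` callable and both function names are assumptions standing in for whatever model wrapper is under test.

    # Minimal sketch of an option-completion memorization probe.
    # Assumptions (not from the paper's code): `generate(prompt) -> str` wraps
    # the LLM under test; benchmark items are multiple-choice questions with
    # lettered options, as in Chinese discipline-knowledge exams.

    def build_completion_prompt(question: str, options: dict[str, str], masked: str) -> str:
        """Show the question and all options except `masked`, then prompt the
        model to complete the missing option."""
        shown = "\n".join(f"{key}. {text}" for key, text in options.items() if key != masked)
        return f"{question}\n{shown}\n{masked}."

    def is_memorized(generate, question: str, options: dict[str, str], masked: str = "D") -> bool:
        """Return True if the model reproduces the masked option (near-)verbatim,
        which is evidence the question-options pair appeared in training data."""
        prediction = generate(build_completion_prompt(question, options, masked)).strip()
        return prediction.startswith(options[masked].strip())

A robustness check in the same spirit would rephrase `question` (e.g. via paraphrasing) and compare accuracy on the original versus the rephrased items; a large gap suggests memorization rather than comprehension.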
Anthology ID:
2024.conda-1.1
Volume:
Proceedings of the 1st Workshop on Data Contamination (CONDA)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Oscar Sainz, Iker García-Ferrero, Eneko Agirre, Jon Ander Campos, Alon Jacovi, Yanai Elazar, Yoav Goldberg
Venues:
CONDA | WS
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://aclanthology.org/2024.conda-1.1
DOI:
10.18653/v1/2024.conda-1.1
Cite (ACL):
Chuang Liu, Renren Jin, Mark Steedman, and Deyi Xiong. 2024. Evaluating Chinese Large Language Models on Discipline Knowledge Acquisition via Memorization and Robustness Assessment. In Proceedings of the 1st Workshop on Data Contamination (CONDA), pages 1–12, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Evaluating Chinese Large Language Models on Discipline Knowledge Acquisition via Memorization and Robustness Assessment (Liu et al., CONDA-WS 2024)
PDF:
https://aclanthology.org/2024.conda-1.1.pdf