@inproceedings{rao-etal-2025-comprehensive,
title = "A Comprehensive {L}iterary {C}hinese Reading Comprehension Dataset with an Evidence Curation Based Solution",
author = "Rao, Dongning and
Zhou, Rongchu and
Chen, Peng and
Jiang, Zhihua",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.177/",
pages = "3583--3603",
ISBN = "979-8-89176-332-6",
abstract = "Low-resource language understanding is challenging, even for large language models (LLMs). An epitome of this problem is the CompRehensive lIterary chineSe readIng comprehenSion (CRISIS), whose difficulties include limited linguistic data, long input, and insight-required questions. Besides the compelling necessity of providing a larger dataset for CRISIS, excessive information, order bias, and entangled conundrums still haunt the CRISIS solutions. Thus, we present the eVIdence cuRation with opTion shUffling and Abstract meaning representation-based cLauses segmenting (VIRTUAL) procedure for CRISIS, with the largest dataset. While the dataset is also named CRISIS, it results from a three-phase construction process, including question selection, data cleaning, and a silver-standard data augmentation step, which augments translations, celebrity profiles, government jobs, reign mottos, and dynasty to CRISIS. The six steps of VIRTUAL include embedding, shuffling, abstract beaning representation based option segmenting, evidence extracting, solving, and voting. Notably, the evidence extraction algorithm facilitates literary Chinese evidence sentences, translated evidence sentences, and annotations of keywords with a similarity-based ranking strategy. While CRISIS congregates understanding-required questions from seven sources, the experiments on CRISIS substantiate the effectiveness of VIRTUAL, with a 7 percent hike in accuracy compared with the baseline. Interestingly, both non-LLMs and LLMs have order bias, and abstract beaning representation based option segmenting is constructive for CRISIS."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="rao-etal-2025-comprehensive">
<titleInfo>
<title>A Comprehensive Literary Chinese Reading Comprehension Dataset with an Evidence Curation Based Solution</title>
</titleInfo>
<name type="personal">
<namePart type="given">Dongning</namePart>
<namePart type="family">Rao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rongchu</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Peng</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhihua</namePart>
<namePart type="family">Jiang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>Low-resource language understanding is challenging, even for large language models (LLMs). An epitome of this problem is the CompRehensive lIterary chineSe readIng comprehenSion (CRISIS), whose difficulties include limited linguistic data, long input, and insight-required questions. Besides the compelling necessity of providing a larger dataset for CRISIS, excessive information, order bias, and entangled conundrums still haunt the CRISIS solutions. Thus, we present the eVIdence cuRation with opTion shUffling and Abstract meaning representation-based cLauses segmenting (VIRTUAL) procedure for CRISIS, with the largest dataset. While the dataset is also named CRISIS, it results from a three-phase construction process, including question selection, data cleaning, and a silver-standard data augmentation step, which augments translations, celebrity profiles, government jobs, reign mottos, and dynasty to CRISIS. The six steps of VIRTUAL include embedding, shuffling, abstract beaning representation based option segmenting, evidence extracting, solving, and voting. Notably, the evidence extraction algorithm facilitates literary Chinese evidence sentences, translated evidence sentences, and annotations of keywords with a similarity-based ranking strategy. While CRISIS congregates understanding-required questions from seven sources, the experiments on CRISIS substantiate the effectiveness of VIRTUAL, with a 7 percent hike in accuracy compared with the baseline. Interestingly, both non-LLMs and LLMs have order bias, and abstract beaning representation based option segmenting is constructive for CRISIS.</abstract>
<identifier type="citekey">rao-etal-2025-comprehensive</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.177/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>3583</start>
<end>3603</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T A Comprehensive Literary Chinese Reading Comprehension Dataset with an Evidence Curation Based Solution
%A Rao, Dongning
%A Zhou, Rongchu
%A Chen, Peng
%A Jiang, Zhihua
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F rao-etal-2025-comprehensive
%X Low-resource language understanding is challenging, even for large language models (LLMs). An epitome of this problem is the CompRehensive lIterary chineSe readIng comprehenSion (CRISIS), whose difficulties include limited linguistic data, long input, and insight-required questions. Besides the compelling necessity of providing a larger dataset for CRISIS, excessive information, order bias, and entangled conundrums still haunt the CRISIS solutions. Thus, we present the eVIdence cuRation with opTion shUffling and Abstract meaning representation-based cLauses segmenting (VIRTUAL) procedure for CRISIS, with the largest dataset. While the dataset is also named CRISIS, it results from a three-phase construction process, including question selection, data cleaning, and a silver-standard data augmentation step, which augments translations, celebrity profiles, government jobs, reign mottos, and dynasty to CRISIS. The six steps of VIRTUAL include embedding, shuffling, abstract meaning representation-based option segmenting, evidence extracting, solving, and voting. Notably, the evidence extraction algorithm facilitates literary Chinese evidence sentences, translated evidence sentences, and annotations of keywords with a similarity-based ranking strategy. While CRISIS congregates understanding-required questions from seven sources, the experiments on CRISIS substantiate the effectiveness of VIRTUAL, with a 7 percent hike in accuracy compared with the baseline. Interestingly, both non-LLMs and LLMs have order bias, and abstract meaning representation-based option segmenting is constructive for CRISIS.
%U https://aclanthology.org/2025.emnlp-main.177/
%P 3583-3603
Markdown (Informal)
[A Comprehensive Literary Chinese Reading Comprehension Dataset with an Evidence Curation Based Solution](https://aclanthology.org/2025.emnlp-main.177/) (Rao et al., EMNLP 2025)