Yixuan Zhang

Also published as: 艺璇张

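2025

"This paper is the summary report of the Fifth Chinese Abstract Meaning Representation Parsing Evaluation (CAMRP 2025). CAMRP 2025 comprises two subtasks: sentence-level Chinese Abstract Meaning Representation (CAMR) parsing, and CAMR discourse-level coreference resolution. A total of 96 teams registered for the evaluation and 4 teams submitted results, yielding 26 valid scores in all. The team from Harbin Institute of Technology (Shenzhen) achieved an F-score of 84.72% in the open test, the best result in the five-year history of the CAMRP evaluation series. The same team also obtained the highest score, 61.15%, on the discourse coreference resolution task, a substantial improvement over the baseline. The participants' experimental results show that although strategies based on supervised fine-tuning and graph aggregation perform well on sentence-level parsing, large models still struggle to identify fine-grained discourse coreference relations. How to exploit the structured information in CAMR effectively to improve large models' discourse coreference resolution remains an important direction for future research."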

Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal components order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.

NAT: Enhancing Agent Tuning with Negative Samples
Renxi Wang | Xudong Han | Yixuan Zhang | Timothy Baldwin | Haonan Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Interaction trajectories between agents and environments have proven effective in tuning LLMs into task-specific agents. However, constructing these trajectories, especially successful trajectories, is often computationally and time intensive due to the relatively low success rates of even the most advanced LLMs, such as GPT-4 and Claude. Additionally, common training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL) not only require large volumes of data but also have specific demands regarding the trajectories used. For instance, existing SFT approaches typically utilize only positive examples, limiting their efficiency in low-resource scenarios. To address this, we introduce Negative-Aware Training (NAT), a straightforward yet effective method that leverages both successful and failed trajectories for fine-tuning, maximizing the utility of limited resources. Experimental results demonstrate that NAT consistently surpasses existing methods, including SFT, DPO, and PPO, across various tasks.

2024

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance is becoming more important and more challenging. This paper aims to address this issue for Mandarin Chinese in the form of CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural sciences, social sciences, engineering, and the humanities. We conduct a thorough evaluation of more than 20 contemporary multilingual and Chinese LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an accuracy of even 60%, which is the pass mark for Chinese exams. This highlights that there is substantial room for improvement in the capabilities of LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models’ performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models for Chinese.

The Fourth Chinese Abstract Meaning Representation Parsing Evaluation
Zhixing Xu | Yixuan Zhang | Bin Li | Junsheng Zhou | Weiguang Qu
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“Abstract Meaning Representation has become a key research area in sentence-level semantic parsing within natural language processing. Substantial progress has been achieved in various NLP tasks using AMR. This paper presents the fourth Chinese Abstract Meaning Representation parsing evaluation, held during the technical evaluation task workshop at CCL 2024. The evaluation also introduced a new test set comprising Ancient Chinese sentences. Results indicated decent performance, with the top team achieving an F1 of 0.8382 in the open modality, surpassing the previous record at CoNLL 2020 by 3.30 percentage points under the MRP metric. However, current large language models perform poorly in AMR parsing of Ancient Chinese, highlighting the need for effective training strategies. The complex syntax and semantics of Ancient Chinese pose significant challenges. Additionally, optimizing transfer learning techniques to better apply knowledge from Chinese Mandarin to Ancient Chinese parsing is crucial. Only through continuous innovation and collaboration can significant advancements in both Ancient Chinese and Chinese Mandarin AMR parsing be achieved.”

从句子图到篇章图——基于抽象语义表示的篇章级共指标注体系设计(Discourse-Level Anaphora Annotation System Based on Abstract Semantic Representation)
Yixuan Zhang (张艺璇) | Bin Li (李斌) | Zhixing Xu (许智星) | Pengxiu Lu (卢芃秀)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

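“Discourse coreference reflects the dynamic shift of concepts within a discourse and has become a research focus in recent years. Building on a review of theoretical work on coreference, this paper surveys related corpora and parsing methods and finds that coreference corpora still have two problems: coreference relations are annotated coarsely, and integration with whole-sentence semantic representations is largely ignored. Taking a sentence-level semantic annotation scheme (Chinese Abstract Meaning Representation) as its basis, this paper builds a discourse coreference scheme and constructs a coreference corpus of 100 documents. The scheme covers 52 intra-sentence semantic relations and 8 discourse coreference relations; the discourse coreference semantic graph built by combining the two provides a new framework and data resource for discourse-level semantic analysis.”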

2023

Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
Yixuan Zhang | Haonan Li
Proceedings of the Ancient Language Processing Workshop

Large language models (LLMs) have demonstrated exceptional language understanding and generation capabilities. However, their ability to comprehend ancient languages, specifically ancient Chinese, remains largely unexplored. To bridge this gap, we introduce ACLUE, an evaluation benchmark designed to assess the language abilities of models in relation to ancient Chinese. ACLUE consists of 15 tasks that cover a range of skills, including phonetic, lexical, syntactic, semantic, inference and knowledge. By evaluating 8 state-of-the-art multilingual and Chinese LLMs, we have observed a significant divergence in their performance between modern Chinese and ancient Chinese. Among the evaluated models, ChatGLM2 demonstrates the highest level of performance, achieving an average accuracy of 37.45%. We have established a leaderboard for communities to assess their models.

Overview of CCL23-Eval Task 2: The Third Chinese Abstract Meaning Representation Parsing Evaluation
Zhixing Xu | Yixuan Zhang | Bin Li | Junsheng Zhou | Weiguang Qu
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“Abstract Meaning Representation has emerged as a prominent area of research in sentence-level semantic parsing within the field of natural language processing in recent years. Substantial progress has been made in various NLP subtasks through the application of AMR. This paper presents the third Chinese Abstract Meaning Representation Parsing Evaluation, held as part of the Technical Evaluation Task Workshop at the 22nd Chinese Computational Linguistics Conference. The evaluation was specifically tailored for the Chinese and utilized the Align-smatch metric as the standard evaluation criterion. Building upon high-quality semantic annotation schemes and annotated corpora, this evaluation introduced a new test set comprising interrogative sentences for comprehensive evaluation. The results of the evaluation, as measured by the F-score, indicate notable performance achievements. The top-performing team attained a score of 0.8137 in the closed test and 0.8261 in the open test, respectively, using the Align-smatch metric. Notably, the leading result surpassed the SOTA performance at CoNLL 2020 by 3.64 percentage points when evaluated using the MRP metric. Further analysis revealed that this significant progress primarily stemmed from improved relation prediction between concepts. However, the challenge of effectively utilizing semantic relation alignments remains an area that requires further enhancement.”