Yuan Sun - ACL Anthology

Yuan Sun

Also published as: 媛孙, YUan Sun

2025

CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages
Wenhao Zhuang | Yuan Sun
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality for Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs’ ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
Yijie Li | Yuan Sun
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

Recently, there has been a growing trend of employing large language models (LLMs) to judge the quality of other LLMs. Many studies have adopted closed-source models, mainly using GPT-4 as the evaluator. However, because GPT-4 is closed-source, employing it as an evaluator raises issues of transparency, controllability, and cost-effectiveness. Some researchers have turned to fine-tuned open-source LLMs as evaluators. However, existing open-source evaluation LLMs generally lack a user-friendly visualization tool, and they have not been optimized for accelerated model inference, which causes inconvenience for researchers with limited resources and those working across different fields. This paper presents EasyJudge, a model developed to evaluate large language model responses. It is lightweight, precise, efficient, and user-friendly, featuring an intuitive visualization interface for ease of deployment and use. EasyJudge uses detailed datasets and refined prompts for model optimization, achieving strong consistency with human and proprietary model evaluations. The model optimized with quantitative methods enables EasyJudge to run efficiently on consumer-grade GPUs or even CPUs.

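基于思维链和知识迁移的多语言问答推理研究(Research on Multilingual Question Answering Reasoning Based on Chain-of-Thought and Knowledge Transfer)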
Jian Luo | Yuan Sun
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

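In recent years, large language models such as ChatGPT have significantly improved machines' ability to understand natural language. Question-answering reasoning plays an important role in advancing language understanding and intelligent human-computer interaction, but it still faces many challenges. To address the high resource consumption of large models, the weak reasoning ability of small models, and the limited reasoning ability in low-resource languages, this paper proposes a method that combines chain-of-thought and fine-tuning techniques. It optimizes the reasoning ability of large models through a Human-Thinking prompting strategy, improves the reasoning performance of small models through instruction fine-tuning with large models, and introduces a multi-role collaboration mechanism to further improve the quality of reasoning steps. By exploring cross-lingual chain-of-thought prompting, the method uses knowledge from high-resource languages to compensate for the shortcomings of low-resource languages, and it adopts a dual-channel mechanism and a voting-and-scoring mechanism to integrate reasoning knowledge from different languages, improving the model's reasoning performance in low-resource languages. Experimental results show that the proposed method effectively improves the ability of small models in multilingual question-answering reasoning and has certain research value.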

Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
Xi Cao | Yuan Sun | Jiajun Li | Quzong Gesang | Nuo Qun | Nyima Tashi
Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

DNN-based language models excel across various NLP tasks but remain highly vulnerable to textual adversarial attacks. While adversarial text generation is crucial for NLP security, explainability, evaluation, and data augmentation, related work remains overwhelmingly English-centric, leaving the problem of constructing high-quality and sustainable adversarial robustness benchmarks for lower-resourced languages both difficult and understudied. First, method customization for lower-resourced languages is complicated due to linguistic differences and limited resources. Second, automated attacks are prone to generating invalid or ambiguous adversarial texts. Last but not least, language models continuously evolve and may be immune to parts of previously generated adversarial texts. To address these challenges, we introduce HITL-GAT, an interactive system based on a general approach to human-in-the-loop generation of adversarial texts. Additionally, we demonstrate the utility of HITL-GAT through a case study on Tibetan script, employing three customized adversarial text generation methods and establishing its first adversarial robustness benchmark, providing a valuable reference for other lower-resourced languages.

TibLex:一种基于拉丁编码的藏文词表优化策略(TibLex: A Tibetan Vocabulary Optimization Strategy Based on Latin Encoding)
更尕多杰 | Yuan Sun
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

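Pre-trained language models show excellent performance in multi-task scenarios through large-scale unsupervised learning, but research on them has focused mostly on high-resource languages such as Chinese and English. For low-resource languages such as Tibetan, data scarcity and morphological complexity (agglutinative characteristics, diverse syllable structures) cause mainstream subword tokenization methods to suffer from semantic fragmentation and morphological mismatch, which limits training efficiency and representation quality. To address this, this paper proposes TibLex (Tibetan Latinization-based Extended Tokenizer), an extended Tibetan tokenization strategy based on Latinized encoding. The method transliterates the input text, converting each Tibetan syllable into a short sequence according to its glyph form or pronunciation, and then builds the vocabulary by applying subword tokenization to the encoded text. Experiments show that TibLex has two advantages over mainstream tokenizers: (1) Through Latinization-based dimensionality reduction, irregular combinations in the vocabulary are reduced by 15% and the input sequence length is shortened by 36.10% on average, significantly improving computational efficiency. (2) The transliteration-based tokenizer encodes homophones with different written forms into the same transliterated sequence and produces consistent tokenization results, making it robust to homophone misspellings. Meanwhile, pre-trained models trained with TibLex remain competitive on downstream tasks, verifying the effectiveness of the method in low-resource language scenarios. This work offers a new paradigm for resolving the tokenization bottleneck of morphologically complex languages; its encoding framework can be extended to writing systems such as Mongolian and Sanskrit, providing technical support for cross-lingual NLP research.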

Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages
Wenhao Zhuang | Yuan Sun | Xiaobing Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to extend effectively to low-resource languages, particularly those that use non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, there is currently no comprehensive framework for integrating transliteration into LLM training and deployment. Taking a pragmatic approach, this paper combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: Reduces storage requirements for low-resource language content, achieving up to 50% reduction in file size and 50-80% reduction in token count. 2) Accuracy: Guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: Eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: The framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model’s capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at https://github.com/CMLI-NLP/HuffmanTranslit.

2024

TiLamb:基于增量预训练的藏文大语言模型(TiLamb: A Tibetan Large Language Model Based on Incremental Pre-training)
Wenhao Zhuang (庄文浩) | Yuan Sun (孙媛) | Xiaobing Zhao (赵小兵)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

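Language models based on the "pre-training + fine-tuning" paradigm have shown excellent performance, and as model size and training data have grown, their ability to solve a variety of natural language processing tasks has improved markedly. Current large language models mainly support mainstream languages such as English and Chinese, which limits research on low-resource languages such as Tibetan in this field. To address the scarcity of Tibetan data, the unsatisfactory performance of existing Tibetan pre-trained models, and poor extensibility to downstream tasks, this paper collects and cleans 26.43GB of Tibetan data, takes the open-source LLaMA2-7B as the base model, and expands the existing LLaMA2 vocabulary with about 30,000 Tibetan tokens to improve its Tibetan encoding efficiency and its semantic understanding of Tibetan. Incremental pre-training then yields the Tibetan large language model base TiLamb. Fine-tuning datasets ranging from several thousand to tens of thousands of examples are built for a variety of Tibetan downstream tasks, and the fine-tuned TiLamb is validated on seven downstream tasks: Tibetan news classification, Tibetan entity relation classification, Tibetan machine reading comprehension, Tibetan word segmentation, Tibetan summarization, Tibetan question answering, and Tibetan question generation. Results on multiple metrics improve substantially over traditional methods and other Tibetan pre-trained models. TiLamb and some of the resources are released for research use at https://github.com/NLP-Learning/TiLamb.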

面向对话式阅读理解的高质量藏语数据集构建(Construction of high-quality Tibetan language dataset for conversational reading comprehension)
Cairen Dawa (达哇才仁) | Cairang Pengmao (朋毛才让) | Yuan Sun (孙媛)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

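As an important research direction in conversational artificial intelligence, conversational reading comprehension aims to enable machines to understand natural language text and to carry out multi-turn dialogue to answer questions about it. With the development of generative large models, this task has also become an important indicator for evaluating large-model performance. In this process, building high-quality datasets has become a key task in the field. Related models have made remarkable progress on many English datasets, even surpassing human performance. However, for low-resource languages, especially Tibetan, which lacks corresponding datasets, research on conversational reading comprehension is still in its infancy. This paper adopts a strategy combining manual and semi-automatic methods to construct TiconvQA (Tibetan Conversational Question Answering), a Tibetan conversational reading comprehension dataset. The dataset contains 20,358 dialogue pairs covering three domains: people, geography, and news. Each dialogue includes the source passage and multiple consecutive turns of question-answer pairs generated from the passage. The paper analyzes and evaluates the quality of TiconvQA in detail in terms of the diversity, relevance, and linguistic phenomena of the dialogue data, and it optimizes five factors that affect evaluation metrics in the Tibetan conversational reading comprehension task. Finally, we evaluate the dataset experimentally with three classic conversational reading comprehension models and the Tibetan large model TiLamb; the results verify the quality of the dataset and show that TiconvQA can be used to evaluate model performance on conversational reading comprehension.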

TiComR:基于提示的藏文对话型阅读理解模型(TiComR: A Prompt-based Tibetan Conversational Reading Comprehension Model)
Cairang Pengmao (朋毛才让) | Yuan Sun (孙媛)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

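Existing conversational reading models perform well on Chinese and English conversational reading comprehension tasks, but because Tibetan differs markedly from Chinese and English in grammatical structure and modes of expression, these models have difficulty modeling the dialogue history in Tibetan conversational reading comprehension. In view of this, this paper leverages the strong capabilities of current large models and proposes TiComR, a prompt-based dialogue history modeling method, to address the limited model performance on Tibetan conversational reading comprehension. The method introduces a prompt-based learning mechanism that adds prompts directly to the passage text to highlight the dialogue history, rather than modifying the passage token embeddings, so that dialogue history is modeled precisely during fine-tuning and the model's understanding of the question is strengthened. Experimental results show that TiComR achieves significant performance gains on Tibetan conversational reading comprehension and also performs well on the English dataset CoQA. TiComR is released for research use at http://github.com/Tshor/TicomR.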

2023

TiKEM:基于知识增强的藏文预训练语言模型(TiKEM: Knowledge Enhanced Tibetan Pre-trained Language Model)
Junjie Deng (邓俊杰) | Long Chen (陈龙) | Yan Zhang (张廷) | YUan Sun (孙媛) | Xiaobin Zhao (赵小兵)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

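Pre-trained language models perform very well in Chinese and English, whereas data for low-resource languages is hard to obtain, and research on pre-trained language models for low-resource languages such as Tibetan has only just made initial progress. Existing Tibetan pre-trained language models use large-scale unstructured text corpora for self-supervised learning and lack guidance from external knowledge, so their knowledge memorization and knowledge reasoning abilities are limited. To address these problems, this paper constructs a Tibetan knowledge-enhanced pre-training dataset containing 500,000 knowledge triples and combines structured knowledge representations with unstructured text representations to train TiKEM, a knowledge-enhanced Tibetan pre-trained language model, in order to improve the model's knowledge memorization and reasoning abilities. Finally, the paper verifies the effectiveness of the model on three downstream tasks: text classification, entity relation classification, and machine reading comprehension.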

基于数据增强的藏文机器阅读有难度问题的生成(Difficult Question Generation of Tibetan Machine Reading Based on Data Enhancement)
Zhengcuo Dan (旦正错) | Long Chen (陈龙) | Junjie Deng (邓俊杰) | Xian Pang (庞仙) | Yuan Sun (孙媛)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

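Question generation is a subtask of building machine reading comprehension datasets, in which a computer generates fluent questions from a given context with (or without) answers. In Chinese and English, end-to-end question generation models have been well developed, and large numbers of high-quality question-answer pairs have been built. In low-resource languages (Tibetan), however, data-driven tasks such as machine reading comprehension and intelligent question answering systems still commonly suffer from small data volumes and overly simple question-answer pairs. This paper therefore proposes three methods for generating difficult questions for Tibetan machine reading: (1) generating unanswerable questions by masking and replacing keywords based on a Tibetan pre-trained language model; (2) generating unanswerable questions by crossing the questions of similar passages; (3) generating questions that require knowledge reasoning from triples. Finally, experiments on the constructed dataset show that a machine reading comprehension dataset containing unanswerable and knowledge-reasoning question types places higher demands on a model's comprehension ability. In addition, the quality of the constructed unanswerable questions is verified at three levels: readability, relevance, and answerability.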

Improving Low-resource Question Answering by Augmenting Question Information
Andong Chen | Yuan Sun | Xiaobing Zhao | Rosella Galindo Esparza | Kehai Chen | Yang Xiang | Tiejun Zhao | Min Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023

In the era of large models, low-resource question-answering tasks lag behind, which underscores the importance of data augmentation, a key research avenue in natural language processing. The main challenges include leveraging the large model’s internal knowledge for data augmentation, determining which QA data component (the question, passage, or answer) benefits most from augmentation, and retaining consistency in the augmented content without inducing excessive noise. To tackle these, we introduce PQQ, an approach for question data augmentation consisting of Prompt Answer, Question Generation, and Question Filter. Our experiments reveal that ChatGPT underperforms on the experimental data, yet our PQQ method outperforms existing augmentation strategies. Further, its broader applicability is validated through successful tests on high-resource QA tasks such as SQuAD1.1 and TriviaQA.

TiKG-30K:基于表示学习的藏语知识图谱数据集(TiKG-30K: A Tibetan Knowledge Graph Dataset Based on Representation Learning)
Wenhao Zhuang (庄文浩) | Ge Gao (高歌) | Yuan Sun (孙媛)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

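Representation learning for knowledge graphs aims to learn the complex semantic associations in knowledge graph data by mapping entities and relations into a low-dimensional vector space, supporting research in information retrieval, intelligent question answering, knowledge reasoning, and related areas. Current research on knowledge graph representation learning focuses mainly on languages such as English and Chinese, where public high-quality datasets (such as FB15k-237 and WN18RR) have played a very important role. For low-resource languages such as Tibetan, however, the lack of public knowledge graph datasets means that related research is still in its infancy. This paper therefore presents TiKG-30K, a public Tibetan knowledge graph dataset containing 146,679 triples, 30,986 entities, and 641 relations, which can be used for knowledge graph representation learning and downstream tasks. To address the small size and sparsity of existing Tibetan knowledge graphs, the paper uses coreference relations among entities in Tibetan triples and draws on the rich knowledge bases of other languages and on non-textual media to expand the knowledge base. It then optimizes the knowledge graph at multiple levels through cross-lingual near-synonym retrieval, merging of synonymous entities and relations, and correction of erroneous triples, finally producing TiKG-30K. Finally, the paper runs experiments on TiKG-30K with several classic representation learning models and compares the results with the English datasets FB15k-237 and WN18RR and the Tibetan dataset TD50K; the results show that TiKG-30K is comparable to FB15k-237 and WN18RR. The TiKG-30K dataset is publicly available at http://tikg-30k.cmli-nlp.com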

2022

Question Generation Based on Grammar Knowledge and Fine-grained Classification
Yuan Sun | Sisi Liu | Zhengcuo Dan | Xiaobing Zhao
Proceedings of the 29th International Conference on Computational Linguistics

Question generation is the task of automatically generating questions based on a given context and answers, and a common problem is that the types of the generated questions do not match the answers. In minority languages such as Tibetan, where the grammar rules are complex and training data is scarce, research on question generation is still in its infancy. To solve these problems, this paper constructs a question type classifier and a question generator. We perform a fine-grained division of question types and integrate grammatical knowledge into the question type classifier to improve the accuracy of question types. Then, the types predicted by the question type classifier are fed into the question generator. Our model improves the accuracy of interrogative words in generated questions, reaching BLEU-4 scores of 17.52 on SQuAD, 19.31 on HotpotQA, and 25.58 on TibetanQA.

2021

基于枢轴语言系统融合的词汇混淆网络神经机器翻译(Neural Machine Translation for Vocabulary Confusion Network Based on Pivotal Language System Fusion)
Xiaobing Zhao (赵小兵) | Bo Jin (金波) | Yuan Sun (孙媛)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

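Neural machine translation of low-resource languages suffers from high translation difficulty and poor translation quality. For the case where no bilingual parallel corpus exists between a low-resource language and Chinese, this paper uses forward and backward pivot translation to generate parallel sentence pairs from three low-resource languages to Chinese. It then applies word-level system combination to fuse the target-language translations produced by a Transformer model and a dual learning model, and performs word selection through a confusion neural network to generate higher-quality target-language translations. Experiments show that the proposed multi-model fusion method outperforms the individual models on three low-resource translation tasks (Estonian-Chinese, Latvian-Chinese, and Romanian-Chinese), further improving the translation quality of low-resource neural machine translation.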

Ti-Reader: 基于注意力机制的藏文机器阅读理解端到端网络模型(Ti-Reader: An End-to-End Network Model Based on Attention Mechanisms for Tibetan Machine Reading Comprehension)
Yuan Sun (孙媛) | Chaofan Chen (陈超凡) | Sisi Liu (刘思思) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

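Machine reading comprehension aims to teach machines to understand a passage and answer questions about it. To address the poor performance of machine reading comprehension models on low-resource languages, this paper proposes Ti-Reader, an attention-based end-to-end network model for Tibetan machine reading comprehension. First, to encode finer-grained Tibetan text information, the paper combines syllables and words for word representation. It then uses a word-level attention mechanism to focus on keywords in the text, a re-read mechanism to capture semantic information between the passage and the question, and a self-attention mechanism to match the hidden variables of the question and answer against themselves, providing more clues for answer prediction. Finally, experimental results show that Ti-Reader improves the performance of Tibetan machine reading comprehension and also performs well on the English dataset SQuAD.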

JCapsR: 一种联合胶囊神经网络的藏语知识图谱表示学习模型(JCapsR: A Joint Capsule Neural Network for Tibetan Knowledge Graph Representation Learning)
Yuan Sun (孙媛) | Jiaya Liang (梁家亚) | Andong Chen (陈安东) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

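Knowledge graph representation learning is a key technology in natural language processing. Existing research focuses mainly on languages such as English and Chinese, while research for low-resource languages such as Tibetan is still at an exploratory stage. Building on a previously constructed Tibetan knowledge graph, this paper proposes JCapsR, a joint capsule neural network model for Tibetan knowledge graph representation learning. First, we use the TransR model to generate representations of the structural information of the Tibetan knowledge graph. Second, we use a Transformer model that fuses multi-head attention and relation attention to represent the textual descriptions of Tibetan entities. Finally, we use JCapsR to further extract the relations of triples in the semantic space of the knowledge graph and fuse the entity description information with the structural information to obtain the representation of the Tibetan knowledge graph. Experimental results show that, compared with baseline systems, JCapsR improves Tibetan knowledge graph representation learning, and the work provides a reference for extending and optimizing knowledge graph representation learning for other low-resource languages.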

面向机器阅读理解的高质量藏语数据集构建(Construction of High-quality Tibetan Dataset for Machine Reading Comprehension)
Yuan Sun (孙媛) | Sisi Liu (刘思思) | Chaofan Chen (陈超凡) | Zhengcuo Dan (旦正错) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

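Machine reading comprehension uses algorithms to have machines answer questions from a given context, thereby testing how well they understand natural language. Dataset construction is a central task in machine reading comprehension. Related models have achieved remarkable results on most popular English datasets, even surpassing human performance. For low-resource languages, however, the lack of corresponding datasets means that research is still in its infancy. Taking Tibetan as an example, this paper manually constructs a Tibetan machine reading comprehension dataset (TibetanQA) containing 20,000 question-answer pairs and 1,513 articles. The articles all come from the Yunzang (云藏) website and cover knowledge in 12 domains, including nature, culture, and education; the questions are varied in form and reasonably difficult. The dataset follows strict procedures for article collection, question construction, answer verification, answer diversity, and reasoning ability to ensure data quality, and its quality is demonstrated with a validation method based on ablating linguistic features from the input. Finally, the paper takes a first look at how three classic English reading comprehension models perform on TibetanQA; their results fall well short of human performance, indicating that Tibetan machine reading comprehension needs further exploration.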
