Jiu Sha
Also published as: 沙九
2025
VEEF-Multi-LLM: Effective Vocabulary Expansion and Parameter Efficient Finetuning Towards Multilingual Large Language Models
Jiu Sha
|
Mengxiao Zhu
|
Chong Feng
|
Yuming Shang
Proceedings of the 31st International Conference on Computational Linguistics
Large Language Models (LLMs) have brought significant transformations to many aspects of human life and productivity. However, their heavy reliance on vast amounts of data has put low-resource languages, such as Nuosu, at a notable disadvantage, since large datasets for them are unavailable. Moreover, many LLMs exhibit significant performance gaps between high- and low-resource languages, restricting equitable access to technological advances across linguistic communities. To address these challenges, this paper proposes a low-resource multilingual large language model, termed VEEF-Multi-LLM, constructed through effective vocabulary expansion and parameter-efficient fine-tuning. We introduce a series of methods to address the challenges of low-resource languages. First, we adopt Byte-level Byte-Pair Encoding to expand the vocabulary for broader multilingual support. We separate (untie) the input and output embedding weights to boost performance, apply RoPE for long-context handling, and use RMSNorm for efficient training. To generate high-quality supervised fine-tuning (SFT) data, we use self-training and selective translation, and refine the resulting dataset with the assistance of native speakers to ensure cultural and linguistic accuracy. Our model, VEEF-Multi-LLM-8B, is trained on 600 billion tokens covering 50 natural and 16 programming languages. Experimental results show that the model excels in multilingual instruction-following tasks, particularly translation, outperforming competing models on benchmarks such as XCOPA and XStoryCloze. Although it lags slightly behind English-centric models on some tasks (e.g., m-MMLU), it prioritizes safety, reliability, and inclusivity, making it valuable for diverse linguistic communities. We open-source our models on GitHub and Hugging Face.
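The abstract names several architectural choices (byte-level BPE vocabulary expansion, untied embeddings, RoPE, RMSNorm) without implementation detail. Below is a minimal PyTorch sketch of two of them, RMSNorm and rotary position embeddings, to illustrate what they compute; the class and function names are illustrative assumptions and are not taken from the released VEEF-Multi-LLM code.

```python
# Minimal sketch of RMSNorm and rotary position embeddings (RoPE).
# Names and shapes are illustrative, not from the VEEF-Multi-LLM release.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the activations."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles.

    x: (batch, seq_len, heads, head_dim) with an even head_dim.
    """
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


if __name__ == "__main__":
    q = torch.randn(2, 16, 8, 64)       # toy query tensor
    q = apply_rope(RMSNorm(64)(q))       # normalize, then add rotary positions
    print(q.shape)                       # torch.Size([2, 16, 8, 64])
```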
2020
面向司法领域的高质量开源藏汉平行语料库构建(A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain)
Jiu Sha (沙九)
|
Luqin Zhou (周鹭琴)
|
Chong Feng (冯冲)
|
Hongzheng Li (李洪政)
|
Tianfu Zhang (张天夫)
|
Hui Hui (慧慧)
Proceedings of the 19th Chinese National Conference on Computational Linguistics
Tibetan-Chinese machine translation in the judicial domain faces a severe data-sparsity problem. This paper studies the problem from two perspectives. First, compared with the general domain, judicial Tibetan requires more rigorous logical expression and contains more specialized terminology; however, existing Tibetan resources lack judicial-domain corpora, terminology, and syntactic structures. Second, the particular lexical expressions and syntactic structures of Tibetan make it difficult for general-purpose corpus construction methods to build a Tibetan-Chinese parallel corpus. To this end, this paper proposes a lightweight construction method for judicial-domain Tibetan-Chinese parallel corpora. First, we use manual annotation to obtain a medium-scale judicial-domain Tibetan-Chinese terminology list as a prior knowledge base, avoiding the logical-expression problems and missing domain terminology that arise from crossing domain boundaries. Second, we collect authentic corpus data, such as judgment documents, from the official websites of local courts across the country, prioritizing Tibetan source data and then Chinese, so that the special lexical expressions and sentence structures are not lost by later constructing Tibetan sentences. Following these principles, we build a high-quality Tibetan-Chinese parallel corpus through web crawling, rule-based alignment checking, sentence-boundary detection, and automatic corpus cleaning. In the end, we construct a 160,000-scale Tibetan-Chinese judicial-domain corpus, and verify its high quality and robustness with multiple translation models and cross experiments. The corpus will also be open-sourced to facilitate related research.
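The abstract describes a pipeline of crawling, rule-based alignment checking, sentence-boundary detection, and automatic cleaning without giving implementation detail. Below is a minimal Python sketch of the sentence-splitting and cleaning steps under stated assumptions; the delimiter sets, length-ratio filter, and all names are illustrative and not the authors' actual rules.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch of sentence-boundary detection and simple cleaning for
# Tibetan-Chinese corpus construction. Delimiters and thresholds are assumptions.
import re

# Tibetan sentences typically end with the shad mark "།"; Chinese with 。！？
TIBETAN_SPLIT = re.compile(r"(?<=།)\s*")
CHINESE_SPLIT = re.compile(r"(?<=[。！？])\s*")


def split_sentences(text: str, lang: str) -> list[str]:
    """Split raw crawled text into candidate sentences for one language."""
    splitter = TIBETAN_SPLIT if lang == "bo" else CHINESE_SPLIT
    return [s.strip() for s in splitter.split(text) if s.strip()]


def clean_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop obviously bad Tibetan-Chinese pairs (empty, duplicate, or wildly
    mismatched in length) before terminology-based and manual checks."""
    seen, kept = set(), []
    for bo, zh in pairs:
        if not bo or not zh or (bo, zh) in seen:
            continue
        if not (0.2 <= len(bo) / max(len(zh), 1) <= 8.0):  # crude length-ratio filter
            continue
        seen.add((bo, zh))
        kept.append((bo, zh))
    return kept


if __name__ == "__main__":
    bo_doc = "ཚིག་གྲུབ་དང་པོ། ཚིག་གྲུབ་གཉིས་པ།"   # "first sentence. second sentence."
    zh_doc = "第一个句子。第二个句子。"
    pairs = list(zip(split_sentences(bo_doc, "bo"), split_sentences(zh_doc, "zh")))
    print(clean_pairs(pairs))
```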