Mengxiao Zhu

Also published as: 孟笑


2025

Large Language Models (LLMs) have brought significant transformations to various aspects of human life and productivity. However, the heavy reliance on vast amounts of data in developing these models has resulted in a notable disadvantage for low-resource languages, such as Nuosu and others, which lack large datasets. Moreover, many LLMs exhibit significant performance discrepancies between high- and low-resource languages, thereby restricting equitable access to technological advances for all linguistic communities. To address these challenges, this paper proposes a low-resource multilingual large language model, termed VEEF-Multi-LLM, constructed through effective vocabulary expansion and parameter-efficient fine-tuning. We introduce a series of innovative methods to address challenges in low-resource languages. First, we adopt Byte-level Byte-Pair Encoding to expand the vocabulary for broader multilingual support. We separate input and output embedding weights to boost performance, and apply RoPE for long-context handling as well as RMSNorm for efficient training. To generate high-quality supervised fine-tuning (SFT) data, we use self-training and selective translation, and refine the resulting dataset with the assistance of native speakers to ensure cultural and linguistic accuracy. Our model, VEEF-Multi-LLM-8B, is trained on 600 billion tokens across 50 natural and 16 programming languages. Experimental results show that the model excels in multilingual instruction-following tasks, particularly in translation, outperforming competing models on benchmarks such as XCOPA and XStoryCloze. Although it lags slightly behind English-centric models on some tasks (e.g., m-MMLU), it prioritizes safety, reliability, and inclusivity, making it valuable for diverse linguistic communities. We open-source our models on GitHub and Hugging Face.
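The RMSNorm layer the abstract mentions for efficient training is a simplification of LayerNorm that skips mean-centering and normalizes only by the root-mean-square of the activations. A minimal NumPy sketch (an illustration of the general technique, not the paper's implementation; function and argument names are ours):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize each vector by its root-mean-square (no mean subtraction,
    # unlike LayerNorm), then apply a learned per-dimension scale.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

hidden = np.array([[3.0, 4.0]])          # one activation vector
scale = np.ones(2)                       # learned gain, initialized to 1
out = rms_norm(hidden, scale)
```

Because the mean is never subtracted and no bias is added, RMSNorm needs fewer reductions per token than LayerNorm, which is where the training-efficiency benefit comes from.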
"In the era of information explosion, large models must process vast amounts of knowledge and data every day. Given the reality that large-scale industrial training facilities are often unavailable, small-parameter models have become a necessary choice. However, the information-processing demands placed on these models far exceed their natural storage capacity, raising a core question: what should a small-parameter model remember, and what should it forget? Traditional full-memorization learning is no longer efficient under limited parameter capacity; attempting to remember everything is not only inefficient but may also impose an excessive cognitive burden and degrade the quality of reasoning. This paper aims to redefine memory strategies for large language models under limited memory resources. We first divide a model's memory into two dimensions, internal memory and external memory, and systematically investigate which knowledge should be prioritized for internalization as internal memory. On this basis, we propose a personalized memory strategy that builds corresponding alignment mechanisms for different types of internal knowledge, making the model's memory better match human preferences and reasoning needs. This strategy not only significantly enhances the comprehension and deep-reasoning abilities of small-parameter models, but also fundamentally challenges the traditional assumption that "remembering more is always better," demonstrating the great potential of strategic memory selection for improving learning efficiency. In addition, we construct training and evaluation datasets for internal memory and conduct systematic experiments on a model with only 3B parameters. Experimental results show that our method achieves the best performance on this evaluation data, even surpassing closed-source models and models as large as 70B parameters on several metrics. To advance the field, we have open-sourced the entire training strategy, the model weights, and the corresponding evaluation datasets and evaluation methods."
Text-Centric Visual Question Answering (TEC-VQA) is a critical research area that requires semantic interactions between objects and scene texts. However, most existing TEC-VQA benchmarks focus on high-resource languages like English and Chinese. Although a few works expand multilingual QA pairs in non-text-centric VQA datasets through translation, this approach encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Moreover, the open-source nature of these benchmarks and the broad sources of training data for MLLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging TEC-VQA benchmark called Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages (TVQACML), which involves eight languages, including Standard Chinese, Korean, and six minority languages. TVQACML supports a wide range of tasks, such as Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER), featuring 32,000 question-answer pairs across 8,000 images. Extensive experiments with multiple MLLMs on TVQACML demonstrate its effectiveness both for evaluating MLLMs and for enhancing multilingual TEC-VQA performance through fine-tuning.

2024

"Anxiety and depression have become common psychological disorders, and appropriate counseling is important for relieving mental and psychological stress. However, due to stigma and other reasons, many people do not receive timely counseling and treatment. With the development of artificial intelligence, the superior knowledge-integration and chain-of-thought abilities of large language models (LLMs) have made them an effective tool for psychological counseling. However, the few existing LLMs for mental health counseling typically target resource-rich languages such as English and Chinese, while the application of LLMs to counseling in low-resource languages remains under-studied. Taking Tibetan as a representative low-resource language, this paper studies the construction of a Tibetan psychological counseling dataset and methods for building a Tibetan mental health LLM. First, we collect existing high-quality Chinese psychological counseling dialogue data and process it to generate a multi-turn mental health dialogue dataset. Second, we build a Chinese-Tibetan translation tool to translate the data into Tibetan multi-turn dialogues, and apply multiple screening and filtering mechanisms to produce high-quality Tibetan multi-turn mental health dialogue data. Based on the constructed data, we perform instruction tuning on the existing general-purpose LLMs Baichuan2 and LLaMA2 to obtain a Tibetan mental health LLM, which will be open-sourced for scientific research. Finally, experiments verify the effectiveness of the released Tibetan mental health multi-turn dialogue dataset and the Tibetan mental health counseling LLM."