Ziyin Zhang

2026

Vision–language models (VLMs) have progressed rapidly, but Tibetan remains largely underserved due to the lack of infrastructure for reproducible training and evaluation. To help address this gap, we introduce FTibSuite, a resource-centric foundation for Tibetan VLM research that provides an end-to-end training-and-evaluation workflow and includes human-verified multimodal annotations, partially filling a long-standing shortage of Tibetan multimodal resources. FTibSuite comprises FTibData, FTibBench, and a reproducible baseline model, FTibVLM, built on Qwen3-VL-8B-Instruct. FTibVLM adopts a three-stage adaptation pipeline consisting of Tibetan continual pretraining, image–text alignment, and multimodal instruction tuning. For systematic evaluation, FTibBench adapts five established multimodal benchmarks to Tibetan and offers a reproducible evaluation protocol to support consistent comparisons across models. Specifically, FTibBench includes Tibetan versions of MMBench, MME, POPE, BinaryVQA, and COREVQA. Experiments on FTibBench demonstrate that FTibVLM consistently improves Tibetan multimodal performance. For instance, FTibVLM attains 76.01 accuracy on BinaryVQA, indicating that Tibetan performance can be competitive with high-resource settings on this diagnostic task. We also observe substantial gains on other benchmarks, including an improvement on MMBench (dev) from 42.97 to 67.78 and an increase in POPE-random accuracy from 47.53 to 80.56, underscoring the practical value of the proposed workflow and resources.

Extending large language models (LLMs) to low-resource languages often incurs an “alignment tax”: improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan–Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
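
As a rough illustration of the idea in this abstract, the sketch below scores a group of sampled outputs by embedding similarity to a reference and converts those scores into group-relative advantages, which is the normalization step commonly associated with GRPO. The function names, the toy 2-dimensional embeddings, and the choice of cosine similarity as the reward are all assumptions made for illustration; the abstract does not specify the reward model or the exact normalization used.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_rewards(reference: Sequence[float],
                     candidates: List[Sequence[float]]) -> List[float]:
    """Hypothetical embedding-level reward: similarity to the reference."""
    return [cosine(reference, c) for c in candidates]


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy embeddings standing in for a real sentence encoder (assumed).
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rewards = semantic_rewards(reference, candidates)
advantages = group_relative_advantages(rewards)
```

Because the advantage depends only on how a sample compares with the others in its group, a paraphrase that preserves meaning can be rewarded even when its surface tokens differ from the reference.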

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
Jun Wang | Ziyin Zhang | Rui Wang | Hang Yu | Peng Di | Rui Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event unification engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2025

Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model’s prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context Speculative Decoding, which is a training-free dynamic length policy for speculative decoding systems that adaptively determines the lengths of draft sequences by referring to the draft entropy. Experimental results on mainstream SD benchmarks as well as reasoning-heavy benchmarks demonstrate the superior performance of SVIP, achieving up to 17% speedup on MT-Bench at 8K context compared with fixed draft lengths, and 22% speedup for QwQ in long-form reasoning.
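
The entropy-guided length policy described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's exact rule: it assumes a plain entropy threshold, greedy drafting, and a hypothetical `next_dist` callable that returns the draft model's next-token distribution. The threshold value and the decision to stop before emitting the uncertain token are also assumptions.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_with_entropy_stop(
    next_dist: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    max_len: int = 8,
    threshold: float = 1.0,
) -> List[int]:
    """Draft tokens greedily until the draft model becomes uncertain.

    Drafting stops as soon as the next-token entropy exceeds `threshold`
    (a hypothetical value), or when `max_len` tokens have been drafted.
    """
    draft: List[int] = []
    for _ in range(max_len):
        probs = next_dist(prefix + draft)
        if entropy(probs) > threshold:
            break  # high entropy suggests a low acceptance rate
        draft.append(max(range(len(probs)), key=probs.__getitem__))
    return draft


def toy_next_dist(context: List[int]) -> Sequence[float]:
    """Toy draft model: confident for short contexts, uniform afterwards."""
    if len(context) < 3:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


short_draft = draft_with_entropy_stop(toy_next_dist, prefix=[5])
full_draft = draft_with_entropy_stop(toy_next_dist, prefix=[])
```

With the toy model, drafting continues while the distribution is peaked and halts once it turns uniform, so the draft length adapts to the context instead of being fixed in advance.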

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.

Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
Zeli Su | Ziyin Zhang | Guixian Xu | Jianing Liu | Xu Han | Ting Zhang | Yushuang Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.

GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
Ziyin Zhang | Hang Yu | Sage Lee | Peng Di | Jianguo Li | Rui Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Programming languages possess rich semantic information, such as data flow, that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa (Graph Aligned Large Language Models). GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with six different baseline LLMs ranging in size from 350M to 14B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3 and Qwen2.5-Coder.

2024

In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability (MELA), with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgments with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to an MCC of 23 in a completely unrelated language, Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks.
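
For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) mentioned in this abstract can be computed from a binary confusion matrix as sketched below. The figure of 23 is presumably the coefficient scaled by 100 (that is, 0.23); that reading, the function name, and the convention of returning 0.0 when the denominator vanishes are assumptions of this sketch.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for binary classification.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction). Returns 0.0 when any row or column
    of the confusion matrix is empty, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
```

Unlike plain accuracy, the coefficient stays near zero for a classifier that ignores its input, which makes it a natural fit for acceptability data with uneven label balance.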