Yongxiang Li

Also published as: 永翔


2026

Large Language Models have shown strong performance in Machine Translation, yet they often suffer from paraphrasing errors, omissions, or hallucinations when the input contains translation-specific elements (e.g., URLs, slang, and idioms) that require strict preservation or controlled transformation, undermining the reliability of critical details. We propose CEMT, a Controllable Element-Oriented Machine Translation framework inspired by the analysis–strategy–generation paradigm in human translation. CEMT first employs an Element Detection Module to identify translation-specific elements, and then introduces a Translation Module that decomposes the translation process into linguistically grounded analysis, strategy formulation, and final generation, thereby guiding the reliable translation of these elements. We further introduce a CoT Judge model during training that provides step-wise supervision over the accuracy and consistency of the translation process. On the WMT23/24 Chinese–English benchmarks, CEMT improves performance over existing Machine Translation models while significantly reducing element-level constraint violations.
Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the number of parameters, while TARPO reduces reasoning token consumption by 67.5% compared to GRPO, advancing LLM capabilities in efficient tabular reasoning.
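To make the idea of a mixed reward with a decaying region term more concrete, here is a minimal Python sketch. It is not the TARPO reward from the paper: the abstract does not give the formula, so the cell-level F1 score, the exponential decay schedule, the penalty rule, and every name and default value below (`region_f1`, `mixed_reward`, `region_weight`, `decay`, `penalty`) are assumptions chosen only for illustration.

```python
def region_f1(pred_cells, gold_cells):
    """F1 overlap between predicted and gold table cells (hypothetical region score)."""
    pred, gold = set(pred_cells), set(gold_cells)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mixed_reward(pred_cells, gold_cells, answer_correct, step,
                 region_weight=1.0, decay=0.99, penalty=0.5):
    """Combine region accuracy and answer correctness into one scalar reward.

    Assumed design, not taken from the paper:
    - the region term is scaled by region_weight * decay**step, so it
      matters less as training progresses;
    - a fixed penalty is subtracted when the answer is correct but the
      cited region has no overlap with the gold region (inconsistency).
    """
    weight = region_weight * decay ** step
    region = region_f1(pred_cells, gold_cells)
    reward = weight * region + (1.0 if answer_correct else 0.0)
    if answer_correct and region == 0.0:
        reward -= penalty
    return reward


# Cells are (row, column) pairs.
early = mixed_reward([(0, 1)], [(0, 1)], True, step=0)
late = mixed_reward([(0, 1)], [(0, 1)], True, step=500)
inconsistent = mixed_reward([(5, 5)], [(0, 1)], True, step=0)
```

With these assumed defaults, the same correct rollout earns less from its region term late in training than early on, and a correct answer backed by the wrong region is rewarded less than a fully consistent one.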
Supervised Fine-Tuning (SFT) accelerates the development of task-specific large language models (LLMs), but the resulting proliferation of fine-tuned models incurs substantial memory overhead. Delta compression addresses this by retaining a single pre-trained LLM with multiple compressed delta weights. However, existing methods fail on models fine-tuned with large-scale datasets. We find that larger SFT data scale amplifies delta parameter magnitude, singular values, and entropy, exacerbating compression errors. To tackle this, we propose D-QRELO (Delta Compression via Quantization and Residual Low-Rank), a novel training- and data-free delta compression method. It combines coarse-grained one-bit quantization, which captures the dominant structure of the delta, with a compensated residual low-rank approximation that recovers fine-grained details from the smaller residual error. Experiments on various LLMs spanning dense and MoE architectures across multiple domains under this challenging setting demonstrate that D-QRELO outperforms existing methods. Moreover, we establish key design principles for delta compression through extensive empirical analysis, demonstrating how task difficulty, architecture, and layer positioning create predictable patterns that can guide optimal compression strategies in production systems.
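The two-stage idea in this abstract (one-bit quantization of the delta, then a low-rank approximation of what is left) can be illustrated with a small pure-Python sketch. This is not the D-QRELO implementation: the per-matrix scale (mean absolute value), the rank-1 power iteration, and all function names are assumptions made for illustration, and the "compensated" step mentioned in the abstract is not modelled here.

```python
import math
import random


def one_bit_quantize(delta):
    """Keep only the sign of each entry, scaled by the mean absolute value."""
    count = sum(len(row) for row in delta)
    alpha = sum(abs(x) for row in delta for x in row) / count
    return [[alpha if x >= 0 else -alpha for x in row] for row in delta]


def subtract(a, b):
    """Element-wise difference of two matrices of the same shape."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def frobenius(matrix):
    """Frobenius norm of a matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def rank_one(residual, iters=50):
    """Rank-1 approximation s * u v^T of the residual via power iteration."""
    rows, cols = len(residual), len(residual[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    for _ in range(iters):
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        v = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
    s = sum(u[i] * residual[i][j] * v[j] for i in range(rows) for j in range(cols))
    return [[s * u[i] * v[j] for j in range(cols)] for i in range(rows)]


# Toy "delta" (fine-tuned weights minus base weights); values are synthetic.
rng = random.Random(0)
delta = [[rng.gauss(0.0, 0.02) for _ in range(8)] for _ in range(6)]

quantized = one_bit_quantize(delta)
residual = subtract(delta, quantized)
low_rank = rank_one(residual)

err_quant_only = frobenius(residual)
err_with_low_rank = frobenius(subtract(residual, low_rank))
```

In this toy setting the reconstruction `quantized + low_rank` is never worse in Frobenius norm than quantization alone, which is the motivation for spending a few extra parameters on the residual; a higher rank would recover more of it.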

2025

In this paper, we present a novel pipeline for the XLLM Shared Task-III: Large Language Model for Structural Reasoning (LLM-SR). Our pipeline addresses key challenges in automatic process-reward training data construction, such as high manual annotation costs, limited accuracy of large models in structured data processing, and dependency on auxiliary information for validation. To overcome these limitations, we first decompose the construction process into extraction and validation phases. Leveraging model-generated annotations, we produce pseudo-labeled data and iteratively refine model performance. Second, by analyzing structured data patterns, we encode structural constraints into a rule-based module and fine-tune the model with Group Relative Policy Optimization (GRPO), significantly improving structured data extraction success rates. Finally, we train the model to generate critical responses that assess evidence-conclusion relationships, thus enhancing validation reliability. Experimental results demonstrate that our pipeline outperforms models with an order of magnitude more parameters and achieves the first position on the task.
This paper presents our system developed for SemEval-2025 Task 8, which focuses on table question answering (TQA). TQA is challenging because of the characteristics of real-world tabular data, such as large size, incomplete column semantics, and entity ambiguity. To address these issues, we propose a large language model (LLM)-powered and programming-based framework, named Flow-of-Table-Reasoning. We introduce a table schema that integrates verbalized structure and semantics for query decomposition and programming, enabling a holistic understanding of tables and the ability to process large tables. We design a multi-step schema linking plan to derive a focused table schema that retains only information relevant to the query, aiming to eliminate ambiguity and reduce hallucinations. Furthermore, we incorporate the reasoning workflow into an iterative thinking architecture, allowing incremental cycles of thinking, reasoning and reflection. Our system achieves first place on both TQA and Lite TQA subtasks.
This paper presents the approach we employed in SemEval-2025 Task 11: “Bridging the Gap in Text-Based Emotion Detection.” The core objective of this shared task is emotion perception, focusing on determining the emotion the speaker is likely expressing when uttering a sentence or short text fragment, as perceived by the majority. In this task, we applied a prompt optimization strategy based on in-context learning, combined with data augmentation and ensemble voting techniques, to significantly enhance the model’s performance. Through these optimizations, the model demonstrated improved accuracy and stability in emotion detection. Ultimately, in both Track A (Multi-label Emotion Detection) and Track B (Emotion Intensity Prediction), our approach achieved top-3 rankings across multiple languages, showcasing the effectiveness and cross-lingual adaptability of our method.
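As an illustration of ensemble voting for multi-label emotion detection, the sketch below keeps a label only when more than half of the ensemble members predict it. The abstract does not specify the voting rule, so the strict-majority threshold, the function name `majority_vote`, and the example label sets are assumptions for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the labels predicted by more than half of the ensemble members.

    predictions: list of label collections, one per model or prompt variant.
    """
    counts = Counter(label for labels in predictions for label in set(labels))
    threshold = len(predictions) / 2
    return sorted(label for label, votes in counts.items() if votes > threshold)


# Hypothetical outputs of five prompt variants for one sentence.
runs = [
    {"joy", "surprise"},
    {"joy"},
    {"joy", "fear"},
    {"surprise", "joy"},
    {"sadness"},
]
voted = majority_vote(runs)
```

Here only `joy` clears the majority threshold (four of five votes), so labels that a single run produced in isolation are dropped, which is one way an ensemble can stabilise predictions.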

2024

This paper describes the participation of team “TeleAI” in the third International Ancient Chinese Language Information Processing Evaluation (EvalHan24). The competition comprises a joint task of sentence segmentation and punctuation, categorized into open and closed tracks based on the models and data used. In the final evaluation, our system achieved significantly better results than the baseline. Specifically, in the closed track, we obtained an F1 score of 0.8885 on sentence segmentation and an F1 score of 0.7129 on sentence punctuation.
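This paper describes the system submitted by team “TeleAI” to the CCL2024 Classical Chinese Historical Event Type Extraction evaluation task (CHED2024). The task aims to automatically identify event trigger words and event types in ancient texts, with event type classification divided into coarse-grained and fine-grained classification. To improve the performance of historical event type extraction from classical Chinese, we combine large and small models and adopt a semi-supervised self-training approach. In the final evaluation, we scored 0.763 on trigger word identification, 0.842 on coarse-grained event type classification, and 0.779 on fine-grained event type classification, with an overall score of 0.791, ranking first on every individual task and on the overall score.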
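This technical report details the methods and results of our team's participation in the Fourth Chinese Spatial Semantic Understanding Evaluation (SpaCE2024). SpaCE2024 aims to comprehensively test machines' understanding of Chinese spatial semantics and comprises five different tasks: spatial information entity recognition, spatial information entity recognition, spatial information anomaly recognition, spatial orientation reasoning, and recognition of synonymous spatial expressions with different forms. Our team combined carefully designed prompts with fine-tuning to elicit the spatial semantic understanding ability of large language models, building an efficient spatial semantic understanding system. In the final evaluation, we achieved an accuracy of 0.8947 on the spatial information entity recognition questions, 0.9364 on the spatial information entity recognition questions, 0.8480 on the spatial information anomaly recognition questions, 0.3471 on the spatial orientation reasoning questions, and 0.5631 on the questions on synonymous spatial expressions with different forms, with an overall test-set accuracy of 0.6024, ranking first.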