Zeyu Yang


2026

Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
We present the CUHKSZ Team submission to the IWSLT 2026 Simultaneous Speech Translation evaluation, targeting the main and Extra Context tracks for English→Chinese and English→German on unsegmented speech. Our system is built upon Qwen3-Omni-30B-A3B, a natively aligned audio-text LLM. Under the Constrained condition, we apply LoRA adaptation exclusively to the LLM. Specifically, we construct syntax-aware, chunk-aligned supervision from existing ASR corpora, using Qwen3-30B-Instruct to synthesize target translations. This enables the model to internalize the simultaneous read/write policy by autonomously predicting <wait> tokens at semantically incomplete boundaries. With the policy internalized, execution is delegated to a lightweight streaming agent served via vLLM. This agent feeds audio in fixed chunks, manages a bounded dialogue history, and enforces strict emission controls to minimize computation-aware delay. For the Extra Context track, contextual priors are dynamically injected into the prompt. On the official dev set, our submissions in the 0–2 s latency regime achieve 40.5 BLEU (1.95 s) for En→Zh and 27.7 BLEU (1.72 s) for En→De. In the 2–4 s regime, performance rises to 42.1 BLEU (2.16 s) and 30.5 BLEU (2.29 s), respectively.

2025

Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research focuses on sentiment and emotion detection in academic settings, and thus fails to capture the implications of real-world user data. To address this gap, we focus on user frustration in a deployed TOD system and assess the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, which achieves a 16% relative improvement in F1 score on an internal benchmark. Finally, we analyze the advantages and limitations of our methods and provide insights into the user frustration detection task for industry practitioners.