Kyudan Jung

2026

Editing presentation slides is a frequent yet tedious task, ranging from creative layout design to repetitive text maintenance. While recent GUI-based agents powered by Multimodal LLMs (MLLMs) excel at tasks requiring visual perception, such as spatial layout adjustments, they often incur high computational costs and latency when handling structured, text-centric, or batch processing tasks. In this paper, we propose Talk-to-Your-Slides, a high-efficiency slide editing agent that operates via language-driven structured data manipulation rather than relying on the image modality. By leveraging the underlying object model instead of screen pixels, our approach ensures precise content modification while preserving style fidelity, addressing the limitations of OCR-based visual agents. Our system features a hierarchical architecture that bridges high-level user instructions and low-level execution code. Experiments demonstrate that for text-centric and formatting tasks, our method enables 34% faster processing, achieves 34% better instruction fidelity, and operates at an 87% lower cost compared to GUI-based baselines. Furthermore, we introduce TSBench, a human-verified benchmark dataset comprising 379 instructions, including a Hard subset designed to evaluate robustness against complex and visually dependent queries. Our code and benchmark are available at https://drive.google.com/drive/folders/1onwp5m7t3207xZu7HEBTMpdivsiOuqG8?usp=share_link

Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
Kyudan Jung | Jihwan Kim | Soyoon Kim | Jeonghoon Kim | Jaegul Choo | Cheonbok Park
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models. Our code and project page are publicly available at https://anonymous-2001-j.github.io/sommelier.github.io/.

Venues

ACL (1)
Findings (1)