Jinkwon Hwang
2026
See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval
Mingyu Jeon | Sungjin Han | Jinkwon Hwang | Minchol Kwon | Jonghee Kim | Junyeong Kim
Findings of the Association for Computational Linguistics: EACL 2026
Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
2025
Learning to See through Sound: From VggCaps to Multi2Cap for Richer Automated Audio Captioning
Sangyeon Cho | Mingi Kim | Jinkwon Hwang | Jaehoon Go | Minuk Ma | Sunjae Yoon | Junyeong Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Automated Audio Captioning (AAC) aims to generate natural language descriptions of audio content, enabling machines to interpret and communicate complex acoustic scenes. However, current AAC datasets often suffer from short and simplistic captions, limiting model expressiveness and semantic depth. To address this, we introduce **VggCaps**, a new multi-modal dataset that pairs audio with corresponding video and leverages large language models (LLMs) to generate rich, descriptive captions. VggCaps significantly outperforms existing benchmarks in caption length, lexical diversity, and human-rated quality. Furthermore, we propose **Multi2Cap**, a novel AAC framework that learns audio-visual representations through an AV-grounding module during pre-training and reconstructs visual semantics using audio alone at inference. This enables visually grounded captioning in audio-only scenarios. Experimental results on Clotho and AudioCaps demonstrate that Multi2Cap achieves state-of-the-art performance across multiple metrics, validating the effectiveness of cross-modal supervision and LLM-based generation in advancing AAC.