Jihwan Kim
2026
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
Taehee Kim | Seungbin Yang | Jihwan Kim | Jaegul Choo
Findings of the Association for Computational Linguistics: ACL 2026
Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://anonymous.4open.science/r/Adaptive-Table-Retrieval.
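The contrast the abstract draws between fixed top-k retrieval and query-dependent retrieval can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not specify the thresholding rule, so the relative-score cutoff below, along with the names `top_k`, `adaptive`, and `ratio`, is a hypothetical stand-in that assumes positive similarity scores. The sliding-window reranking step is not covered.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Fixed-size baseline: always return the k highest-scoring tables."""
    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    return ranked[:k]


def adaptive(scores: Dict[str, float], ratio: float = 0.8) -> List[str]:
    """Return every table scoring at least `ratio` times the best score.

    Hypothetical rule for illustration only; assumes positive scores.
    """
    if not scores:
        return []
    cutoff = ratio * max(scores.values())
    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    return [t for t in ranked if scores[t] >= cutoff]


# Made-up similarity scores for two queries with different table needs.
single_table_query = {"singer": 0.9, "concert": 0.3, "stadium": 0.2}
multi_table_query = {"singer": 0.9, "concert": 0.85, "stadium": 0.8, "city": 0.1}

print(top_k(single_table_query, 2), adaptive(single_table_query))
print(top_k(multi_table_query, 2), adaptive(multi_table_query))
```

With k fixed at 2, the baseline over-retrieves for the first query and under-retrieves for the second, while the cutoff-based variant returns one and three tables respectively.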
Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
Kyudan Jung | Jihwan Kim | Soyoon Kim | Jeonghoon Kim | Jaegul Choo | Cheonbok Park
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models. Our code and project page are publicly available at https://anonymous-2001-j.github.io/sommelier.github.io/.