Xuanjun Chen


2025

A Preliminary Study of RAG for Taiwanese Historical Archives
Claire Lin | Bo-Han Feng | Xuanjun Chen | Te-Lun Yang | Hung-Yi Lee | Jyh-Shing Roger Jang
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.

2024

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
Haibin Wu | Ho-Lam Chung | Yi-Cheng Lin | Yuan-Kuei Wu | Xuanjun Chen | Yu-Chi Pai | Hsiu-Hsuan Wang | Kai-Wei Chang | Alexander Liu | Hung-yi Lee
Findings of the Association for Computational Linguistics: ACL 2024

The sound codec’s dual roles in minimizing data transmission latency and serving as tokenizers underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speakers, and audio information. However, the question of which codec achieves optimal sound information preservation remains unanswered, as in different papers, models are evaluated on their selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database, thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers mainly concentrating on signal-level comparisons. Finally, we will release codes, the leaderboard, and data to accelerate progress within the community.