Ruixuan Tu


2025

pdf bib
Is Semantic Chunking Worth the Computational Cost?
Renyi Qu | Ruixuan Tu | Forrest Sheng Bao
Findings of the Association for Computational Linguistics: NAACL 2025

Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.

pdf bib
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
Forrest Sheng Bao | Miaoran Li | Renyi Qu | Ge Luo | Erana Wan | Yujia Tang | Weisi Fan | Manveer Singh Tamber | Suleman Kazi | Vivek Sourabh | Mike Qi | Ruixuan Tu | Chenyu Xu | Matthew Gonzales | Ofer Mendelevitch | Amin Ahmad
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLM and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed on. Our results show GPT-4o and GPT-3.5-Turbo produce the least hallucinations. However, most state-of-the-art hallucination detection models have near 50% accuracies on FaithBench, indicating lots of room for future improvement.

2023

pdf bib
DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
Forrest Bao | Ruixuan Tu | Ge Luo | Yinfei Yang | Hebi Li | Minghui Qiu | Youbiao He | Cen Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. After being repurposed reference-freely, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.