Hongzhi Li
2026
Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
Junjie Li | Xinrui Guo | Yuhao Wu | Roy Ka-Wei Lee | Hongzhi Li | Yutao Xie
Findings of the Association for Computational Linguistics: ACL 2026
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation.
Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning
Jie Deng | Hanshuang Tong | Jun Li | Shining Liang | Ning Wu | Hongzhi Li | Yutao Xie
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling, which retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of the fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
Quantifying and Improving the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Shiping Yang | Jie Wu | Wenbiao Ding | Ning Wu | Shining Liang | Ming Gong | Hongzhi Li | Hengyuan Zhang | Angel X Chang | Dongmei Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks implicit noise (spurious features). Moreover, previous studies on spurious features in LLMs are limited to specific types (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we identify and study spurious features in the RAG paradigm, a robustness issue caused by the sensitivity of LLMs to semantic-agnostic features. We then propose a novel framework, SURE, to empirically quantify the robustness of RALMs against spurious features. Beyond providing a comprehensive taxonomy and metrics for evaluation, the framework’s data synthesis pipeline facilitates training-based strategies to improve robustness. Further analysis suggests that spurious features are a widespread and challenging problem in the field of RAG. Our code is available at https://anonymous.4open.science/r/RAG-SpuriousFeatures-62B3.
2016
Cross-media Event Extraction and Recommendation
Di Lu | Clare Voss | Fangbo Tao | Xiang Ren | Rachel Guan | Rostyslav Korolov | Tongtao Zhang | Dongang Wang | Hongzhi Li | Taylor Cassidy | Heng Ji | Shih-fu Chang | Jiawei Han | William Wallace | James Hendler | Mei Si | Lance Kaplan
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations
2015
Co-authors
- Shih-Fu Chang 2
- Heng Ji 2
- Shining Liang 2
- Ning Wu 2
- Yutao Xie 2
- Tongtao Zhang 2
- Taylor Cassidy 1
- Angel X Chang 1
- Jie Deng 1
- Wenbiao Ding 1
- Ming Gong 1
- Rachel Guan 1
- Xinrui Guo 1
- Jiawei Han 1
- James Hendler 1
- Lance Kaplan 1
- Rostyslav Korolov 1
- Roy Ka-Wei Lee 1
- Jun Li (李俊) 1
- Junjie Li 1
- Di Lu 1
- Xiang Ren 1
- Mei Si 1
- Fangbo Tao 1
- Hanshuang Tong 1
- Clare Voss 1
- William Wallace 1
- Dongang Wang 1
- Jie Wu 1
- Yuhao Wu 1
- Shiping Yang 1
- Dongmei Zhang 1
- Hengyuan Zhang 1