Lei Yu

2026

Long context large language models exhibit the “lost in the middle” problem, where models struggle to effectively utilize information located in the middle of long contexts. Although existing workflow-based long context methods (e.g., RAG) alleviate this problem and perform well on specific datasets, can their effectiveness generalize to all types of datasets? In this work, we systematically investigate the cross-dataset generalization of long context methods. Our evaluation reveals that these methods are not universally effective. Such substantial performance variability underscores the risks of performance degradation associated with the indiscriminate application of long context methods. Investigating the cause of these failures, we find that the intrinsic decomposition mechanisms of long context methods hinder context dependency modeling, causing these methods to suffer performance declines on documents with strong context dependency. To address this issue, we propose CoDaR (Context Dependency-aware Routing), a training-free adaptive routing strategy. By analyzing the context dependency strength of documents, CoDaR adaptively invokes long context methods, thereby significantly enhancing their overall robustness across different types of datasets.

Large embedding models have become the backbone of modern retrieval systems, offering strong semantic representations at the cost of substantial storage and computation. While recent work explores quantizing embeddings into discrete document identifiers for generative retrieval, most existing approaches rely on Euclidean quantization, which is poorly aligned with the angular geometry induced by contrastive embedding training and often requires long identifier sequences to preserve semantic fidelity. In this work, we propose Hyperspherical Householder Quantization (HHQ), a geometry-aware distillation method that compresses large embeddings into short discrete representations via iterative Householder transformations on the unit hypersphere. By explicitly preserving cosine similarity at each step, HHQ distills semantic structure into compact identifiers that remain faithful to the original embedding space. To support reliable generation of these identifiers, we introduce constrained supervised fine-tuning and tree-aware dynamic masking to enforce structural validity during training and inference. Experiments on NQ and MS MARCO show that HHQ achieves competitive or superior retrieval performance using only five tokens per document, substantially reducing decoding cost while retaining strong semantic retrieval accuracy.
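The geometric property the abstract relies on can be illustrated in isolation: a Householder reflection is an orthogonal map, so it keeps unit vectors on the hypersphere and leaves cosine similarity unchanged. The sketch below shows only that property. It is not the HHQ algorithm, whose codebook construction and iteration scheme the abstract does not specify. The helper names (`normalize`, `dot`, `householder_reflect`) and the toy vectors are assumptions made for illustration.

```python
import math


def normalize(x):
    """Scale x to unit length so it lies on the unit hypersphere."""
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x]


def dot(x, y):
    """Inner product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(x, y))


def householder_reflect(x, a, b):
    """Apply to x the Householder reflection that maps unit vector a onto unit vector b.

    The reflection is H = I - 2 v v^T / (v^T v) with v = a - b.
    """
    v = [ai - bi for ai, bi in zip(a, b)]
    vv = dot(v, v)
    if vv < 1e-12:  # a and b coincide, so the reflection is the identity
        return list(x)
    scale = 2.0 * dot(v, x) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]


# Toy vectors (hypothetical values, not taken from the paper).
a = normalize([1.0, 2.0, 2.0])
b = [0.0, 0.0, 1.0]
x = normalize([0.3, -0.5, 0.8])
y = normalize([-0.2, 0.9, 0.4])

hx = householder_reflect(x, a, b)
hy = householder_reflect(y, a, b)
cos_before = dot(x, y)
cos_after = dot(hx, hy)  # matches cos_before up to floating-point error
```

Because the reflection is orthogonal, applying several of them in sequence also preserves cosine similarity, which is consistent with the abstract's claim of preserving it at each step.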

Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model’s inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment.
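The two metrics reported above are commonly computed from per-prompt toxicity scores over several sampled generations. The sketch below assumes the usual definitions: EMT is the mean over prompts of the highest toxicity score among that prompt's generations, and TP is the fraction of prompts with at least one generation scoring at or above a threshold. The abstract states neither the threshold nor the number of generations per prompt, so the 0.5 threshold, the function names, and the toy scores are all assumptions.

```python
def expected_maximum_toxicity(scores_per_prompt):
    """Mean, over prompts, of the highest toxicity score among that prompt's generations."""
    return sum(max(scores) for scores in scores_per_prompt) / len(scores_per_prompt)


def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation scoring at or above the threshold.

    The 0.5 default is an assumed convention, not a value given in the abstract.
    """
    hits = sum(1 for scores in scores_per_prompt if max(scores) >= threshold)
    return hits / len(scores_per_prompt)


# Hypothetical toxicity scores: 4 prompts, 2 generations each (not data from the paper).
toy_scores = [
    [0.10, 0.60],
    [0.20, 0.30],
    [0.90, 0.40],
    [0.05, 0.10],
]

emt = expected_maximum_toxicity(toy_scores)
tp = toxicity_probability(toy_scores)
```

Both functions take the maximum per prompt, so they measure worst-case behavior across samples: a single toxic generation is enough to count against a prompt.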

Co-authors

Yiqi Du 1

Bin Wu 1

Venues

ACL 2
Findings 1