Zilong Zheng

2025

pdf bib abs
Look Both Ways and No Sink: Converting LLMs into Text Encoders without Training
Ziyong Lin | Haoyi Wu | Shu Wang | Kewei Tu | Zilong Zheng | Zixia Jia
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements have demonstrated the advantage of converting pretrained large language models into powerful text encoders by enabling bidirectional attention in transformer layers. However, existing methods often require extensive training on large-scale datasets, posing challenges in low-resource, domain-specific scenarios. In this work, we show that a pretrained large language model can be converted into a strong text encoder without additional training. We first conduct a comprehensive empirical study to investigate different conversion strategies and identify the impact of the attention sink phenomenon on the performance of converted encoder models. Based on our findings, we propose a novel approach that enables bidirectional attention and suppresses the attention sink phenomenon, resulting in superior performance. Extensive experiments on multiple domains demonstrate the effectiveness of our approach. Our work provides new insights into the training-free conversion of text encoders in low-resource scenarios and contributes to the advancement of domain-specific text representation generation. Our code is available at https://github.com/bigai-nlco/Look-Both-Ways-and-No-Sink.

We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs’ reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.

As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained approach to value steering. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.

2024

pdf bib abs
Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels
Zixia Jia | Junpeng Li | Shichuan Zhang | Anji Liu | Zilong Zheng
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.

pdf bib abs
LooGLE: Can Long-Context Language Models Understand Long Contexts?
Jiaqi Li | Mengmeng Wang | Zilong Zheng | Muhan Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are typically limited to processing texts within context window size, which has spurred significant research efforts into enhancing LLMs’ long-context understanding as well as developing high-quality benchmarks to evaluate the ability. However, prior datasets suffer from short comings like short length compared to the context window of modern LLMs; outdated documents that might have data leakage problems; and an emphasis on short dependency tasks only. In this paper, we present LooGLE , a Long Context Generic Language Evaluation benchmark. It features documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning varying dependency ranges in diverse domains. Human annotators meticulously crafted over 1,100 high-quality question-answer (QA) pairs with thorough cross-validation for a most precise assessment of LLMs’ long dependency capabilities. We conduct a comprehensive evaluation of representative LLMs on LooGLE . The results indicate that most LLMs have shockingly bad long context ability and fail to capture long dependencies in the context, even when their context window size is enough to fit the entire document. Our results shed light on enhancing the “true long-context understanding” ability of LLMs instead of merely enlarging their context window.

Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16 longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available.

pdf bib abs
Varying Sentence Representations via Condition-Specified Routers
Ziyong Lin | Quansen Wang | Zixia Jia | Zilong Zheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Semantic similarity between two sentences is inherently subjective and can vary significantly based on the specific aspects emphasized. Consequently, traditional sentence encoders must be capable of generating conditioned sentence representations that account for diverse conditions or aspects. In this paper, we propose a novel yet efficient framework based on transformer-style language models that facilitates advanced conditioned sentence representation while maintaining model parameters and computational efficiency. Empirical evaluations on the Conditional Semantic Textual Similarity and Knowledge Graph Completion tasks demonstrate the superiority of our proposed framework.

Large Foundation Models (LFMs) can perform complex scheduling in a multi-agent system and can coordinate agents to complete sophisticated tasks that require extensive collaboration.However, despite the introduction of numerous gaming frameworks, the community lacks adequate benchmarks that support the implementation of a general multi-agent infrastructure encompassing collaboration between LFMs and human-NPCs. We propose a novel infrastructure—Mindagent—for evaluating planning and coordination capabilities in the context of gaming interaction. In particular, our infrastructure leverages an existing gaming framework to (i) act as the coordinator for a multi-agent system, (ii) collaborate with human players via instructions, and (iii) enable in-context learning based on few-shot prompting with feedback.Furthermore, we introduce “Cuisineworld”, a new gaming scenario and its related benchmark that supervises multiple agents playing the game simultaneously and measures multi-agent collaboration efficiency. We have conducted comprehensive evaluations with a new auto-metric Collaboration Score: CoS for assessing the collaboration efficiency. Finally, Mindagent can be deployed in real-world gaming scenarios in a customized VR version of Cuisineworld and adapted in the “Minecraft” domain. Our work involving LFMs within our new infrastructure for general-purpose scheduling and coordination can elucidate how such skills may be obtained by learning from large language corpora.

Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, which fall short in providing a holistic assessment of the LLMs’ math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model’s mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs’ mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem solving skills in a bilingual context.

Recent advances in large language models (LLMs) have led to significant success in using LLMs as agents. Nevertheless, a common assumption that LLMs always process honest information neglects the widespread deceptive or misleading content in human and AI-generated material. This oversight might expose LLMs to malicious manipulations. To enhance LLMs’ ability to identify and counteract deceptive information, in this paper, inspired by humans’ recursive thinking and perspective-taking, we introduce a novel cognitive framework, Recursive Contemplation (ReCon). ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. Specifically, the first-order allows an LLM agent to infer others’ mental states, and the second-order involves understanding how others perceive the agent’s mental state. After integrating ReCon with various LLMs, extensive experiment results from the Avalon game and BigTom benchmark indicate ReCon’s efficacy in aiding LLMs to discern and maneuver around deceptive information without extra fine-tuning and data. Finally, we demonstrate ReCon’s scaling trend with model parameters, and explore the current limitations of LLMs in terms of safety and reasoning, potentially furnishing insights for subsequent research. Our project page can be found at https://shenzhi-wang.github.io/avalon_recon.

pdf bib abs
LangSuit·E: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments
Zixia Jia | Mengmeng Wang | Baichen Tong | Song-Chun Zhu | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2024

Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely onlanguage descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuit·E, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuit·E (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents’ capacity to develop “internalized world knowledge” with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuit·E represents a significant step toward building embodied generalists in the context of language models.

Large Language Models (LLMs) hold considerable promise for artificial general intelligence, given their intrinsic abilities to accomplish a wide range of open-domain tasks either independently or in tandem with specialized expert models. However, despite these capabilities, the performance of LLMs has yet to be comprehensively evaluated in realistic scenarios. To this end, in this work, we introduce a novel task, the Realistic Chinese Spell Checking (RCSC), to evaluate the effectiveness of existing methods comprehensively. In contrast to existing works that solely address Chinese character misspellings or pinyin conversions, our task aims to convert the realistic Chinese text into the corresponding correct text. The realistic Chinese text may potentially contain both Chinese misspellings and pinyin conversions. We first present the Realistic Chinese Spell Checking Benchmark (RCSCB), which consists of two subsets and contains a total of 581,657 samples. Then, we benchmark the performance of various baselines and find that all the existing methods, including instruction-based LLMs, achieve unsatisfactory results on RCSCB. To further improve the performance on RCSCB, we propose Pinyin-Enhanced Spell Checker (PESC), which is specifically designed to address pinyin-related misspellings. Experimental results demonstrate that PESC can achieve state-of-the-art performance on RCSCB. Despite the progress made, the current state-of-the-art performance is still far from satisfactory. We expect further progress on this crucial and challenging task.

pdf bib abs
MindDial: Enhancing Conversational Agents with Theory-of-Mind for Common Ground Alignment and Negotiation
Shuwen Qiu | Mingdian Liu | Hengli Li | Song-Chun Zhu | Zilong Zheng
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Humans talk in daily conversations while aligning and negotiating the expressed meanings or common ground. Despite the impressive conversational abilities of the large generative language models, they do not consider the individual differences in contextual understanding in a shared situated environment. In this work, we propose MindDial, a novel conversational framework that can generate situated free-form responses to align and negotiate common ground. We design an explicit mind module that can track three-level beliefs – the speaker’s belief, the speaker’s prediction of the listener’s belief, and the belief gap between the first two. Then the next response is generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling can generate more human-like responses when aligning and negotiating common ground. The ablation study further validates the three-level belief design can aggregate information and improve task outcomes in both cooperative and negotiating settings.

2023

pdf bib abs
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang | Zilong Zheng | Xueliang Zhao | Jinpeng Li | Yueqian Wang | Dongyan Zhao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.

pdf bib abs
Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field
Zixia Jia | Zhaohui Yan | Wenjuan Han | Zilong Zheng | Kewei Tu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Prior works on joint Information Extraction (IE) typically model instance (e.g., event triggers, entities, roles, relations) interactions by representation enhancement, type dependencies scoring, or global decoding. We find that the previous models generally consider binary type dependency scoring of a pair of instances, and leverage local search such as beam search to approximate global solutions. To better integrate cross-instance interactions, in this work, we introduce a joint IE framework (CRFIE) that formulates joint IE as a high-order Conditional Random Field. Specifically, we design binary factors and ternary factors to directly model interactions between not only a pair of instances but also triplets. Then, these factors are utilized to jointly predict labels of all instances. To address the intractability problem of exact high-order inference, we incorporate a high-order neural decoder that is unfolded from a mean-field variational inference method, which achieves consistent learning and inference. The experimental results show that our approach achieves consistent improvements on three IE tasks compared with our baseline and prior work.

pdf bib abs
Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models
Junpeng Li | Zixia Jia | Zilong Zheng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method for DocRE with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for DocRE due to the plenty of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating an LLM and a natural language inference (NLI) module to generate relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.

pdf bib abs
Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
Yuxuan Wang | Jack Wang | Dongyan Zhao | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2023

We introduce CDBert, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters. We name the two core modules of CDBert as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters’ glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e.„ Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both modern Chinese understanding benchmark CLUE and ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task PolyMRC based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks. Moreover, our approach yields significant boosting on few-shot setting of ancient Chinese understanding.

2022

pdf bib abs
SHARP: Search-Based Adversarial Attack for Structured Prediction
Liwen Zhang | Zixia Jia | Wenjuan Han | Zilong Zheng | Kewei Tu
Findings of the Association for Computational Linguistics: NAACL 2022

Adversarial attack of structured prediction models faces various challenges such as the difficulty of perturbing discrete words, the sentence quality issue, and the sensitivity of outputs to small perturbations. In this work, we introduce SHARP, a new attack method that formulates the black-box adversarial attack as a search-based optimization problem with a specially designed objective function considering sentence fluency, meaning preservation and attacking effectiveness. Additionally, three different searching strategies are analyzed and compared, i.e., Beam Search, Metropolis-Hastings Sampling, and Hybrid Search. We demonstrate the effectiveness of our attacking strategies on two challenging structured prediction tasks: Pos-tagging and dependency parsing. Through automatic and human evaluations, we show that our method performs a more potent attack compared with pioneer arts. Moreover, the generated adversarial examples can be used to successfully boost the robustness and performance of the victim model via adversarial training.

Zilong Zheng

2025

2024

2023

2022

2021

Co-authors

Venues