Zilong Zheng


2024

MindAgent: Emergent Gaming Interaction
Ran Gong | Qiuyuan Huang | Xiaojian Ma | Yusuke Noda | Zane Durante | Zilong Zheng | Demetri Terzopoulos | Li Fei-Fei | Jianfeng Gao | Hoi Vo
Findings of the Association for Computational Linguistics: NAACL 2024

Large Foundation Models (LFMs) can perform complex scheduling in a multi-agent system and can coordinate agents to complete sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community lacks adequate benchmarks that support the implementation of a general multi-agent infrastructure encompassing collaboration between LFMs and human-NPCs. We propose a novel infrastructure, MindAgent, for evaluating planning and coordination capabilities in the context of gaming interaction. In particular, our infrastructure leverages an existing gaming framework to (i) act as the coordinator for a multi-agent system, (ii) collaborate with human players via instructions, and (iii) enable in-context learning based on few-shot prompting with feedback. Furthermore, we introduce “Cuisineworld”, a new gaming scenario and its related benchmark that supervises multiple agents playing the game simultaneously and measures multi-agent collaboration efficiency. We have conducted comprehensive evaluations with a new auto-metric, the Collaboration Score (CoS), for assessing collaboration efficiency. Finally, MindAgent can be deployed in real-world gaming scenarios in a customized VR version of Cuisineworld and adapted in the “Minecraft” domain. Our work involving LFMs within our new infrastructure for general-purpose scheduling and coordination can elucidate how such skills may be obtained by learning from large language corpora.

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Hongwei Liu | Zilong Zheng | Yuxuan Qiao | Haodong Duan | Zhiwei Fei | Fengzhe Zhou | Wenwei Zhang | Songyang Zhang | Dahua Lin | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2024

Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, which falls short in providing a holistic assessment of the LLMs’ math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model’s mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs’ mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem-solving skills in a bilingual context.

Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling
Shenzhi Wang | Chang Liu | Zilong Zheng | Siyuan Qi | Shuo Chen | Qisen Yang | Andrew Zhao | Chaofei Wang | Shiji Song | Gao Huang
Findings of the Association for Computational Linguistics: ACL 2024

Recent advances in large language models (LLMs) have led to significant success in using LLMs as agents. Nevertheless, a common assumption that LLMs always process honest information neglects the widespread deceptive or misleading content in human and AI-generated material. This oversight might expose LLMs to malicious manipulations. To enhance LLMs’ ability to identify and counteract deceptive information, in this paper, inspired by humans’ recursive thinking and perspective-taking, we introduce a novel cognitive framework, Recursive Contemplation (ReCon). ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes, respectively. Specifically, the first-order transition allows an LLM agent to infer others’ mental states, and the second-order transition involves understanding how others perceive the agent’s mental state. After integrating ReCon with various LLMs, extensive experimental results from the Avalon game and BigTom benchmark indicate ReCon’s efficacy in aiding LLMs to discern and maneuver around deceptive information without extra fine-tuning and data. Finally, we demonstrate ReCon’s scaling trend with model parameters, and explore the current limitations of LLMs in terms of safety and reasoning, potentially furnishing insights for subsequent research. Our project page can be found at https://shenzhi-wang.github.io/avalon_recon.

LangSuit·E: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments
Zixia Jia | Mengmeng Wang | Baichen Tong | Song-Chun Zhu | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2024

Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely on language descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuit·E, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuit·E (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents’ capacity to develop “internalized world knowledge” with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuit·E represents a significant step toward building embodied generalists in the context of language models.

Towards More Realistic Chinese Spell Checking with New Benchmark and Specialized Expert Model
Yue Wang | Zilong Zheng | Juntao Li | Zhihui Liu | Jinxiong Chang | Qishen Zhang | Zhongyi Liu | Guannan Zhang | Min Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) hold considerable promise for artificial general intelligence, given their intrinsic abilities to accomplish a wide range of open-domain tasks either independently or in tandem with specialized expert models. However, despite these capabilities, the performance of LLMs has yet to be comprehensively evaluated in realistic scenarios. To this end, in this work, we introduce a novel task, the Realistic Chinese Spell Checking (RCSC), to evaluate the effectiveness of existing methods comprehensively. In contrast to existing works that solely address Chinese character misspellings or pinyin conversions, our task aims to convert the realistic Chinese text into the corresponding correct text. The realistic Chinese text may potentially contain both Chinese misspellings and pinyin conversions. We first present the Realistic Chinese Spell Checking Benchmark (RCSCB), which consists of two subsets and contains a total of 581,657 samples. Then, we benchmark the performance of various baselines and find that all the existing methods, including instruction-based LLMs, achieve unsatisfactory results on RCSCB. To further improve the performance on RCSCB, we propose Pinyin-Enhanced Spell Checker (PESC), which is specifically designed to address pinyin-related misspellings. Experimental results demonstrate that PESC can achieve state-of-the-art performance on RCSCB. Despite the progress made, the current state-of-the-art performance is still far from satisfactory. We expect further progress on this crucial and challenging task.

Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels
Zixia Jia | Junpeng Li | Shichuan Zhang | Anji Liu | Zilong Zheng
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.

LooGLE: Can Long-Context Language Models Understand Long Contexts?
Jiaqi Li | Mengmeng Wang | Zilong Zheng | Muhan Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are typically limited to processing texts within their context window size, which has spurred significant research efforts into enhancing LLMs’ long-context understanding as well as developing high-quality benchmarks to evaluate this ability. However, prior datasets suffer from shortcomings such as short length compared to the context window of modern LLMs; outdated documents that might have data leakage problems; and an emphasis on short dependency tasks only. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark. It features documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning varying dependency ranges in diverse domains. Human annotators meticulously crafted over 1,100 high-quality question-answer (QA) pairs with thorough cross-validation for a precise assessment of LLMs’ long dependency capabilities. We conduct a comprehensive evaluation of representative LLMs on LooGLE. The results indicate that most LLMs have shockingly bad long context ability and fail to capture long dependencies in the context, even when their context window size is enough to fit the entire document. Our results shed light on enhancing the “true long-context understanding” ability of LLMs instead of merely enlarging their context window.

2023

Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models
Junpeng Li | Zixia Jia | Zilong Zheng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Document-level Relation Extraction (DocRE), which aims to extract relations from a long context, is a critical challenge in achieving fine-grained structural comprehension and generating interpretable document representations. Inspired by recent advances in in-context learning capabilities emergent from large language models (LLMs), such as ChatGPT, we aim to design an automated annotation method for DocRE with minimum human effort. Unfortunately, vanilla in-context learning is infeasible for DocRE due to the large number of predefined fine-grained relation types and the uncontrolled generations of LLMs. To tackle this issue, we propose a method integrating an LLM and a natural language inference (NLI) module to generate relation triples, thereby augmenting document-level relation datasets. We demonstrate the effectiveness of our approach by introducing an enhanced dataset known as DocGNRE, which excels in re-annotating numerous long-tail relation types. We are confident that our method holds the potential for broader applications in domain-specific relation type definitions and offers tangible benefits in advancing generalized language semantic comprehension.

Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
Yuxuan Wang | Jack Wang | Dongyan Zhao | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2023

We introduce CDBert, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBert Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters’ glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks. Moreover, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding.

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang | Zilong Zheng | Xueliang Zhao | Jinpeng Li | Yueqian Wang | Dongyan Zhao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.

Modeling Instance Interactions for Joint Information Extraction with Neural High-Order Conditional Random Field
Zixia Jia | Zhaohui Yan | Wenjuan Han | Zilong Zheng | Kewei Tu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Prior works on joint Information Extraction (IE) typically model instance (e.g., event triggers, entities, roles, relations) interactions by representation enhancement, type dependencies scoring, or global decoding. We find that the previous models generally consider binary type dependency scoring of a pair of instances, and leverage local search such as beam search to approximate global solutions. To better integrate cross-instance interactions, in this work, we introduce a joint IE framework (CRFIE) that formulates joint IE as a high-order Conditional Random Field. Specifically, we design binary factors and ternary factors to directly model interactions between not only a pair of instances but also triplets. Then, these factors are utilized to jointly predict labels of all instances. To address the intractability problem of exact high-order inference, we incorporate a high-order neural decoder that is unfolded from a mean-field variational inference method, which achieves consistent learning and inference. The experimental results show that our approach achieves consistent improvements on three IE tasks compared with our baseline and prior work.

2022

SHARP: Search-Based Adversarial Attack for Structured Prediction
Liwen Zhang | Zixia Jia | Wenjuan Han | Zilong Zheng | Kewei Tu
Findings of the Association for Computational Linguistics: NAACL 2022

Adversarial attack of structured prediction models faces various challenges such as the difficulty of perturbing discrete words, the sentence quality issue, and the sensitivity of outputs to small perturbations. In this work, we introduce SHARP, a new attack method that formulates the black-box adversarial attack as a search-based optimization problem with a specially designed objective function considering sentence fluency, meaning preservation and attacking effectiveness. Additionally, three different searching strategies are analyzed and compared, i.e., Beam Search, Metropolis-Hastings Sampling, and Hybrid Search. We demonstrate the effectiveness of our attacking strategies on two challenging structured prediction tasks: POS tagging and dependency parsing. Through automatic and human evaluations, we show that our method performs a more potent attack compared with prior methods. Moreover, the generated adversarial examples can be used to successfully boost the robustness and performance of the victim model via adversarial training.

Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)
Wenjuan Han | Zilong Zheng | Zhouhan Lin | Lifeng Jin | Yikang Shen | Yoon Kim | Kewei Tu
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)

2021

GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning
Zilong Zheng | Shuwen Qiu | Lifeng Fan | Yixin Zhu | Song-Chun Zhu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021