Hongqiu Wu

2025

Game Development as Human-LLM Interaction
Jiale Hong | Hongqiu Wu | Hai Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Chat Game Engine (ChatGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as a ChatGE, we instruct it to perform the following processes in each turn: (1) P_script: configure the game script segment based on the user’s input; (2) P_code: generate the corresponding code snippet based on the game script segment; (3) P_utter: interact with the user, including guidance and feedback. We propose a data synthesis pipeline based on LLM to generate game script-code pairs and interactions from a few manually crafted seed data. We propose a three-stage training strategy following curriculum learning principles to transfer the dialogue-based LLM to our ChatGE smoothly. We construct a ChatGE for poker games as a case study and comprehensively evaluate it from two perspectives: interaction quality and code correctness.

X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents
Weiqi Wu | Hongqiu Wu | Hai Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.

Towards Enhanced Immersion and Agency for LLM-based Interactive Drama
Hongqiu Wu | Weiqi Wu | Tianyang Xu | Jiameng Zhang | Hai Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

LLM-based Interactive Drama is a novel AI-based dialogue scenario, where the user (i.e. the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion—the player’s feeling of being present in the story—and Agency—the player’s ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, while they have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot-based Reflection for LLM agents to refine their reactions to align with the player’s intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.

Driving Chinese Spelling Correction from a Fine-Grained Perspective
Linfeng Liu | Hongqiu Wu | Hai Zhao
Proceedings of the 31st International Conference on Computational Linguistics

This paper explores the task: Chinese spelling correction (CSC), from a fine-grained perspective by recognizing that existing evaluations lack nuanced typology for the spelling errors. This deficiency can create a misleading impression of model performance, incurring an “invisible” bottleneck hindering the advancement of CSC research. In this paper, we first categorize spelling errors into six types and conduct a fine-grained evaluation across a wide variety of models, including BERT-based models and LLMs. Thus, we are able to pinpoint the underlying weaknesses of existing state-of-the-art models - utilizing contextual clues and handling co-existence of multiple typos, associated to contextual errors and multi-typo errors. However, these errors occur infrequently in conventional training corpus. Therefore, we introduce new error generation methods to augment their occurrence, which can be leveraged to enhance the training of CSC models. We hope this work could provide fresh insight for future CSC research.

Evolving Chinese Spelling Correction with Corrector-Verifier Collaboration
Linfeng Liu | Hongqiu Wu | Hai Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent methods address Chinese Spelling Correction (CSC) with either BERT-based models or large language models (LLMs) independently. However, both of them face challenges. BERT-based models are efficient for this task but struggle with limited generalizability to error patterns, thus failing in open-domain CSC. LLMs are advantageous in their extensive knowledge but fall into low efficiency in character-level editing. To address this dilemma, we propose Automatic Corrector Iteration (ACI), a novel model collaboration pipeline to iteratively optimize a BERT-based corrector. This pipeline is free of human annotation, by leveraging an LLM verifier to provide useful signals for the corrector. Experimental results demonstrate that our pipeline consistently improves the model performance across iterations and significantly outperforms existing data augmentation methods, achieving comparable performance with human annotation.

Open-Theatre: An Open-Source Toolkit for LLM-based Interactive Drama
Tianyang Xu | Hongqiu Wu | Weiqi Wu | Hai Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

LLM-based Interactive Drama introduces a novel dialogue scenario in which the player immerses into a character and engages in a dramatic story by interacting with LLM agents. Despite the fact that this emerging area holds significant promise, it remains largely underexplored due to the lack of a well-designed playground to develop a complete drama. This makes a significant barrier for researchers to replicate, extend, and study such systems. Hence, we present Open-Theatre, the first open-source toolkit for experiencing and customizing LLM-based interactive drama. It refines prior work with an efficient multi-agent architecture and a hierarchical retrieval-based memory system, designed to enhance narrative coherence and realistic long-term behavior in complex interactions. In addition, we provide a highly configurable pipeline, making it easy for researchers to develop and optimize new approaches.

2024

Instruction-Driven Game Engine: A Poker Case Study
Hongqiu Wu | Xingyuan Liu | Yan Wang | Hai Zhao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The Instruction-Driven Game Engine (IDGE) project aims to democratize game development by enabling a large language model (LLM) to follow free-form game descriptions and generate game-play processes. The IDGE allows users to create games simply by natural language instructions, which significantly lowers the barrier for game development. We approach the learning process for IDGEs as a Next State Prediction task, wherein the model autoregressively predicts the game states given player actions. The computation of game states must be precise; otherwise, slight errors could corrupt the game-play experience. This is challenging because of the gap between stability and diversity. To address this, we train the IDGE in a curriculum manner that progressively increases its exposure to complex scenarios. Our initial progress lies in developing an IDGE for Poker, which not only supports a wide range of poker variants but also allows for highly individualized new poker games through natural language inputs. This work lays the groundwork for future advancements in transforming how games are created and played.

Drama is a form of storytelling inspired by human creativity, proceeding with a predefined storyline, carrying emotions and thoughts. This paper introduces LLM-based interactive drama, which endows traditional drama with an unprecedented immersion, where a person is allowed to walk into it and interact with the characters and scenes. We define this new artistic genre by 6 essential elements (plot, character, thought, diction, spectacle and interaction) and study the entire pipeline to forge a backbone drama LLM to drive the playing process, which is challenged by limited drama resources, uncontrollable narrative development, and complicated instruction following. We propose Narrative Chain to offer finer control over the narrative progression during interaction with players; Auto-Drama to synthesize drama scripts given arbitrary stories; Sparse Instruction Tuning to allow the model to follow sophisticated instructions. We manually craft 3 scripts, Detective Conan, Harry Potter, Romeo and Juliet, and design a 5-dimension principle to evaluate the drama LLM comprehensively.

Chinese Spelling Corrector Is Just a Language Learner
Lai Jiang | Hongqiu Wu | Hai Zhao | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2024

This paper emphasizes Chinese spelling correction by means of self-supervised learning, which means there are no annotated errors within the training data. Our intuition is that humans are naturally good correctors with exposure to error-free sentences, which contrasts with current unsupervised methods that strongly rely on the usage of confusion sets to produce parallel sentences. In this paper, we demonstrate that learning a spelling correction model is identical to learning a language model from error-free data alone, with decoding it in a greater search space. We propose Denoising Decoding Correction (D2C), which selectively imposes noise upon the source sentence to determine the underlying correct characters. Our method is largely inspired by the ability of language models to perform correction, including both BERT-based models and large language models (LLMs). We show that the self-supervised learning manner generally outperforms the confusion set in specific domains because it bypasses the need to introduce error characters to the training data which can impair the error patterns not included in the introduced error characters.

Attack Named Entity Recognition by Entity Boundary Interference
Yifei Yang | Hongqiu Wu | Hai Zhao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Named Entity Recognition (NER) is a cornerstone natural language processing task while its robustness has been given little attention. This paper rethinks the principles of the conventional text attack, as they can easily violate the label consistency between the original and adversarial NER samples. This is due to the fine-grained nature of NER, as even minor word changes in the sentence can result in the emergence or mutation of any entity, producing invalid adversarial samples. To this end, we propose a novel one-word modification NER attack based on a key insight, NER models are always vulnerable to the boundary position of an entity to make their decision. We thus strategically insert a new boundary into the sentence and trigger the victim model to make a wrong recognition either on this boundary word or on other words in the sentence. We call this attack Virtual Boundary Attack (ViBA), which is shown to be remarkably effective when attacking both English and Chinese models with a 70%-90% attack success rate on state-of-the-art language models, and also significantly faster than previous methods.

Unveiling Vulnerability of Self-Attention
Khai Jiet Liong | Hongqiu Wu | Hai Zhao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes, which poses a significant threat to real-world systems. While previous studies directly focus on manipulating word inputs, they are limited by their means of generating adversarial samples, lacking generalization to versatile real-world attacks. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. (1) We propose a powerful perturbation technique named ‘HackAttend,’ which perturbs the attention scores within the SA matrices via meticulously crafted attention masks. We show that state-of-the-art PLMs fall into heavy vulnerability, with minor attention perturbations (1%) resulting in a very high attack success rate (98%). Our paper extends the conventional text attack of word perturbations to more general structural perturbations. (2) We introduce ‘S-Attend,’ a novel smoothing technique that effectively makes SA robust via structural perturbations. We empirically demonstrate that this simple yet effective technique achieves robust performance on par with adversarial training when facing various text attackers.

2023

Rethinking Masked Language Modeling for Chinese Spelling Correction
Hongqiu Wu | Shaohua Zhang | Yuchen Zhang | Hai Zhao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model. Through empirical analysis, we find that fine-tuning BERT tends to over-fit the error model while under-fit the language model, resulting in poor generalization to out-of-distribution error patterns. Given that BERT is the backbone of most CSC models, this phenomenon has a significant negative impact. To address this issue, we are releasing a multi-domain benchmark LEMON, with higher quality and diversity than existing benchmarks, to allow a comprehensive assessment of the open domain generalization of CSC models. Then, we demonstrate that a very simple strategy – randomly masking 20% non-error tokens from the input sequence during fine-tuning – is sufficient for learning a much better language model without sacrificing the error model. This technique can be applied to any model architecture and achieves new state-of-the-art results on SIGHAN, ECSpell, and LEMON.
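The masking strategy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_non_error_tokens`, the `[MASK]` symbol, and the per-token sampling scheme are assumptions made for illustration. It assumes the source and target sentences are aligned token by token, so a position is "non-error" when both tokens match.

```python
import random
from typing import List, Optional

MASK = "[MASK]"  # assumed mask symbol; the real tokenizer's mask token may differ


def mask_non_error_tokens(
    source: List[str],
    target: List[str],
    mask_rate: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly mask tokens of `source` that are NOT spelling errors.

    A position counts as an error when source and target tokens differ.
    Error positions are always left untouched, so the model still sees
    the typo it has to correct.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be aligned token by token")
    rng = rng or random.Random()
    masked = []
    for src_tok, tgt_tok in zip(source, target):
        is_error = src_tok != tgt_tok
        # Each non-error token is masked independently with probability mask_rate.
        if not is_error and rng.random() < mask_rate:
            masked.append(MASK)
        else:
            masked.append(src_tok)
    return masked
```

During fine-tuning, the masked sequence would replace the original input while the target stays unchanged, so the model has to recover masked correct tokens from context in addition to fixing the typos.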

Empower Nested Boolean Logic via Self-Supervised Curriculum Learning
Hongqiu Wu | Linfeng Liu | Hai Zhao | Min Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Beyond the great cognitive powers showcased by language models, it is crucial to scrutinize whether their reasoning capabilities stem from strong generalization or merely exposure to relevant data. As opposed to constructing increasingly complex logic, this paper probes into the boolean logic, the root capability of a logical reasoner. We find that any pre-trained language models even including large language models only behave like a random selector in the face of multi-nested boolean logic, a task that humans can handle with ease. To empower language models with this fundamental capability, this paper proposes a new self-supervised learning method Curriculum Logical Reasoning (Clr), where we augment the training data with nested boolean logic chain step-by-step, and program the training from simpler logical patterns gradually to harder ones. This new training paradigm allows language models to effectively generalize to much harder and longer-hop logic, which can hardly be learned through naive training. Furthermore, we show that boolean logic is a great foundation for improving the subsequent general logical tasks.
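A rough sketch of how nested boolean logic could be layered onto a labelled statement, and how samples could be ordered from shallow to deep nesting, is given below. The templates ("It is true that", "It is false that"), the function names `nest_boolean` and `build_curriculum`, and the uniform choice of operators are all assumptions for illustration; the abstract does not specify the actual templates or schedule used in Clr.

```python
import random
from typing import List, Optional, Tuple


def nest_boolean(statement: str, label: bool, ops: List[bool]) -> Tuple[str, bool]:
    """Wrap `statement` in one boolean layer per entry of `ops`.

    ops[i] == True adds "It is true that (...)" and keeps the label;
    ops[i] == False adds "It is false that (...)" and flips the label.
    Layers are applied from the innermost outwards.
    """
    text = statement
    for op in ops:
        prefix = "It is true that" if op else "It is false that"
        text = f"{prefix} ({text})"
        if not op:
            label = not label
    return text, label


def build_curriculum(
    statement: str,
    label: bool,
    max_depth: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str, bool]]:
    """Return one sample per nesting depth, ordered from depth 1 to max_depth."""
    rng = rng or random.Random()
    stages = []
    for depth in range(1, max_depth + 1):
        # Hypothetical choice: each layer is true/false with equal probability.
        ops = [rng.random() < 0.5 for _ in range(depth)]
        text, new_label = nest_boolean(statement, label, ops)
        stages.append((depth, text, new_label))
    return stages
```

Training on the stages in order would expose a model to single-layer logic first and to longer chains later, which is the general idea of a simple-to-hard curriculum.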

2022

Semantic-Preserving Adversarial Code Comprehension
Yiyang Li | Hongqiu Wu | Hai Zhao
Proceedings of the 29th International Conference on Computational Linguistics

Based on the tremendous success of pre-trained language models (PrLMs) for source code comprehension tasks, current literature studies either ways to further improve the performance (generalization) of PrLMs, or their robustness against adversarial attacks. However, they have to compromise on the trade-off between the two aspects and none of them consider improving both sides in an effective and practical way. To fill this gap, we propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks while forcing the model to predict the correct labels under these worst cases. Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code.

Multiple pre-training objectives fill the vacancy of the understanding capability of single-objective language modeling, which serves the ultimate purpose of pre-trained language models (PrLMs), generalizing well on a mass of scenarios. However, learning multiple training objectives in a single model is challenging due to the unknown relative significance as well as the potential contrariety between them. Empirical studies have shown that the current objective sampling in an ad-hoc manual setting makes the learned language representation barely converge to the desired optimum. Thus, we propose MOMETAS, a novel adaptive sampler based on meta-learning, which learns the latent sampling pattern on arbitrary pre-training objectives. Such a design is lightweight with negligible additional training overhead. To validate our approach, we adopt five objectives and conduct continual pre-training with BERT-base and BERT-large models, where MOMETAS demonstrates universal performance gain over other rule-based sampling strategies on 14 natural language processing tasks.
