Baoxun Wang

2025

RAIDEN Benchmark: Evaluating Role-playing Conversational Agents with Measurement-Driven Custom Dialogues
Bowen Wu | Kaili Sun | Ziwei Bai | Ying Li | Baoxun Wang
Proceedings of the 31st International Conference on Computational Linguistics

As Large-scale Language Models (LLMs) advance, the development of engaging Role-Playing Conversational Agents (RPCAs) has gained prominence. Despite this progress, there is a notable absence of benchmarks designed around dialogues, rather than question-answering formats, to assess the effectiveness of RPCA interactions. This paper introduces the RAIDEN benchmark, containing a comprehensive dataset specifically developed for RPCA evaluation, comprising over 40,000 multi-turn utterances across 135 characters. The benchmark focuses on assessing particular dimensions at different stages of a conversation, facilitated through interactions conducted by annotators. This approach allows the evaluation phase to concentrate on specific response dimensions, and thus subjectivity in dialogue evaluation is reduced. To further enhance objectivity, evaluators compare responses from two different models rather than assessing a single response in isolation. Besides, we introduce RPCAJudger, a specialized judging LLM tailored for automatic RPCA evaluation. The evaluations conducted by RPCAJudger closely mirror human judgments, and its API-free methodology serves to prevent potential data leakage. All the models and all non-private leaderboard data will be made publicly available.

Proactive dialogue systems aim to empower chatbots with the capability of leading conversations towards specific targets, thereby enhancing user engagement and service autonomy. Existing systems typically target pre-defined keywords or entities, neglecting user attributes and preferences implicit in dialogue history, hindering the development of long-term user intimacy. To address these challenges, we take a radical step towards building a more human-like conversational agent by integrating proactive dialogue systems with long-term memory into a unified framework. Specifically, we define a novel task named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation, designed to steer dialogues towards relevant historical topics at the right time. The effectiveness of our dataset and models is validated through both automatic and human evaluations. We release the open-source framework and dataset at https://github.com/FrontierLabs/MapDia.

Large Language Models (LLMs) have demonstrated significant advancements in various fields, notably in Role-Playing Conversational Agents (RPCAs). However, when confronted with role-specific professional inquiries, LLM-based RPCAs tend to underperform because they overemphasize the conversational abilities of characters rather than effectively invoking and integrating relevant expert knowledge. This often results in inaccurate responses. We refer to this phenomenon as “Knowledge Misalignment,” which underscores the limitations of RPCAs in integrating expert knowledge. To mitigate this issue, we introduce an Anchoring-Guidance Fine-Tuning (AnGFT) Framework into the RPCAs’ training process. This involves initially linking the Anchoring-Based System Prompt (ASP) with the LLM’s relevant expert domains through diverse prompt construction strategies and supervised fine-tuning (SFT). Following the role-play enriched SFT, the integration of ASP enables LLMs to better associate with relevant expert knowledge, thus enhancing their response capabilities in role-specific expert domains. Moreover, we develop four comprehensive metrics (helpfulness, thoroughness, credibility, and feasibility) to evaluate the proficiency of RPCAs in responding to professional questions. Our method was tested across four professional fields, and the experimental outcomes suggest that the proposed AnGFT Framework substantially improves the RPCAs’ performance in handling role-specific professional queries, while preserving their robust role-playing abilities.

2023

Multimodal Sentiment Analysis leverages multimodal signals to detect the sentiment of a speaker. Previous approaches concentrate on performing multimodal fusion and representation learning based on general knowledge obtained from pretrained models, which neglects the effect of domain-specific knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI) for multimodal sentiment analysis, where specific-knowledge representations for each modality can be learned together with general knowledge representations via knowledge injection based on an adapter architecture. In addition, ConKI uses a hierarchical contrastive learning procedure performed between knowledge types within every single modality, across modalities within each sample, and across samples to facilitate the effective learning of the proposed representations, hence improving multimodal sentiment predictions. The experiments on three popular multimodal sentiment analysis benchmarks show that ConKI outperforms all prior methods on a variety of performance metrics.

2020

Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation
Bowen Wu | Huan Zhang | MengYuan Li | Zongsheng Wang | Qihang Feng | Junhong Huang | Baoxun Wang
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal usability. However, the online deployment of BERT is often blocked by its large number of parameters and high computational cost. Plenty of studies show that knowledge distillation is efficient in transferring the knowledge from BERT into a model with fewer parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation; such methodologies lose the general semantic knowledge that gives BERT its universal usability. In this paper, we propose a sentence representation approximating oriented distillation framework that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task. In addition, our model can be combined with task-specific distillation procedures. The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods or even much larger models, i.e., ELMo, with efficiency well improved.

Leveraging persona information of users in Neural Response Generators (NRG) to perform personalized conversations has been considered an attractive and important topic in the research of conversational agents over the past few years. Despite the promising progress achieved by recent studies in this field, persona information tends to be incorporated into neural networks in the form of user embeddings, with the expectation that the persona can be involved via End-to-End learning. This paper proposes to incorporate the personality-related characteristics of human conversations into variational response generators, by designing a specific conditional variational autoencoder based deep model with two new regularization terms added to the loss function, so as to guide the optimization towards generating responses that are both persona-aware and relevant. In addition, to reasonably evaluate the performance of various persona modeling approaches, this paper further presents three direct persona-oriented metrics from different perspectives. The experimental results show that our proposed methodology can notably improve the performance of persona-aware response generation, and that the metrics are reasonable for evaluating the results.

2019

Improving the Robustness of Deep Reading Comprehension Models by Leveraging Syntax Prior
Bowen Wu | Haoyang Huang | Zongsheng Wang | Qihang Feng | Jingsong Yu | Baoxun Wang
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Despite the remarkable progress on Machine Reading Comprehension (MRC) with the help of open-source datasets, recent studies indicate that most current MRC systems suffer from weak robustness against adversarial samples. To address this issue, we attempt to leverage sentence syntax in the answer prediction process, which previously took account only of phrase-level semantics. Furthermore, to better utilize the sentence syntax and improve robustness, we propose a Syntactic Leveraging Network, which is designed to deal with adversarial samples by exploiting the syntactic elements of a question. The experimental results indicate that our method is promising for improving the generalization and robustness of MRC models against the influence of adversarial samples, with performance well maintained.

2018

A Prospective-Performance Network to Alleviate Myopia in Beam Search for Response Generation
Zongsheng Wang | Yunzhi Bai | Bowen Wu | Zhen Xu | Zhuoran Wang | Baoxun Wang
Proceedings of the 27th International Conference on Computational Linguistics

Generative dialog models usually adopt beam search as the inference method to generate responses. However, small-width beam search only focuses on the limited current optima. This deficiency, termed myopic bias, ultimately suppresses the diversity and probability of generated responses. Although increasing the beam width mitigates the myopic bias, it also proportionally slows down inference. To alleviate the myopic bias in small-width beam search, this paper proposes a Prospective-Performance Network (PPN) to predict the future reward of a given partially generated response, where the future reward is defined as the expectation of the partial response appearing in the top-ranked responses given by a larger-width beam search. Enhanced by PPN, the decoder can promote the results with great potential during the beam search phase. The experimental results on both Chinese and English corpora show that our method is promising for increasing the quality and diversity of generated responses, with inference efficiency well maintained.
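The myopic bias described in this abstract can be illustrated with a minimal sketch. It shows only the bias itself, not the PPN, which the abstract does not specify in enough detail to reproduce. The toy model `TOY_MODEL` and its probabilities are invented for illustration, and `beam_search` is a hypothetical helper, not code from the paper.

```python
import math

# Hypothetical toy model: next-token probabilities conditioned on the prefix.
# All numbers are invented for illustration only.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, width, length):
    """Return the best (sequence, log_prob) found with the given beam width."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the `width` highest-scoring partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:width]
    return beams[0]


# Width 1 commits to "a" (0.6) and can reach at most 0.6 * 0.5 = 0.30.
narrow_seq, narrow_score = beam_search(TOY_MODEL, width=1, length=2)
# Width 2 keeps "b" alive and finds "b x" with 0.4 * 0.9 = 0.36.
wide_seq, wide_score = beam_search(TOY_MODEL, width=2, length=2)
```

In this toy setting the narrow beam discards the prefix that leads to the most probable full sequence, which is the situation a predictor of future reward is meant to correct without paying for a wider beam.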

LSDSCC: a Large Scale Domain-Specific Conversational Corpus for Response Generation with Diversity Oriented Evaluation Metrics
Zhen Xu | Nan Jiang | Bingquan Liu | Wenge Rong | Bowen Wu | Baoxun Wang | Zhuoran Wang | Xiaolong Wang
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

It has been proven that automatic conversational agents can be built up using the End-to-End Neural Response Generation (NRG) framework, and such a data-driven methodology requires a large number of dialog pairs for model training and reasonable evaluation metrics for testing. This paper proposes a Large Scale Domain-Specific Conversational Corpus (LSDSCC) composed of high-quality query-response pairs extracted from the domain-specific online forum, with thorough preprocessing and cleansing procedures. Also, a testing set, including multiple diverse responses annotated for each query, is constructed, and on this basis, the metrics for measuring the diversity of generated results are further presented. We evaluate the performances of neural dialog models with the widely applied diversity boosting strategies on the proposed dataset. The experimental results have shown that our proposed corpus can be taken as a new benchmark dataset for the NRG task, and the presented metrics are promising to guide the optimization of NRG models by quantifying the diversity of the generated responses reasonably.

2017

This paper presents a Generative Adversarial Network (GAN) to model single-turn short-text conversations, which trains a sequence-to-sequence (Seq2Seq) network for response generation simultaneously with a discriminative classifier that measures the differences between human-produced responses and machine-generated ones. In addition, the proposed method introduces an approximate embedding layer to solve the non-differentiability problem caused by the sampling-based output decoding procedure in the Seq2Seq generative model. The GAN setup provides an effective way to avoid noninformative responses (a.k.a. “safe responses”), which are frequently observed in traditional neural response generators. The experimental results show that the proposed approach significantly outperforms existing neural response generation models in diversity metrics, with slight increases in relevance scores as well, when evaluated on both a Mandarin corpus and an English corpus.
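One common way to sidestep non-differentiable sampling is to feed the discriminator a probability-weighted average of word embeddings instead of the embedding of a sampled word. The sketch below illustrates that general idea under stated assumptions: the abstract does not give the layer's exact formulation, so `approximate_embedding`, `softmax`, and the toy `EMBEDDINGS` table are hypothetical and should not be read as the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def approximate_embedding(logits, embedding_table):
    """Return the expected embedding under the decoder's output distribution.

    Unlike picking one word by sampling or argmax, this weighted sum is a
    smooth function of the logits, so gradients can flow through it.
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of three words with 2-dimensional embeddings (invented values).
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft_vector = approximate_embedding([2.0, 0.5, 0.1], EMBEDDINGS)
uniform_vector = approximate_embedding([0.0, 0.0, 0.0], EMBEDDINGS)
```

With uniform logits the result is simply the mean of the embedding rows, which is a quick sanity check on the weighting.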

Predicting Users’ Negative Feedbacks in Multi-Turn Human-Computer Dialogues
Xin Wang | Jianan Wang | Yuanchao Liu | Xiaolong Wang | Zhuoran Wang | Baoxun Wang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

User experience is essential for human-computer dialogue systems. However, it is impractical to ask users to provide explicit feedback when the agents’ responses displease them. Therefore, in this paper, we explore predicting users’ imminent dissatisfaction caused by intelligent agents by analysing the existing utterances in the dialogue sessions. To our knowledge, this is the first work focusing on this task. Several possible factors that trigger negative emotions are modelled. A relation sequence model (RSM) is proposed to encode the sequence of appropriateness of the current response with respect to the earlier utterances. The experimental results show that the proposed structure is more effective in modelling emotional risk (the possibility of negative feedback) than existing conversation modelling approaches. In addition, strategies for obtaining distance supervision data for pre-training are also discussed in this work. Balanced sampling with respect to the last response in the distance supervision data is shown to be reliable for data augmentation.

For practical chatbots, one of the essential factors for improving user experience is the capability of customizing the talking style of the agents, that is, making chatbots provide responses that meet users’ preferences on language styles, topics, etc. To address this issue, this paper proposes to incorporate linguistic biases, which are implicitly involved in the conversation corpora generated by human groups in the Social Network Services (SNS), into the encoder-decoder based response generator. By attaching a specially designed neural component to dynamically control the impact of linguistic biases in response generation, a Group Linguistic Bias Aware Neural Response Generation (GLBA-NRG) model is eventually presented. The experimental results on the dataset from the Chinese SNS show that the proposed architecture outperforms the current response generating models by producing both meaningful and vivid responses with customized styles.

2016

Ranking Responses Oriented to Conversational Relevance in Chat-bots
Bowen Wu | Baoxun Wang | Hui Xue
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

For automatic chatting systems, it is a great challenge to reply to a given query by considering the conversation history, rather than based on the query only. This paper proposes a deep neural network to address the context-aware response ranking problem by end-to-end learning, so as to help select conversationally relevant candidates. By combining the multi-column convolutional layer and the recurrent layer, our model is able to model the semantics of the utterance sequence by grasping the semantic clue within the conversation, on the basis of the effective representation for each sentence. In particular, the network utilizes attention pooling to further emphasize the importance of essential words in conversations, so that the representations of contexts tend to be more meaningful and the performance of candidate ranking is notably improved. Meanwhile, due to the adoption of attention pooling, it is possible to visualize the semantic clues. The experimental results on a large amount of conversation data from social media show that our approach is promising for quantifying the conversational relevance of responses, and indicate its good potential for building practical IR-based chat-bots.
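Attention pooling, as mentioned in this abstract, can be sketched as a softmax-weighted sum of word vectors, where the weights double as a per-word importance signal that can be visualized. The scoring rule here (dot product with a single query vector) is an assumption chosen for brevity; the abstract does not state how the paper scores words, and `attention_pool`, `WORDS`, and `QUERY` are hypothetical names with invented values.

```python
import math


def attention_pool(word_vectors, query):
    """Pool word vectors into one vector using softmax attention weights.

    Scores are dot products with `query` (an assumed scoring rule).
    Returns the pooled vector and the attention weights.
    """
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vectors[0])
    pooled = [
        sum(weight * vec[d] for weight, vec in zip(weights, word_vectors))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy 2-dimensional word vectors and a query vector (invented values).
WORDS = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
QUERY = [2.0, 0.0]
pooled_vector, attention_weights = attention_pool(WORDS, QUERY)
```

The weights sum to one, and the word most aligned with the query receives the largest weight, so plotting `attention_weights` against the words gives the kind of visualization the abstract alludes to.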
