Xiaojie Wang - ACL Anthology

Xiaojie Wang

2025

CoTD-PO: Chain-of-Thought Distillation with Preference Optimization
Lujie Niu | Haochen Sun | Fangkun Zhao | Sheng Chen | Zimeng Bai | Jiawei Zhang | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: EMNLP 2025

Chain-of-Thought (CoT) distillation has emerged as a promising paradigm to enhance the reasoning ability of small language models by imitating the reasoning and outputs of larger teacher models. However, existing approaches suffer from a critical limitation: a distribution mismatch between teacher-generated training trajectories and the student model’s own generative distribution. This mismatch leads to exposure bias during inference and often induces mode collapse or mode averaging, thereby degrading the student model’s generative diversity and robustness. To address these issues, we propose CoTD-PO (Chain-of-Thought Distillation with Preference Optimization), a reinforcement learning framework that shifts the training paradigm from passive imitation to active trajectory exploration. Instead of forcing the student to imitate exact teacher traces, our method enables the student to sample its own answer paths. To support training with non-open-source teacher models, we approximate the teacher’s output distribution through preference-based scoring. Furthermore, we adopt an offline iterative training procedure that enables stable and efficient optimization. Experiments on diverse open-ended generation tasks demonstrate that CoTD-PO significantly outperforms standard CoT distillation baselines, achieving higher output quality while mitigating mode collapse and preserving semantic diversity.

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Haochen Sun | Shuwen Zhang | Lujie Niu | Lei Ren | Hao Xu | Hao Fu | Fangkun Zhao | Caixia Yuan | Xiaojie Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

Multimodal Aspect-Based Sentiment Analysis under Conditional Relation
Xinjing Liu | Ruifan Li | Shuqin Ye | Guangwei Zhang | Xiaojie Wang
Proceedings of the 31st International Conference on Computational Linguistics

Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect terms from text-image pairs and identify their sentiments. Previous methods are based on the premise that the image contains the objects referred by the aspects within the text. However, this condition cannot always be met, resulting in a suboptimal performance. In this paper, we propose COnditional Relation based Sentiment Analysis framework (CORSA). Specifically, we design a conditional relation detector (CRD) to mitigate the impact of the unmet conditional image. Moreover, we design a visual object localizer (VOL) to locate the exact condition-related visual regions associated with the aspects. With CRD and VOL, our CORSA framework takes a multi-task form. In addition, to effectively learn CORSA we conduct two types of annotations. One is the conditional relation using a pretrained referring expression comprehension model; the other is the bounding boxes of visual objects by a pretrained object detection model. Experiments on our built C-MABSA dataset show that CORSA consistently outperforms existing methods. The code and data are available at https://github.com/Liuxj-Anya/CORSA.

基于特征融合的大模型生成文本作者检测
Xiying Zhao | Zimeng Bai | Yan Zhang | Caixia Yuan | Xiaojie Wang
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

"大语言模型在高效生成文本的同时也带来了文本滥用的问题,如何有效地区分不同大模型生成的文本成为了关键的挑战。为了解决这个问题,本文首先构建了一个面向多分类的大模型生成文本检测任务的数据集LGT-AA,包含7个领域的人类和10个常用大模型生成的94k条文本;其次,本文提出了一种提取不同大模型生成文本的全局性区分性特征的方案,并与分布特征进行融合构建文本检测器,提升了对生成文本的检测能力。实验结果表明,本文提出的方法在不同模型组合下和不同生成模型类别下都取得了更优的性能。"

Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models
Yuheng Lu | Bingshuo Qian | Caixia Yuan | Huixing Jiang | Xiaojie Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) exhibit remarkable capabilities in natural language processing but face catastrophic forgetting when learning new tasks, where adaptation to a new domain leads to a substantial decline in performance on previous tasks. In this paper, we propose Controlled LoRA (CLoRA), a subspace regularization method on LoRA structure. Aiming to reduce the scale of output change while introducing minimal constraint on model capacity, CLoRA imposes constraints on the direction of updating matrix’s null space. Experimental results on one-stage LLM finetuning tasks and continual learning settings highlight the superiority of CLoRA as an effective parameter-efficient finetuning method with catastrophic forgetting mitigating. Further investigation for model parameters indicates that CLoRA effectively balances the trade-off between model capacity and degree of forgetting. The code for implementing CLoRA will be publicly available.

2024

Dual-Stage Multi-Task Syntax-Oriented Pre-Training for Syntactically Controlled Paraphrase Generation
Hongxu Liu | Xiaojie Wang | Jiashen Sun | Ke Zeng | Wan Guanglu
Findings of the Association for Computational Linguistics: ACL 2024

Syntactically Controlled Paraphrase Generation (SCPG), which aims at generating sentences having syntactic structures resembling given exemplars, is attracting more research efforts in recent years. We took an empirical survey on previous SCPG datasets and methods and found three tacitly approved while seldom mentioned intrinsic shortcomings/trade-offs in terms of data obtaining, task formulation, and pre-training strategies. As a mitigation to these shortcomings, we proposed a novel Dual-Stage Multi-Task (DSMT) pre-training scheme, involving a series of structure-oriented and syntax-oriented tasks, which, in our opinion, gives sequential text models the ability of com-prehending intrinsically non-sequential structures like Linearized Constituency Trees (LCTs), understanding the underlying syntactics, and even generating them by parsing sentences. We performed further pre-training of the popular T5 model on these novel tasks and fine-tuned the trained model on every possible variant of SCPG task in literature, finding that our models significantly outperformed (up to 10+ BLEU-4) previous state-of-the-art methods. Finally, we carried out ablation studies which demonstrated the effectiveness of our DSMT methods and emphasized on the SCPG performance gains compared to vanilla T5 models, especially on hard samples or under few-shot settings.

Phased Instruction Fine-Tuning for Large Language Models
Wei Pang | Chuan Zhou | Xiao-Hua Zhou | Xiaojie Wang
Findings of the Association for Computational Linguistics: ACL 2024

Instruction Fine-Tuning, a method enhancing pre-trained language models’ capabilities from mere next-word prediction to complex instruction following, often employs a one-off training approach on diverse instruction dataset. However, this method may not effectively enhance models’ adherence to instructions due to the simultaneous handling of varying instruction complexities. To address this, we propose a novel phased instruction fine-tuning (Phased IFT) method, grounded in the hypothesis of progressive alignment, which posits that the transition of a pre-trained language model from simple next-word prediction to sophisticated instruction following is a gradual learning process. Specifically, we obtain the score of difficulty for each instruction via GPT-4, stratify the instruction data into subsets of increasing difficulty, and sequentially uptrain on these subsets using the standard supervised loss. Through extensive experiments on the pre-trained models Llama-2 7B/13B, and Mistral-7B using the 52K Alpaca instruction data, we demonstrate that Phased IFT significantly surpasses traditional one-off instruction fine-tuning (One-off IFT) method in win rate, empirically validating the progressive alignment hypothesis. Our findings suggest that Phased IFT offers a simple yet effective pathway for elevating the instruction-following capabilities of pre-trained language models.

KG-Adapter: Enabling Knowledge Graph Integration in Large Language Models through Parameter-Efficient Fine-Tuning
Shiyu Tian | Yangyang Luo | Tianze Xu | Caixia Yuan | Huixing Jiang | Chen Wei | Xiaojie Wang
Findings of the Association for Computational Linguistics: ACL 2024

Although large language models (LLMs) show remarkable capabilities and generalizability across various tasks, they are criticized for lack of expertise. One promising solution is to combine knowledge graphs (KGs) with LLMs, and recent studies focus on integrating KGs into LLMs through prompt-based methods. However, these approaches fail to use the structural information of the KGs, suffer from the problem of knowledge conflict, and over-reliance on super LLMs. To address these challenges, we propose KG-Adapter, a parameter-level KG integration method based on parameter-efficient fine-tuning (PEFT). Specifically, we introduce a novel adapter structure designed for decoder-only LLMs, which can encode KGs from both node-centered and relation-centered perspectives, and then perform joint reasoning with LLMs to generate responses end-to-end. Experiments with diverse models on four datasets for two different tasks all demonstrate significant improvements. With only 28M parameters trained, we make the 7B-parameter LLM outperform the previous full-parameter fine-tuned state-of-the-art method and comparable to the prompt-based ChatGPT methods.

2023

Explicit Alignment and Many-to-many Entailment Based Reasoning for Conversational Machine Reading
Yangyang Luo | Shiyu Tian | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

Conversational Machine Reading (CMR) requires answering a user’s initial question through multi-turn dialogue interactions based on a given document. Although there exist many effective methods, they largely neglected the alignment between the document and the user-provided information, which significantly affects the intermediate decision-making and subsequent follow-up question generation. To address this issue, we propose a pipeline framework that (1) aligns the aforementioned two sides in an explicit way, (2) makes decisions using a lightweight many-to-many entailment reasoning module, and (3) directly generates follow-up questions based on the document and previously asked questions. Our proposed method achieves state-of-the-art in micro-accuracy and ranks the first place on the public leaderboard of the CMR benchmark dataset ShARC.

Multimodal Recommendation Dialog with Subjective Preference: A New Challenge and Benchmark
Yuxing Long | Binyuan Hui | Caixia Yuan | Fei Huang | Yongbin Li | Xiaojie Wang
Findings of the Association for Computational Linguistics: ACL 2023

Existing multimodal task-oriented dialog data fails to demonstrate the diverse expressions of user subjective preferences and recommendation acts in the real-life shopping scenario. This paper introduces a new dataset SURE (Multimodal Recommendation Dialog with Subjective Preference), which contains 12K shopping dialogs in complex store scenes. The data is built in two phases with human annotations to ensure quality and diversity. SURE is well-annotated with subjective preferences and recommendation acts proposed by sales experts. A comprehensive analysis is given to reveal the distinguishing features of SURE. Three benchmark tasks are then proposed on the data to evaluate the capability of multimodal recommendation agents. Basing on the SURE, we propose a baseline model, powered by a state-of-the-art multimodal model, for these tasks.

Improving Situated Conversational Agents with Step-by-Step Multi-modal Logic Reasoning
Yuxing Long | Huibin Zhang | Binyuan Hui | Zhenglu Yang | Caixia Yuan | Xiaojie Wang | Fei Huang | Yongbin Li
Proceedings of the Eleventh Dialog System Technology Challenge

To fulfill complex user requirements in a situated conversational scenario, the agent needs to conduct step-by-step multi-modal logic reasoning, which includes locating objects, querying information and searching objects. However, existing methods omit this multi-step procedure and therefore constitutes the risk of shortcuts when making predictions. For example, they may directly copy the information from the dialogue history or simply use the textual description without perform visual reasoning. To address this issue and further boost the system performance, we apply the dual process theory to plug a reasoner into the original transformer based model for step-by-step reasoning. When system 2 completes multi-step reasoning, its output is regarded as final prediction. Our proposed method achieved the 1st rank on the summing scores across all four DSTC-11 SIMMC 2.1 sub-tasks.

USSA: A Unified Table Filling Scheme for Structured Sentiment Analysis
Zepeng Zhai | Hao Chen | Ruifan Li | Xiaojie Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most previous studies on Structured Sentiment Analysis (SSA) have cast it as a problem of bi-lexical dependency parsing, which cannot address issues of overlap and discontinuity simultaneously. In this paper, we propose a niche-targeting and effective solution. Our approach involves creating a novel bi-lexical dependency parsing graph, which is then converted to a unified 2D table-filling scheme, namely USSA. The proposed scheme resolves the kernel bottleneck of previous SSA methods by utilizing 13 different types of relations. In addition, to closely collaborate with the USSA scheme, we have developed a model that includes a proposed bi-axial attention module to effectively capture the correlations among relations in the rows and columns of the table. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed framework, outperforming state-of-the-art methods consistently.

2022

A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots
Sai Zhang | Yuwei Hu | Yuchuan Wu | Jiaman Wu | Yongbin Li | Jian Sun | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: ACL 2022

A slot value might be provided segment by segment over multiple-turn interactions in a dialog, especially for some important information such as phone numbers and names. It is a common phenomenon in daily life, but little attention has been paid to it in previous work. To fill the gap, this paper defines a new task named Sub-Slot based Task-Oriented Dialog (SSTOD) and builds a Chinese dialog dataset SSD for boosting research on SSTOD. The dataset includes a total of 40K dialogs and 500K utterances from four different domains: Chinese names, phone numbers, ID numbers and license plate numbers. The data is well annotated with sub-slot values, slot values, dialog states and actions. We find some new linguistic phenomena and interactive manners in SSTOD which raise critical challenges of building dialog agents for the task. We test three state-of-the-art dialog models on SSTOD and find they cannot handle the task well on any of the four domains. We also investigate an improved model by involving slot knowledge in a plug-in manner. More work should be done to meet the new challenges raised from SSTOD which widely exists in real-life applications. The dataset and code are publicly available via https://github.com/shunjiu/SSTOD.

Co-VQA : Answering by Interactive Sub Question Sequence
Ruonan Wang | Yuxi Qian | Fangxiang Feng | Xiaojie Wang | Huixing Jiang
Findings of the Association for Computational Linguistics: ACL 2022

Most existing approaches to Visual Question Answering (VQA) answer questions directly, however, people usually decompose a complex question into a sequence of simple sub questions and finally obtain the answer to the original question after answering the sub question sequence(SQS). By simulating the process, this paper proposes a conversation-based VQA (Co-VQA) framework, which consists of three components: Questioner, Oracle, and Answerer. Questioner raises the sub questions using an extending HRED model, and Oracle answers them one-by-one. An Adaptive Chain Visual Reasoning Model (ACVRM) for Answerer is also proposed, where the question-answer pair is used to update the visual representation sequentially. To perform supervised learning for each model, we introduce a well-designed method to build a SQS for each question on VQA 2.0 and VQA-CP v2 datasets. Experimental results show that our method achieves state-of-the-art on VQA-CP v2. Further analyses show that SQSs help build direct semantic connections between questions and images, provide question-adaptive variable-length reasoning chains, and with explicit interpretability as well as error traceability.

Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng | Tao Kong | Ya Jing | Jiaan Wang | Xiaojie Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously for utilizing the relation between them is a promising way to improve both. However, the problem of distinct inputs, as well as building connections between them in a single model, brings challenges to the design and training of the joint model. To address the problems, we propose a unified model for REG and REC, named UniRef. It unifies these two tasks with the carefully-designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via the image cross-attention and region cross-attention. Additionally, IRTF could generate pseudo input regions for the REC task to enable a uniform way for sharing the identical representation space across the REC and REG. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train UniRef model on multi-granular corpora. The VMLM and TRP are directly related to REG and REC, respectively, but could help each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.

COM-MRC: A COntext-Masked Machine Reading Comprehension Framework for Aspect Sentiment Triplet Extraction
Zepeng Zhai | Hao Chen | Fangxiang Feng | Ruifan Li | Xiaojie Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Aspect Sentiment Triplet Extraction (ASTE) aims to extract sentiment triplets from sentences, which was recently formalized as an effective machine reading comprehension (MRC) based framework. However, when facing multiple aspect terms, the MRC-based methods could fail due to the interference from other aspect terms. In this paper, we propose a novel COntext-Masked MRC (COM-MRC) framework for ASTE. Our COM-MRC framework comprises three closely-related components: a context augmentation strategy, a discriminative model, and an inference method. Specifically, a context augmentation strategy is designed by enumerating all masked contexts for each aspect term. The discriminative model comprises four modules, i.e., aspect and opinion extraction modules, sentiment classification and aspect detection modules. In addition, a two-stage inference method first extracts all aspects and then identifies their opinions and sentiment through iteratively masking the aspects. Extensive experimental results on benchmark datasets show the effectiveness of our proposed COM-MRC framework, which outperforms state-of-the-art methods consistently.

A Simple Model for Distantly Supervised Relation Extraction
Ziqin Rao | Fangxiang Feng | Ruifan Li | Xiaojie Wang
Proceedings of the 29th International Conference on Computational Linguistics

Distantly supervised relation extraction is challenging due to the noise within data. Recent methods focus on exploiting bag representations based on deep neural networks with complex de-noising scheme to achieve remarkable performance. In this paper, we propose a simple but effective BERT-based Graph convolutional network Model (i.e., BGM). Our BGM comprises of an instance embedding module and a bag representation module. The instance embedding module uses a BERT-based pretrained language model to extract key information from each instance. The bag representaion module constructs the corresponding bag graph then apply a convolutional operation to obtain the bag representation. Our BGM model achieves a considerable improvement on two benchmark datasets, i.e., NYT10 and GDS.

Learn to Adapt for Generalized Zero-Shot Text Classification
Yiwen Zhang | Caixia Yuan | Xiaojie Wang | Ziwei Bai | Yongbin Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generalized zero-shot text classification aims to classify textual instances from both previously seen classes and incrementally emerging unseen classes. Most existing methods generalize poorly since the learned parameters are only optimal for seen classes rather than for both classes, and the parameters keep stationary in predicting procedures. To address these challenges, we propose a novel Learn to Adapt (LTA) network using a variant meta-learning framework. Specifically, LTA trains an adaptive classifier by using both seen and virtual unseen classes to simulate a generalized zero-shot learning (GZSL) scenario in accordance with the test time, and simultaneously learns to calibrate the class prototypes and sample representations to make the learned parameters adaptive to incoming unseen classes. We claim that the proposed model is capable of representing all prototypes and samples from both classes to a more consistent distribution in a global space. Extensive experiments on five text classification datasets show that our model outperforms several competitive previous approaches by large margins. The code and the whole datasets are available at https://github.com/Quareia/LTA.

Enhanced Multi-Channel Graph Convolutional Network for Aspect Sentiment Triplet Extraction
Hao Chen | Zepeng Zhai | Fangxiang Feng | Ruifan Li | Xiaojie Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Aspect Sentiment Triplet Extraction (ASTE) is an emerging sentiment analysis task. Most of the existing studies focus on devising a new tagging scheme that enables the model to extract the sentiment triplets in an end-to-end fashion. However, these methods ignore the relations between words for ASTE task. In this paper, we propose an Enhanced Multi-Channel Graph Convolutional Network model (EMC-GCN) to fully utilize the relations between words. Specifically, we first define ten types of relations for ASTE task, and then adopt a biaffine attention module to embed these relations as an adjacent tensor between words in a sentence. After that, our EMC-GCN transforms the sentence into a multi-channel graph by treating words and the relation adjacent tensor as nodes and edges, respectively. Thus, relation-aware node representations can be learnt. Furthermore, we consider diverse linguistic features to enhance our EMC-GCN model. Finally, we design an effective refining strategy on EMC-GCN for word-pair representation refinement, which considers the implicit results of aspect and opinion extraction when determining whether word pairs match or not. Extensive experimental results on the benchmark datasets demonstrate that the effectiveness and robustness of our proposed model, which outperforms state-of-the-art methods significantly.

2021

Task-Oriented Clustering for Dialogues
Chenxu Lv | Hengtong Lu | Shuyu Lei | Huixing Jiang | Wei Wu | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: EMNLP 2021

A reliable clustering algorithm for task-oriented dialogues can help developer analysis and define dialogue tasks efficiently. It is challenging to directly apply prior normal text clustering algorithms for task-oriented dialogues, due to the inherent differences between them, such as coreference, omission and diversity expression. In this paper, we propose a Dialogue Task Clustering Network model for task-oriented clustering. The proposed model combines context-aware utterance representations and cross-dialogue utterance cluster representations for task-oriented dialogues clustering. An iterative end-to-end training strategy is utilized for dialogue clustering and representation learning jointly. Experiments on three public datasets show that our model significantly outperform strong baselines in all metrics.

DialogueTRM: Exploring Multi-Modal Emotional Dynamics in a Conversation
Yuzhao Mao | Guang Liu | Xiaojie Wang | Weiguo Gao | Xuan Li
Findings of the Association for Computational Linguistics: EMNLP 2021

Emotion dynamics formulates principles explaining the emotional fluctuation during conversations. Recent studies explore the emotion dynamics from the self and inter-personal dependencies, however, ignoring the temporal and spatial dependencies in the situation of multi-modal conversations. To address the issue, we extend the concept of emotion dynamics to multi-modal settings and propose a Dialogue Transformer for simultaneously modeling the intra-modal and inter-modal emotion dynamics. Specifically, the intra-modal emotion dynamics is to not only capture the temporal dependency but also satisfy the context preference in every single modality. The inter-modal emotional dynamics aims at handling multi-grained spatial dependency across all modalities. Our models outperform the state-of-the-art with a margin of 4%-16% for most of the metrics on three benchmark datasets.

Grouped-Attention for Content-Selection and Content-Plan Generation
Bayu Distiawan Trisedya | Xiaojie Wang | Jianzhong Qi | Rui Zhang | Qingjun Cui
Findings of the Association for Computational Linguistics: EMNLP 2021

Content-planning is an essential part of data-to-text generation to determine the order of data mentioned in generated texts. Recent neural data-to-text generation models employ Pointer Networks to explicitly learn content-plan given a set of attributes as input. They use LSTM to encode the input, which assumes a sequential relationship in the input. This may be sub-optimal to encode a set of attributes, where the attributes have a composite structure: the attributes are disordered while each attribute value is an ordered list of tokens. We handle this problem by proposing a neural content-planner that can capture both local and global contexts of such a structure. Specifically, we propose a novel attention mechanism called GSC-attention. A key component of the GSC-attention is grouped-attention, which is token-level attention constrained within each input attribute that enables our proposed model captures both local and global context. Moreover, our content-planner explicitly learns content-selection, which is integrated into the content-planner to select the most important data to be included in the generated text via an attention masking procedure. Experimental results show that our model outperforms the competitors by 4.92%, 4.70%, and 16.56% in terms of Damerau-Levenshtein Distance scores on three real-world datasets.

Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser
Duo Zheng | Zipeng Xu | Fandong Meng | Xiaojie Wang | Jiaan Wang | Jie Zhou
Findings of the Association for Computational Linguistics: EMNLP 2021

Considering the importance of building a good Visual Dialog (VD) Questioner, many researchers study the topic under a Q-Bot-A-Bot image-guessing game setting, where the Questioner needs to raise a series of questions to collect information of an undisclosed image. Despite progress has been made in Supervised Learning (SL) and Reinforcement Learning (RL), issues still exist. Firstly, previous methods do not provide explicit and effective guidance for Questioner to generate visually related and informative questions. Secondly, the effect of RL is hampered by an incompetent component, i.e., the Guesser, who makes image predictions based on the generated dialogs and assigns rewards accordingly. To enhance VD Questioner: 1) we propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns entity-based questioning strategy from human dialogs; 2) we propose an Augmented Guesser that is strong and is optimized for VD especially. Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both image-guessing task and question diversity. Human study further verifies that our model generates more visually related, informative and coherent questions.

Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization
Junpeng Liu | Yanyan Zou | Hainan Zhang | Hongshen Chen | Zhuoye Ding | Caixia Yuan | Xiaojie Wang
Findings of the Association for Computational Linguistics: EMNLP 2021

Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges to abstractly summarize dialogues. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available via .

Slot Transferability for Cross-domain Slot Filling
Hengtong Lu | Zhuoxin Han | Caixia Yuan | Xiaojie Wang | Shuyu Lei | Huixing Jiang | Wei Wu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Dual Graph Convolutional Networks for Aspect-based Sentiment Analysis
Ruifan Li | Hao Chen | Fangxiang Feng | Zhanyu Ma | Xiaojie Wang | Eduard Hovy
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Aspect-based sentiment analysis is a fine-grained sentiment classification task. Recently, graph neural networks over dependency trees have been explored to explicitly model connections between aspects and opinion words. However, the improvement is limited due to the inaccuracy of the dependency parsing results and the informal expressions and complexity of online reviews. To overcome these challenges, in this paper, we propose a dual graph convolutional networks (DualGCN) model that considers the complementarity of syntax structures and semantic correlations simultaneously. Particularly, to alleviate dependency parsing errors, we design a SynGCN module with rich syntactic knowledge. To capture semantic correlations, we design a SemGCN module with self-attention mechanism. Furthermore, we propose orthogonal and differential regularizers to capture semantic correlations between words precisely by constraining attention scores in the SemGCN module. The orthogonal regularizer encourages the SemGCN to learn semantically correlated words with less overlap for each word. The differential regularizer encourages the SemGCN to learn semantic features that the SynGCN fails to capture. Experimental results on three public datasets show that our DualGCN model outperforms state-of-the-art methods and verify the effectiveness of our model.

Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Tongtong Liu | Fangxiang Feng | Xiaojie Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which make them impossible to apply in low-resource situations. How to obtain similar or even better performance than a larger model under the premise of less pre-training data and smaller model size has become an important problem. In this paper, we propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train a model in stages. We also design several different pre-training tasks suitable for the information granularity in different stage in order to efficiently capture the diverse knowledge from a limited corpus. We take a Simplified LXMERT (LXMERT-S) which is with 45.9% parameters of the original LXMERT model and only 11.44% of the original pre-training data as the testbed of our MSP method. Experimental results show that our method achieves comparable performance to the original LXMERT model in all downstream tasks, and even outperforms the original model in Image-Text Retrieval task.

2020

Connecting Embeddings for Knowledge Graph Entity Typing
Yu Zhao | Anxiang Zhang | Ruobing Xie | Kang Liu | Xiaojie Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Knowledge graph (KG) entity typing aims at inferring possible missing entity type instances in KG, which is a very significant but still under-explored subtask of knowledge graph completion. In this paper, we propose a novel approach for KG entity typing which is trained by jointly utilizing local typing knowledge from existing entity type assertions and global triple knowledge in KGs. Specifically, we present two distinct knowledge-driven effective mechanisms of entity type inference. Accordingly, we build two novel embedding models to realize the mechanisms. Afterward, a joint model via connecting them is used to infer missing entity type instances, which favors inferences that agree with both entity type instances and triple knowledge in KGs. Experimental results on two real-world datasets (Freebase and YAGO) demonstrate the effectiveness of our proposed mechanisms and models for improving KG entity typing. The source code and data of this paper can be obtained from: https://github.com/Adam1679/ConnectE .

2019

MrMep: Joint Extraction of Multiple Relations and Multiple Entity Pairs Based on Triplet Attention
Jiayu Chen | Caixia Yuan | Xiaojie Wang | Ziwei Bai
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

This paper focuses on how to extract multiple relational facts from unstructured text. Neural encoder-decoder models have provided a viable new approach for jointly extracting relations and entity pairs. However, these models either fail to deal with entity overlapping among relational facts, or neglect to produce the whole entity pairs. In this work, we propose a novel architecture that augments the encoder and decoder in two elegant ways. First, we apply a binary CNN classifier for each relation, which identifies all possible relations maintained in the text, while retaining the target relation representation to aid entity pair recognition. Second, we perform a multi-head attention over the text and a triplet attention with the target relation interacting with every token of the text to precisely produce all possible entity pairs in a sequential manner. Experiments on three benchmark datasets show that our proposed method successfully addresses the multiple relations and multiple entity pairs even with complex overlapping and significantly outperforms the state-of-the-art methods.

2015

Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment
Xuwen Wang | Qiang Zhang | Xiaojie Wang | Junlian Li
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Response Generation in Dialogue Using a Tailored PCFG Parser
Caixia Yuan | Xiaojie Wang | Qianhui He
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2009

Multi-Task Learning in Conditional Random Fields for Chunking in Shallow Semantic Parsing
Saike He | Xiaojie Wang | Yuan Dong | Taozheng Zhang | Xue Bai
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation
Saike He | Taozheng Zhang | Xue Bai | Xiaojie Wang | Yuan Dong
Proceedings of the Student Research Workshop

Accurate Learning for Chinese Function Tags from Minimal Features
Caixia Yuan | Fuji Ren | Xiaojie Wang
Proceedings of the ACL-IJCNLP 2009 Student Research Workshop

2008

BUPT Systems in the SIGHAN Bakeoff 2007
Ying Qin | Caixia Yuan | Jiashen Sun | Xiaojie Wang
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing

2006

Word Segmentation and Named Entity Recognition for SIGHAN Bakeoff3
Suxiang Zhang | Ying Qin | Juan Wen | Xiaojie Wang
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2004

Trajectory Based Word Sense Disambiguation
Xiaojie Wang | Yuji Matsumoto
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

Combining Segmenter and Chunker for Chinese Word Segmentation
Masayuki Asahara | Chooi Ling Goh | Xiaojie Wang | Yuji Matsumoto
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

1999

A new way to conceptual meaning representation
Xiaojie Wang | Yixin Zhong
Proceedings of Machine Translation Summit VII

Co-authors

Yuji Matsumoto 2

Taozheng Zhang 2

Masayuki Asahara 1

Hongshen Chen 1

Bayu Distiawan 1

Chooi-Ling Goh 1

Bingshuo Qian 1

Anxiang Zhang 1

Guangwei Zhang 1

Suxiang Zhang 1

Xiao-Hua Zhou 1

Venues