Tong Xu - ACL Anthology

Tong Xu

2026

To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination
Jing Ke | Zheyong Xie | Shaosheng Cao | Tong Xu | Enhong Chen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Mitigating toxic content is critical for maintaining a healthy social platform, yet existing detoxification systems face significant limitations: overcorrection from uniformly processing all toxic comments, and parallel data scarcity in paraphrasing model training. To tackle these challenges, we propose Detoxifiability-Aware Detoxification (DID), a novel paradigm that adaptively conducts filtering or paraphrasing for each toxic comment based on its detoxifiability, namely whether it can be paraphrased into a benign comment in essence. Specifically, DID integrates three core modules: (1) an unsupervised detoxifiability discriminator, (2) a semantic purification module that extracts harmful intents and then performs targeted paraphrasing only on detoxifiable comments and (3) a feedback-adaptive refinement loop that processes remaining harmful contents only when they are detoxifiable. Experimental results demonstrate that DID significantly outperforms existing approaches on academic data and an industrial platform, establishing a novel and practical modeling paradigm for comment detoxification.

2025

AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models
Qi Liu | Jingqing Ruan | Hao Li | Haodong Zhao | Desheng Wang | Jiansong Chen | Wan Guanglu | Xunliang Cai | Zhi Zheng | Tong Xu
Findings of the Association for Computational Linguistics: ACL 2025

Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at https://github.com/Javkonline/AMoPO.

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Xiangfeng Wang | Xiao Li | Yadong Wei | Songxueyu | Yang Song | Xiaxiaoqiang | Fangrui Zeng | Zaiyi Chen | Liuliu | Gu Xu | Tong Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a Human-Inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 2500 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

Streamlining the Collaborative Chain of Models into A Single Forward Pass in Generation-Based Tasks
Yuanjie Lyu | Chao Zhang | Yuhao Chen | Yong Chen | Tong Xu
Findings of the Association for Computational Linguistics: ACL 2025

In Retrieval-Augmented Generation (RAG) and agent-based frameworks, the “Chain of Models” approach is widely used, where multiple specialized models work sequentially on distinct sub-tasks. This approach is effective but increases resource demands as each model must be deployed separately. Recent advancements attempt to address this by applying prompt tuning, which allows a shared base model to adapt to multiple tasks with minimal parameter changes. However, a key challenge remains: intermediate outputs, passed between models as plain text, require recomputation of hidden states (i.e., Key and Value (KV) states in Transformers) during inference. In this paper, we introduce FTHSS, a novel prompt-tuning method that enables models to share KV hidden states, eliminating redundant forward passes and reducing KV cache storage. By modifying input and attention masks during training, FTHSS allows models to effectively utilize KV hidden states from prior models in both single- and multi-round scenarios. Empirical results on four tasks show that FTHSS matches the performance of traditional model chains while improving inference efficiency.

iPET: An Interactive Emotional Companion Dialogue System with LLM-Powered Virtual Pet World Simulation
Zheyong Xie | Shaosheng Cao | Zuozhu Liu | Zheyu Ye | Zihan Niu | Chonggang Lu | Tong Xu | Enhong Chen | Zhe Xu | Yao Hu | Wei Lu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

The rapid advancement of large language models (LLMs) has unlocked transformative potential for role-playing emotional companion products, enabling systems that support emotional well-being, educational development, and therapeutic applications. However, existing approaches often lack sustained personalization and contextual adaptability, limiting their effectiveness in real-world settings. In this paper, we introduce iPET, an LLM-powered virtual pet agent designed to enhance user engagement through rich, dynamic pet behaviors and interactions tailored to individual preferences. iPET comprises three core components: a dialogue module that instantiates virtual pet agents for emotionally interactive conversations; a memory module that stores and synthesizes records of both agent and user experiences; and a world simulation module that generates diverse, preference-driven pet behaviors guided by high-level reflections. Deployed for over 200 days in a real-world, non-commercial product, iPET has served millions of users – providing emotional support to psychologically distressed individuals and demonstrating its effectiveness in practical applications.

Think and Recall: Layer-Level Prompting for Lifelong Model Editing
Jinke Wang | Zenan Ying | Qi Liu | Wei Chen | Tong Xu | Huijun Hou | Zhi Zheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Lifelong model editing aims to dynamically adjust a model’s output with respect to specific facts, knowledge points, or behaviors, enabling the model to adapt to the ever-changing demands of the real world without requiring retraining. While some retrieval-based methods have demonstrated potential in lifelong editing scenarios by storing edited knowledge in external memory, they often suffer from limitations in usability, such as requiring additional training corpora or lacking support for reversible and detachable edits.To address these issues, we propose a plug-and-play method for knowledge retrieval and storage, i.e., Layer-Level Prompting (LLP), which enables seamless and efficient lifelong model editing. In our LLP framework, the reasoning process of LLMs is divided into two stages, respectively knowledge retrieval (Think) and knowledge injection(Recall). Specifically, the knowledge retrieval process is performed in the early layers of the model. Based on the retrieved information, the model is guided to access the updated knowledge stored in the subsequent layer to complete the knowledge editing process. Experimental results demonstrate that our method consistently outperforms existing techniques on lifelong model editing tasks, achieving superior performance on question answering and hallucination benchmarks across different LLMs.

Think Wider, Detect Sharper: Reinforced Reference Coverage for Document-Level Self-Contradiction Detection
Yuhao Chen | Yuanjie Lyu | Shuochen Liu | Chao Zhang | Junhui Lv | Tong Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Detecting self-contradictions within documents is a challenging task for ensuring textual coherence and reliability. While large language models (LLMs) have advanced in many natural language understanding tasks, document-level self-contradiction detection (DSCD) remains insufficiently studied. Recent approaches leveraging Chain-of-Thought (CoT) prompting aim to enhance reasoning and interpretability; however, they only gain marginal improvement and often introduce inconsistencies across repeated responses. We observe that such inconsistency arises from incomplete reasoning chains that fail to include all relevant contradictory sentences consistently. To address this, we propose a two-stage method that combines supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance DSCD performance. In the SFT phase, a teacher model helps the model learn reasoning patterns, while RL further refines its reasoning ability. Our method incorporates a task-specific reward function to expand the model’s reasoning scope, boosting both accuracy and consistency. On the ContraDoc benchmark, our approach significantly boosts Llama 3.1-8B-Instruct’s accuracy from 38.5% to 51.1%, and consistency from 59.6% to76.2%.

Following Occam’s Razor: Dynamic Combination of Structured Knowledge for Multi-Hop Question Answering using LLMs
Wei Chen | Zhi Zheng | Lili Zhao | Huijun Hou | Tong Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Multi-hop question answering is a challenging task that requires capturing information from different positions in multiple documents. Recently, several methods propose to enhance Large Language Models (LLMs) by incorporating structured knowledge, aiming to grasp key information for solving this task. Despite certain achievements, they still face the following challenges: 1) The neglect of text-based reasoning capabilities. 2) Information redundancy between text and triples. 3) Information loss during structured knowledge extraction. To solve the above challenges, in this paper, we propose Dynamic Combination of Structured Knowledge (DCSK), a novel framework for integrating text-based and triple-based paradigms. Following Occam’s Razor, DCSK dynamically determine the necessity of structured knowledge by the designed multi-faceted evaluation, which systematically assess the correctness, clarity, and informativeness of text-based prediction. For questions that require structured knowledge, we develop an iterative fact refiner that screens for question-relevant triples, verifies their factual adequacy, and thereby effectively excludes irrelevant and redundant information. Furthermore, based on the verification, we construct an adaptive knowledge reasoner that dynamically adjusts the need for text supplementation, thus mitigating the information deficiency in selected triples. Extensive experiments on three MHQA datasets demonstrate the efficiency and effectiveness of DCSK.

MLLM-I2W: Harnessing Multimodal Large Language Model for Zero-Shot Composed Image Retrieval
Tong Bao | Che Liu | Derong Xu | Zhi Zheng | Tong Xu
Proceedings of the 31st International Conference on Computational Linguistics

Combined Image Retrieval (CIR) involves retrieving an image based on a reference image and a brief text description, which is widely present in various scenarios such as fashion recommendation. Existing methods can be mainly divided into two categories, respectively supervised CIR methods and Zero-Shot CIR (ZS-CIR) methods. In contrast to supervised CIR methods, which need manually annotated triples for training task-specific models, ZS-CIR models can be trained using images datasets only and performs well. However, ZS-CIR still faces the primary challenge of learning how to map pseudo-words to images within the joint image-text embedding space. Therefore, in this paper, we propose a novel image-text mapping network, named MLLM-I2W, which adaptively converts description-related image information into pseudo-word markers for precise ZS-CIR. Specifically, the image and text encoding enhancement module within the MLLM prompt selects subject headings and generates text descriptions. It then reduces the modality gap between images and text using uncertainty modeling. An adaptive weighting module and a prototype are proposed to adjust and learn the deep fusion features, which are further mapped to pseudo-word markers via well-designed MOE-based mapping network. Our model demonstrates consistent improvements across common CIR benchmarks, including COCO, CIRR, and Fashion-IQ.

2024

Granular Entity Mapper: Advancing Fine-grained Multimodal Named Entity Recognition and Grounding
Ziqi Wang | Chen Zhu | Zhi Zheng | Xinhang Li | Tong Xu | Yongyi He | Qi Liu | Ying Yu | Enhong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Multimodal Named Entity Recognition and Grounding (MNERG) aims to extract paired textual and visual entities from texts and images. It has been well explored through a two-step paradigm: initially identifying potential visual entities using object detection methods and then aligning the extracted textual entities with their corresponding visual entities. However, when it comes to fine-grained MNERG, the long-tailed distribution of textual entity categories and the performance of object detectors limit the effectiveness of traditional methods. Specifically, more detailed classification leads to many low-frequency categories, and existing object detection methods often fail to pinpoint subtle regions within images. To address these challenges, we propose the Granular Entity Mapper (GEM) framework. Firstly, we design a multi-granularity entity recognition module, followed by a reranking module based on the Multimodal Large Language Model (MLLM) to incorporate hierarchical information of entity categories, visual cues, and external textual resources collectively for accurate fine-grained textual entity recognition. Then, we utilize a pre-trained Large Visual Language Model (LVLM) as an implicit visual entity grounder that directly deduces relevant visual entity regions from the entire image without the need for bounding box training. Experimental results on the GMNER and FMNERG datasets demonstrate that our GEM framework achieves state-of-the-art results on the fine-grained content extraction task.

Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation
Yuanjie Lyu | Zihan Niu | Zheyong Xie | Chao Zhang | Tong Xu | Yang Wang | Enhong Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in LLM generation, inputting the entire document may introduce off-topic information, causing the model to deviate from the central topic and affecting the relevance of the generated content. To address these issues, we propose the Retrieve-Plan-Generation (RPG) framework. RPG generates plan tokens to guide subsequent generation in the plan stage. In the answer stage, the model selects relevant fine-grained paragraphs based on the plan and uses them for further answer generation. This plan-answer process is repeated iteratively until completion, enhancing generation relevance by focusing on specific topics. To implement this framework efficiently, we utilize a simple but effective multi-task prompt-tuning method, enabling the existing LLMs to handle both planning and answering. We comprehensively compare RPG with baselines across 5 knowledge-intensive generation tasks, demonstrating the effectiveness of our approach.

Visualization Recommendation with Prompt-based Reprogramming of Large Language Models
Xinhang Li | Jingbo Zhou | Wei Chen | Derong Xu | Tong Xu | Enhong Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visualization recommendations, which aim to automatically match proper visual charts for specific data tables, can significantly simplify the data analysis process. Traditional approaches in this domain have primarily relied on rule-based or machine learning-based methodologies. These methods often demand extensive manual maintenance and yet fail to fully comprehend the tabular data, leading to unsatisfactory performance. Recently, Large Language Models (LLMs) have emerged as powerful tools, exhibiting strong reasoning capabilities. This advancement suggests their substantial promise in addressing visualization recommendation challenges. However, effectively harnessing LLMs to discern and rationalize patterns in tabular data, and consequently deduce the essential information for chart generation, remains an unresolved challenge. To this end, we introduce a novel Hierarchical Table Prompt-based reprogramming framework, named HTP. This framework aims to integrate multi-dimensional tabular data into LLMs through a strategically crafted prompt learning method while keeping the LLMs’ backbone and weights unaltered. The HTP framework uniquely incorporates a four-level prompt structure, encompassing general, instance, cluster, and column levels. This multi-level approach is engineered to provide a comprehensive understanding of both general distribution and multifaceted fine-grained features of tabular data, before inputting the tabular data into the frozen LLM. Our empirical studies confirm that the HTP framework achieves state-of-the-art performance, marking an advancement in the field of data visualization and analysis. The code and data will be made publicly available upon acceptance.

FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
Junyi Zhu | Shuochen Liu | Yu Yu | Bo Tang | Yibo Yan | Zhiyu Li | Feiyu Xiong | Tong Xu | Matthew B. Blaschko
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs’ context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by updating only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model’s ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem’s potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem.

MRT: Multi-modal Short- and Long-range Temporal Convolutional Network for Time-sync Comment Video Behavior Prediction
Weihao Zhao | Weidong He | Hao Wang | Haoyang Bi | Han Wu | Chen Zhu | Tong Xu | Enhong Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

As a fresh way to improve the user viewing experience, videos of time-sync comments have attracted a lot of interest. Many efforts have been made to explore the effectiveness of time-sync comments for various applications. However, due to the complexity of interactions among users, videos, and comments, it still remains challenging to understand users’ behavior on time-sync comments. Along this line, we study the problem of time-sync comment behavior prediction with considerations of both historical behaviors and multi-modal information of visual frames and textual comments. Specifically, we propose a novel Multi-modal short- and long-Range Temporal Convolutional Network model, namely MRT. Firstly, we design two amplified Temporal Convolutional Networks with different sizes of receptive fields, to capture both short- and long-range surrounding contexts for each frame and time-sync comments. Then, we design a bottle-neck fusion module to obtain the multi-modal enhanced representation. Furthermore, we take the user preferences into consideration to generate the personalized multi-model semantic representation at each timestamp. Finally, we utilize the binary cross-entropy loss to optimize MRT on the basis of users’ historical records. Through comparing with representative baselines, we demonstrate the effectiveness of MRT and qualitatively verify the necessity and utility of short- and long-range contextual and multi-modal information through extensive experiments.

QDMR-based Planning-and-Solving Prompting for Complex Reasoning Tasks
Jinfeng Huang | Qiaoqiao She | Wenbin Jiang | Hua Wu | Yang Hao | Tong Xu | Feng Wu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Chain-of-Thought prompting has improved reasoning capability of large language models (LLM). However, it still is challenging to guarantee the effectiveness and stability for questions requiring complicated reasoning. Recently, Plan-and-Solve prompting enhances the reasoning capability for complex questions by planning the solution steps firstly and then solving them step by step, but it suffers the difficulty to represent and execute the problem-solving logic of complex questions. To deal with these challenges, in this work, we propose a novel Plan-and-Solve prompting method based on Question Decomposition Meaning Representation (QDMR). Specifically, this method first allows the LLM to generate a QDMR graph to represent the problem-solving logic, which is a directed acyclic graph composed of sub-questions. Then, the LLM generates a specific solving process based on the QDMR graph. When solving each sub-question, it can locate the preceding sub-questions and their answers according to the QDMR graph, and then utilize this information for solution. Compared with existing Plan-and-Solve prompting techniques, our method can not only represent the problem-solving logic of complicated questions more accurately with the aid of QDMR graph, but also deliver the dependence information accurately for different solution steps according to the QDMR graph. In addition, with the supervised fine-tuning on the Allen Institute dataset, the decomposing capability of LLM for complicated questions can be considerably enhanced. Extensive experiments show that our method has achieve a great significance in arithmetic reasoning and commonsense reasoning task by comparing the classical Chain-of-Thought prompting and Plan-and-Solve prompting techniques, and the improvements achieved are even greater for problems with more reasoning steps.

Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding
Derong Xu | Ziheng Zhang | Zhihong Zhu | Zhenxi Lin | Qidong Liu | Xian Wu | Tong Xu | Xiangyu Zhao | Yefeng Zheng | Enhong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

The impressive capabilities of large language models (LLMs) have attracted extensive interests of applying LLMs to medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.

Multi-perspective Improvement of Knowledge Graph Completion with Large Language Models
Derong Xu | Ziheng Zhang | Zhenxi Lin | Xian Wu | Zhihong Zhu | Tong Xu | Xiangyu Zhao | Yefeng Zheng | Enhong Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Knowledge graph completion (KGC) is a widely used method to tackle incompleteness in knowledge graphs (KGs) by making predictions for missing links. Description-based KGC leverages pre-trained language models to learn entity and relation representations with their names or descriptions, which shows promising results. However, the performance of description-based KGC is still limited by the quality of text and the incomplete structure, as it lacks sufficient entity descriptions and relies solely on relation names, leading to sub-optimal results. To address this issue, we propose MPIKGC, a general framework to compensate for the deficiency of contextualized knowledge and improve KGC by querying large language models (LLMs) from various perspectives, which involves leveraging the reasoning, explanation, and summarization capabilities of LLMs to expand entity descriptions, understand relations, and extract structures, respectively. We conducted extensive evaluation of the effectiveness and improvement of our framework based on four description-based KGC models, for both link prediction and triplet classification tasks. All codes and generated data will be publicly available after review.

In-Context Former: Lightning-fast Compressing Context for Large Language Model
Xiangfeng Wang | Zaiyi Chen | Tong Xu | Zheyong Xie | Yongyi He | Enhong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach to mitigate these costs is compressing the long input contexts. Existing methods typically leverage the self-attention mechanism of the large model itself for context compression. While these methods have achieved notable results, the compression process still entails quadratic complexity. To mitigate this limitation, we propose the In-Context Former (IC-Former). This method does not rely on the target large model but instead utilizes cross-attention mechanisms to extract and condense information from the contextual embeddings. The computational overhead of our method grows linearly with the compression range. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times while achieving 90% of the baseline performance on evaluation metrics. Additionally, IC-Former demonstrates strong regularity in its interactions with the context, enhancing its interpretability. Overall, IC-Former significantly reduces compression costs, making real-time compression scenarios feasible.

Double-Checker: Large Language Model as a Checker for Few-shot Named Entity Recognition
Wei Chen | Lili Zhao | Zhi Zheng | Tong Xu | Yang Wang | Enhong Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Recently, few-shot Named Entity Recognition (NER) has attracted significant attention due to the high cost of obtaining high-quality labeled data. Decomposition-based methods have demonstrated remarkable performance on this task, which initially train a type-independent span detector and subsequently classify the detected spans based on their types. However, this framework has an evident drawback as a domain-agnostic detector cannot ensure the identification of only those entity spans that are specific to the target domain. To address this issue, we propose Double-Checker, which leverages collaboration between Large Language Models (LLMs) and small models. Specifically, we employ LLMs to verify candidate spans predicted by the small model and eliminate any spans that fall outside the scope of the target domain. Extensive experiments validate the effectiveness of our method, consistently yielding improvements over two baseline approaches. Our code is available at https://github.com/fanshu6hao/Double-Checker.

2023

The MineTrans Systems for IWSLT 2023 Offline Speech Translation and Speech-to-Speech Translation Tasks
Yichao Du | Guo Zhengsheng | Jinchuan Tian | Zhirui Zhang | Xing Wang | Jianwei Yu | Zhaopeng Tu | Tong Xu | Enhong Chen
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper presents the extscMineTrans English-to-Chinese speech translation systems developed for two challenge tracks of IWSLT 2023, i.e., Offline Speech Translation (S2T) and Speech-to-Speech Translation (S2ST). For the S2T track, extscMineTrans employs a practical cascaded system to explore the limits of translation performance in both constrained and unconstrained settings, where the whole system consists of automatic speech recognition (ASR), punctuation recognition (PC), and machine translation (MT) modules. We also investigate the effectiveness of multiple ASR architectures and explore two MT strategies: supervised in-domain fine-tuning and prompt-guided translation using a large language model. For the S2ST track, we explore a speech-to-unit (S2U) framework to build an end-to-end S2ST system. This system encodes the target speech as discrete units via our trained HuBERT. Then it leverages the standard sequence-to-sequence model to directly learn the mapping between source speech and discrete units without any auxiliary recognition tasks (i.e., ASR and MT tasks). Various efforts are made to improve the extscMineTrans’s performance, such as acoustic model pre-training on large-scale data, data filtering, data augmentation, speech segmentation, knowledge distillation, consistency training, model ensembles, etc.

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark
Wenjun Peng | Jingwei Yi | Fangzhao Wu | Shangxi Wu | Bin Bin Zhu | Lingjuan Lyu | Binxing Jiao | Tong Xu | Guangzhong Sun | Xing Xie
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated powerful capabilities in both text understanding and generation. Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers. However, previous studies have shown that EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs, as training these models is extremely expensive. To protect the copyright of LLMs for EaaS, we propose an Embedding Watermark method called {pasted macro ‘METHOD’} that implants backdoors on embeddings. Our method selects a group of moderate-frequency words from a general text corpus to form a trigger set, then selects a target embedding as the watermark, and inserts it into the embeddings of texts containing trigger words as the backdoor. The weight of insertion is proportional to the number of trigger words included in the text. This allows the watermark backdoor to be effectively transferred to EaaS-stealer’s model for copyright verification while minimizing the adverse impact on the original embeddings’ utility. Our extensive experiments on various datasets show that our method can effectively protect the copyright of EaaS models without compromising service quality. Our code is available at https://github.com/yjw1029/EmbMarker.

2022

Towards Table-to-Text Generation with Pretrained Language Model: A Table Structure Understanding and Text Deliberating Approach
Miao Chen | Xinjiang Lu | Tong Xu | Yanyan Li | Zhou Jingbo | Dejing Dou | Hui Xiong
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Although remarkable progress on the neural table-to-text methods has been made, the generalization issues hinder the applicability of these models due to the limited source tables. Large-scale pretrained language models sound like a promising solution to tackle such issues. However, how to effectively bridge the gap between the structured table and the text input by fully leveraging table information to fuel the pretrained model is still not well explored. Besides, another challenge of integrating the deliberation mechanism into the text-to-text pretrained model for solving the table-to-text task remains seldom studied. In this paper, to implement the table-to-text generation with pretrained language model, we propose a table structure understanding and text deliberating approach, namely TASD. To be specific, we devise a three-layered multi-head attention network to realize the table-structureaware text generation model with the help of the pretrained language model. Furthermore, a multi-pass decoder framework is adopted to enhance the capability of polishing generated text for table descriptions. The empirical studies, as well as human evaluation, on two public datasets, validate that our approach can generate faithful and fluent descriptive texts for different types of tables.

VIRT: Improving Representation-based Text Matching via Virtual Interaction
Dan Li | Yang Yang | Hongyin Tang | Jiahao Liu | Qifan Wang | Jingang Wang | Tong Xu | Wei Wu | Enhong Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Text matching is a fundamental research problem in natural language understanding. Interaction-based approaches treat the text pair as a single sequence and encode it through cross encoders, while representation-based models encode the text pair independently with siamese or dual encoders. Interaction-based models require dense computations and thus are impractical in real-world applications. Representation-based models have become the mainstream paradigm for efficient text matching. However, these models suffer from severe performance degradation due to the lack of interactions between the pair of texts. To remedy this, we propose a Virtual InteRacTion mechanism (VIRT) for improving representation-based text matching while maintaining its efficiency. In particular, we introduce an interactive knowledge distillation module that is only applied during training. It enables deep interaction between texts by effectively transferring knowledge from the interaction-based model. A light interaction strategy is designed to fully leverage the learned interactive knowledge. Experimental results on six text matching benchmarks demonstrate the superior performance of our method over several state-of-the-art representation-based models. We further show that VIRT can be integrated into existing methods as plugins to lift their performances.

Non-Parametric Domain Adaptation for End-to-End Speech Translation
Yichao Du | Weizhi Wang | Zhirui Zhang | Boxing Chen | Tong Xu | Jun Xie | Enhong Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The end-to-end speech translation (E2E-ST) has received increasing attention due to the potential of its less error propagation, lower latency and fewer parameters. However, the effectiveness of neural-based approaches to this task is severely limited by the available training corpus, especially for domain adaptation where in-domain triplet data is scarce or nonexistent. In this paper, we propose a novel non-parametric method that leverages in-domain text translation corpus to achieve domain adaptation for E2E-ST systems. To this end, we first incorporate an additional encoder into the pre-trained E2E-ST model to realize text translation modeling, based on which the decoder’s output representations for text and speech translation tasks are unified by reducing the correspondent representation mismatch in available triplet training data. During domain adaptation, a k-nearest-neighbor (kNN) classifier is introduced to produce the final translation distribution using the external datastore built by the domain-specific text translation corpus, while the universal output representation is adopted to perform a similarity search. Experiments on the Europarl-ST benchmark demonstrate that when in-domain text translation data is involved only, our proposed approach significantly improves baseline by 12.82 BLEU on average in all translation directions, even outperforming the strong in-domain fine-tuning strategy.

2020

BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models
Bin He | Di Zhou | Jinghui Xiao | Xin Jiang | Qun Liu | Nicholas Jing Yuan | Tong Xu
Findings of the Association for Computational Linguistics: EMNLP 2020

Complex node interactions are common in knowledge graphs (KGs), and these interactions can be considered as contextualized knowledge exists in the topological structure of KGs. Traditional knowledge representation learning (KRL) methods usually treat a single triple as a training unit, neglecting the usage of graph contextualized knowledge. To utilize these unexploited graph-level knowledge, we propose an approach to model subgraphs in a medical KG. Then, the learned knowledge is integrated with a pre-trained language model to do the knowledge generalization. Experimental results demonstrate that our model achieves the state-of-the-art performance on several medical NLP tasks, and the improvement above MedERNIE indicates that graph contextualized knowledge is beneficial.

2019

Capsule Network with Interactive Attention for Aspect-Level Sentiment Classification
Chunning Du | Haifeng Sun | Jingyu Wang | Qi Qi | Jianxin Liao | Tong Xu | Ming Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Aspect-level sentiment classification is a crucial task for sentiment analysis, which aims to identify the sentiment polarities of specific targets in their context. The main challenge comes from multi-aspect sentences, which express multiple sentiment polarities towards different targets, resulting in overlapped feature representation. However, most existing neural models tend to utilize static pooling operation or attention mechanism to identify sentimental words, which therefore insufficient for dealing with overlapped features. To solve this problem, we propose to utilize capsule network to construct vector-based feature representation and cluster features by an EM routing algorithm. Furthermore, interactive attention mechanism is introduced in the capsule routing procedure to model the semantic relationship between aspect terms and context. The iterative routing also enables encoding sentence from a global perspective. Experimental results on three datasets show that our proposed model achieves state-of-the-art performance.

Co-authors

Shaosheng Cao 2

Xiangfeng Wang 2

Matthew B. Blaschko 1

Jiansong Chen 1

Jinfeng Huang 1

Jingqing Ruan 1

Guangzhong Sun 1

Jinchuan Tian 1

Hua Wu (吴华) 1

Nicholas Jing Yuan 1

Guo Zhengsheng 1

Bin Benjamin Zhu 1

Venues