Xiang Wan - ACL Anthology

Xiang Wan

2025

Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Zhuo Li | Yuhao Du | Jinpeng Hu | Xiang Wan | Anningzhe Gao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Improving prompt quality is crucial for enhancing the performance of large language models (LLMs), particularly for Black-Box models like GPT4. Existing prompt refinement methods, while effective, often suffer from semantic inconsistencies between refined and original prompts, and fail to maintain users’ real intent. To address these challenges, we propose a self-instructed in-context learning framework that generates reliable derived prompts, keeping semantic consistency with the original prompts. Specifically, our framework incorporates a reinforcement learning mechanism, enabling direct interaction with the response model during prompt generation to better align with human preferences. We then formulate the querying as an in-context learning task, combining responses from LLMs with derived prompts to create a contextual demonstration for the original prompt. This approach effectively enhances alignment, reduces semantic discrepancies, and activates the LLM’s in-context learning ability for generating more beneficial response. Extensive experiments demonstrate that the proposed method not only generates better derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, particularly for Black-Box models like GPT4.

Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm
Zhuo Li | Yuhao Du | Xiaoqi Jiao | Steven Y. Guo | Yuege Feng | Xiang Wan | Anningzhe Gao | Jinpeng Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing the overall value of selected data, focusing primarily on individual quality, and struggle to strike an effective balance between ensuring diversity and minimizing data point traversals. Therefore, this paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples when incorporated into the subset. Thanks to the advanced language understanding capabilities of LLMs, we utilize LLMs to evaluate the value of each option during the selection process. Furthermore, we design a greedy sampling process where samples are incrementally added to the subset, thereby improving efficiency by eliminating the need for exhaustive traversal of the entire dataset with the limited budget. Extensive experiments demonstrate that selected data from our method not only surpasses the performance of the full dataset but also achieves competitive results with recent powerful studies, while requiring fewer selections. Moreover, we validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.

APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
Zhuo Li | Yuege Feng | Dandan Guo | Jinpeng Hu | Anningzhe Gao | Xiang Wan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we design to dynamically adjust the RM focus on more challenging samples through margins, based on both semantic similarity and model-predicted reward differences, which is approached from a distributional perspective solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments support our practical effectiveness in better aligning LLMs with human preferences.

Atoxia: Red-teaming Large Language Models with Target Toxic Answers
Yuhao Du | Zhuo Li | Pengyu Cheng | Xiang Wan | Anningzhe Gao
Findings of the Association for Computational Linguistics: NAACL 2025

Despite the substantial advancements in artificial intelligence, large language models (LLMs) remain being challenged by generation safety. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the necessity for robust LLM red-teaming strategies to identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that **A**ttacks LLMs with **T**arget **Toxi**c **A**nswers (**Atoxia**). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained within a reinforcement learning scheme with the LLM outputting probability of the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks in not only open-source models but also state-of-the-art black-box models such as GPT-4o.

Huatuo-26M, a Large-scale Chinese Medical QA Dataset
Xidong Wang | Jianquan Li | Shunian Chen | Yuxuan Zhu | Xiangbo Wu | Zhiyi Zhang | Xiaolong Xu | Junying Chen | Jie Fu | Xiang Wan | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025

Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck hindering community progress. In this paper, we release the largest ever medical Question Answering (QA) dataset with 26 Million QA pairs named Huatuo-26M. We benchmark many existing approaches in our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as a fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical model as a pre-training corpus. Our empirical findings substantiate the dataset’s utility in these domains, thereby confirming its significance as a resource in the medical QA landscape.

CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Junying Chen | Chi Gui | Anningzhe Gao | Ke Ji | Xidong Wang | Xiang Wan | Benyou Wang
Findings of the Association for Computational Linguistics: ACL 2025

The field of AI healthcare has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces **Chain-of-Diagnosis (CoD)** to enhance the interpretability of medical automatic diagnosis. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician’s thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through the entropy reduction of confidences. With CoD, we developed **DiagnosisGPT**, capable of diagnosing 9,604 diseases for validating CoD. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on automatic diagnostic tasks across three real-world benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.

Unlocking LLMs’ Self-Improvement Capacity with Autonomous Learning for Domain Adaptation
Ke Ji | Junying Chen | Anningzhe Gao | Wenya Xie | Xiang Wan | Benyou Wang
Findings of the Association for Computational Linguistics: ACL 2025

Self-supervised pre-training and instruction fine-tuning demonstrate the potential of large language models (LLMs) for domain adaptation (DA). In pursuit of superhuman performance, LLMs have demonstrated significant potential in math and coding through self-improvement algorithms that rely on iterative training with self-generated data. This success stems from the clear reward signals in these environments, which provide a solid foundation for self-improvement. However, when it comes to general DA scenarios, two main challenges emerge: 1) ambiguous self-improvement reward signals and 2) lack of high-quality instruction fine-tuning datasets. This motivates this paper addresses how LLMs can adapt autonomously to new domains using only a large amount of unlabeled target corpora. Inspired by the human practice of self-reflection through open- and closed-book exercises to achieve domain generalization, we propose autonomous learning, which creates a self-improvement learning environment for DA. Here, the model generates questions from documents and conducts two explorations—one with the original document and one with a masked version. By comparing these explorations, the LLMs can independently identify and enhance its policy for reducing knowledge gaps. Experiments across various DA tasks demonstrate that autonomous learning enhances the DA performance of existing models, outperforming traditional fine-tuning and self-improvement methods. Our code is publicly available at https://github.com/FreedomIntelligence/AL.

Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating objective queries without considering real-world user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to the evaluation methodology, where it is difficult to define the ground-truth answers for them. To this end, in our paper, we propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating the evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.

2024

PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator
Chuyi Kong | Yaxin Fan | Xiang Wan | Feng Jiang | Benyou Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The unparalleled performance of closed-sourced ChatGPT has sparked efforts towards its democratization, with notable strides made by leveraging real user and ChatGPT dialogues, as evidenced by Vicuna. However, due to challenges in gathering dialogues involving human participation, current endeavors like Baize and UltraChat rely on ChatGPT conducting roleplay to simulate humans based on instructions, resulting in overdependence on seeds, diminished human-likeness, limited topic diversity, and an absence of genuine multi-round conversational dynamics. To address the above issues, we propose a paradigm to simulate human behavior better and explore the benefits of incorporating more human-like questions in multi-turn conversations. Specifically, we directly target human questions extracted from genuine human-machine conversations as a learning goal and provide a novel user simulator called ‘Socratic‘. The experimental results show our response model, ‘PlatoLM‘, achieves SoTA performance among LLaMA-based 7B models in MT-Bench. Our findings further demonstrate that our method introduces highly human-like questioning patterns and rich topic structures, which can teach the response model better than previous works in multi-round conversations.

Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen | Chi Gui | Ruyi Ouyang | Anningzhe Gao | Shunian Chen | Guiming Hardy Chen | Xidong Wang | Zhenyang Cai | Ke Ji | Xiang Wan | Benyou Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they often fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the **PubMedVision** dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM **HuatuoGPT-Vision**, which shows superior performance in medical multimodal scenarios among open-source MLLMs. Our code and data are available at https://github.com/FreedomIntelligence/HuatuoGPT-Vision.

CMB: A Comprehensive Medical Benchmark in Chinese
Xidong Wang | Guiming Chen | Song Dingjie | Zhang Zhiyi | Zhihong Chen | Qingying Xiao | Junying Chen | Feng Jiang | Jianquan Li | Xiang Wan | Benyou Wang | Haizhou Li
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed ‘AceGPT’, sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.

2023

One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems
Yajiao Liu | Xin Jiang | Yichun Yin | Yasheng Wang | Fei Mi | Qun Liu | Xiang Wan | Benyou Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

User simulators are agents designed to imitate human users; recent advances have found that Task-oriented Dialogue (ToD) systems optimized toward a user simulator could better satisfy the need of human users. However, this might result in a sub-optimal ToD system if it is tailored to only one ad hoc user simulator, since human users can behave differently. In this paper, we propose a framework called MUST to optimize ToD systems via leveraging Multiple User SimulaTors. The main challenges of implementing MUST fall in 1) how to adaptively determine which user simulator to interact with the ToD system at each optimization step, since the ToD system might be over-fitted to some specific user simulators, and simultaneously under-fitted to some others; 2) how to avoid catastrophic forgetting of the adaption for a simulator that is not selected for several consecutive optimization steps.To tackle these challenges, we formulate MUST as a Multi-armed bandits (MAB) problem and provide a method called MUST_adaptive that balances i) the boosting adaption for adaptive interactions between different user simulators and the ToD system andii) the uniform adaption to avoid the catastrophic forgetting issue.With both automatic evaluations and human evaluations, our experimental results on MultiWOZ show that the dialogue system trained by MUST achieves a better performance than those trained by a single user simulator. It also has a better generalization ability when testing with unseen user simulators.

Toward Expanding the Scope of Radiology Report Summarization to Multiple Anatomies and Modalities
Zhihong Chen | Maya Varma | Xiang Wan | Curtis Langlotz | Jean-Benoit Delbrouck
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Radiology report summarization (RRS) is a growing area of research. Given the Findings section of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. However, RRS currently faces essential limitations. First, many prior studies conduct experiments on private datasets, preventing reproduction of results and fair comparisons across different systems and solutions. Second, most prior approaches are evaluated solely on chest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS) involving three new modalities and seven new anatomies based on the MIMIC-III and MIMIC-CXR datasets. We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a factual correctness metric.

Improving Grammatical Error Correction with Multimodal Feature Integration
Tao Fang | Jinpeng Hu | Derek F. Wong | Xiang Wan | Lidia S. Chao | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: ACL 2023

Grammatical error correction (GEC) is a promising task aimed at correcting errors in a text. Many methods have been proposed to facilitate this task with remarkable results. However, most of them only focus on enhancing textual feature extraction without exploring the usage of other modalities’ information (e.g., speech), which can also provide valuable knowledge to help the model detect grammatical errors. To shore up this deficiency, we propose a novel framework that integrates both speech and text features to enhance GEC. In detail, we create new multimodal GEC datasets for English and German by generating audio from text using the advanced text-to-speech models. Subsequently, we extract acoustic and textual representations by a multimodal encoder that consists of a speech and a text encoder. A mixture-of-experts (MoE) layer is employed to selectively align representations from the two modalities, and then a dot attention mechanism is used to fuse them as final multimodal representations. Experimental results on CoNLL14, BEA19 English, and Falko-MERLIN German show that our multimodal GEC models achieve significant improvements over strong baselines and achieve a new state-of-the-art result on the Falko-MERLIN test set.

Improving Radiology Summarization with Radiograph and Anatomy Prompts
Jinpeng Hu | Zhihong Chen | Yang Liu | Xiang Wan | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: ACL 2023

The impression is crucial for the referring physicians to grasp key information since it is concluded from the findings and reasoning of radiologists. To alleviate the workload of radiologists and reduce repetitive human labor in impression writing, many researchers have focused on automatic impression generation. However, recent works on this task mainly summarize the corresponding findings and pay less attention to the radiology images. In clinical, radiographs can provide more detailed valuable observations to enhance radiologists’ impression writing, especially for complicated cases. Besides, each sentence in findings usually focuses on single anatomy, such that they only need to be matched to corresponding anatomical regions instead of the whole image, which is beneficial for textual and visual features alignment. Therefore, we propose a novel anatomy-enhanced multimodal model to promote impression generation. In detail, we first construct a set of rules to extract anatomies and put these prompts into each sentence to highlight anatomy characteristics. Then, two separate encoders are applied to extract features from the radiograph and findings. Afterward, we utilize a contrastive learning module to align these two representations at the overall level and use a co-attention to fuse them at the sentence level with the help of anatomy-enhanced sentence representation. The experimental results on two benchmark datasets confirm the effectiveness of the proposed method, which achieves state-of-the-art results.

On the Difference of BERT-style and CLIP-style Text Encoders
Zhihong Chen | Guiming Chen | Shizhe Diao | Xiang Wan | Benyou Wang
Findings of the Association for Computational Linguistics: ACL 2023

Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, e.g., BERT, one of the representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to studying the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders from three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones for general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for the cross-modal association, which is more similar to the senses of humans.

HuatuoGPT, Towards Taming Language Model to Be a Doctor
Hongbo Zhang | Junying Chen | Feng Jiang | Fei Yu | Zhihong Chen | Guiming Chen | Jianquan Li | Xiangbo Wu | Zhang Zhiyi | Qingying Xiao | Xiang Wan | Benyou Wang | Haizhou Li
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we present HuatuoGPT, a Large Language Model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both distilled data from **ChatGPT** and real-world data from **doctors** in the supervised fine-tuning stage. This is not only because purely using **ChatGPT**-distilled data might cause ‘model collapse’, but also because real-world data from **doctors** would be complementary to **ChatGPT**-distilled data. The responses from ChatGPT are usually detailed, well-presented, fluent, and instruction-followed, but it cannot perform like a doctor in many aspects, e.g. for interactive diagnosis. Therefore, the extra doctors’ data could tame a distilled language model to perform like doctors. To synergize the strengths of both data sources, we introduce RLMF (Reinforcement Learning from Mixed Feedback) where a reward model is trained to align the language model with the merits that both sources (ChatGPT and doctors) bring. Experimental results (in GPT-4 evaluation, human evaluation, and medical benchmark datasets) demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs. It is worth noting that by using additional real-world data and RLMF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model (i.e., ChatGPT) in most cases.

2022

Graph Enhanced Contrastive Learning for Radiology Findings Summarization
Jinpeng Hu | Zhuo Li | Zhihong Chen | Zhen Li | Xiang Wan | Tsung-Hui Chang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The impression section of a radiology report summarizes the most prominent observation from the findings section and is the most important section for radiologists to communicate to physicians. Summarizing findings is time-consuming and can be prone to error for inexperienced radiologists, and thus automatic impression generation has attracted substantial attention. With the encoder-decoder framework, most previous studies explore incorporating extra knowledge (e.g., static pre-defined clinical ontologies or extra background information). Yet, they encode such knowledge by a separate encoder to treat it as an extra input to their models, which is limited in leveraging their relations with the original findings. To address the limitation, we propose a unified framework for exploiting both extra knowledge and the original findings in an integrated way so that the critical information (i.e., key words and their relations) can be extracted in an appropriate way to facilitate impression generation. In detail, for each input findings, it is encoded by a text encoder and a graph is constructed through its entities and dependency tree. Then, a graph encoder (e.g., graph neural networks (GNNs)) is adopted to model relation information in the constructed graph. Finally, to emphasize the key words in the findings, contrastive learning is introduced to map positive samples (constructed by masking non-key words) closer and push apart negative ones (constructed by masking key words). The experimental results on two datasets, OpenI and MIMIC-CXR, confirm the effectiveness of our proposed method, where the state-of-the-art results are achieved.

A Simple yet Effective Relation Information Guided Approach for Few-Shot Relation Extraction
Yang Liu | Jinpeng Hu | Xiang Wan | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: ACL 2022

Few-Shot Relation Extraction aims at predicting the relation for a pair of entities in a sentence by training with a few labelled examples in each relation. Some recent works have introduced relation information (i.e., relation labels or descriptions) to assist model learning based on Prototype Network. However, most of them constrain the prototypes of each relation class implicitly with relation information, generally through designing complex network structures, like generating hybrid features, combining with contrastive learning or attention networks. We argue that relation information can be introduced more explicitly and effectively into the model. Thus, this paper proposes a direct addition approach to introduce relation information. Specifically, for each relation class, the relation representation is first generated by concatenating two views of relations (i.e., [CLS] token embedding and the mean value of embeddings of all tokens) and then directly added to the original prototype for both train and prediction. Experimental results on the benchmark dataset FewRel 1.0 show significant improvements and achieve comparable results to the state-of-the-art, which demonstrates the effectiveness of our proposed approach. Besides, further analyses verify that the direct addition is a much more effective way to integrate the relation representations and the original prototypes.

Learn from Relation Information: Towards Prototype Representation Rectification for Few-Shot Relation Extraction
Yang Liu | Jinpeng Hu | Xiang Wan | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: NAACL 2022

Few-shot Relation Extraction refers to fast adaptation to novel relation classes with few samples through training on the known relation classes. Most existing methods focus on implicitly introducing relation information (i.e., relation label or relation description) to constrain the prototype representation learning, such as contrastive learning, graphs, and specifically designed attentions, which may bring useless and even harmful parameters. Besides, these approaches are limited in handing outlier samples far away from the class center due to the weakly implicit constraint. In this paper, we propose an effective and parameter-less Prototype Rectification Method (PRM) to promote few-shot relation extraction, where we utilize a prototype rectification module to rectify original prototypes explicitly by the relation information. Specifically, PRM is composed of two gate mechanisms. One gate decides how much of the original prototype remains, and another one updates the remained prototype with relation information. In doing so, better and stabler global relation information can be captured for guiding prototype representations, and thus PRM can robustly deal with outliers. Moreover, we also extend PRM to both none-of-the-above (NOTA) and domain adaptation scenarios. Experimental results on FewRel 1.0 and 2.0 datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance.

A Label-Aware Autoregressive Framework for Cross-Domain NER
Jinpeng Hu | He Zhao | Dan Guo | Xiang Wan | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: NAACL 2022

Cross-domain named entity recognition (NER) aims to borrow the entity information from the source domain to help the entity recognition in the target domain with limited labeled data. Despite the promising performance of existing approaches, most of them focus on reducing the discrepancy of token representation between source and target domains, while the transfer of the valuable label information is often not explicitly considered or even ignored. Therefore, we propose a novel autoregressive framework to advance cross-domain NER by first enhancing the relationship between labels and tokens and then further improving the transferability of label information. Specifically, we associate each label with an embedding vector, and for each token, we utilize a bidirectional LSTM (Bi-LSTM) to encode the labels of its previous tokens for modeling internal context information and label dependence. Afterward, we propose a Bi-Attention module that merges the token representation from a pre-trained model and the label features from the Bi-LSTM as the label-aware information, which is concatenated to the token representation to facilitate cross-domain NER. In doing so, label information contained in the embedding vectors can be effectively transferred to the target domain, and Bi-LSTM can further model the label relationship among different domains by pre-train and then fine-tune setting. Experimental results on several datasets confirm the effectiveness of our model, where our model achieves significant improvements over the state of the arts.

Hero-Gang Neural Model For Named Entity Recognition
Jinpeng Hu | Yaling Shen | Yang Liu | Xiang Wan | Tsung-Hui Chang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Named entity recognition (NER) is a fundamental and important task in NLP, aiming at identifying named entities (NEs) from free text. Recently, since the multi-head attention mechanism applied in the Transformer model can effectively capture longer contextual information, Transformer-based models have become the mainstream methods and have achieved significant performance in this task. Unfortunately, although these models can capture effective global context information, they are still limited in the local feature and position information extraction, which is critical in NER. In this paper, to address this limitation, we propose a novel Hero-Gang Neural structure (HGN), including the Hero and Gang module, to leverage both global and local information to promote NER. Specifically, the Hero module is composed of a Transformer-based encoder to maintain the advantage of the self-attention mechanism, and the Gang module utilizes a multi-window recurrent module to extract local features and position information under the guidance of the Hero module. Afterward, the proposed multi-window attention effectively combines global information and multiple local features for predicting entity labels. Experimental results on several benchmark datasets demonstrate the effectiveness of our proposed model.

2021

Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks
Yuanhe Tian | Guimin Chen | Yan Song | Xiang Wan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Syntactic information, especially dependency trees, has been widely used by existing studies to improve relation extraction with better semantic guidance for analyzing the context information associated with the given entities. However, most existing studies suffer from the noise in the dependency trees, especially when they are automatically generated, so that intensively leveraging dependency information may introduce confusions to relation classification and necessary pruning is of great importance in this task. In this paper, we propose a dependency-driven approach for relation extraction with attentive graph convolutional networks (A-GCN). In this approach, an attention mechanism upon graph convolutional networks is applied to different contextual words in the dependency tree obtained from an off-the-shelf dependency parser, to distinguish the importance of different word dependencies. Consider that dependency types among words also contain important contextual guidance, which is potentially helpful for relation extraction, we also include the type information in A-GCN modeling. Experimental results on two English benchmark datasets demonstrate the effectiveness of our A-GCN, which outperforms previous studies and achieves state-of-the-art performance on both datasets.

Cross-modal Memory Networks for Radiology Report Generation
Zhihong Chen | Yaling Shen | Yan Song | Xiang Wan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Medical imaging plays a significant role in clinical practice of medical diagnosis, where the text reports of the images are essential in understanding them and facilitating later treatments. By generating the reports automatically, it is beneficial to help lighten the burden of radiologists and significantly promote clinical automation, which already attracts much attention in applying artificial intelligence to medical domain. Previous studies mainly follow the encoder-decoder paradigm and focus on the aspect of text generation, with few studies considering the importance of cross-modal mappings and explicitly exploit such mappings to facilitate radiology report generation. In this paper, we propose a cross-modal memory networks (CMN) to enhance the encoder-decoder framework for radiology report generation, where a shared memory is designed to record the alignment between images and texts so as to facilitate the interaction and generation across modalities. Experimental results illustrate the effectiveness of our proposed model, where state-of-the-art performance is achieved on two widely used benchmark datasets, i.e., IU X-Ray and MIMIC-CXR. Further analyses also prove that our model is able to better align information from radiology images and texts so as to help generating more accurate reports in terms of clinical indicators.

Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts
Yang Liu | Yuanhe Tian | Tsung-Hui Chang | Song Wu | Xiang Wan | Yan Song
Proceedings of the 20th Workshop on Biomedical Language Processing

Chinese word segmentation (CWS) and medical concept recognition are two fundamental tasks to process Chinese electronic medical records (EMRs) and play important roles in downstream tasks for understanding Chinese EMRs. One challenge to these tasks is the lack of medical domain datasets with high-quality annotations, especially medical-related tags that reveal the characteristics of Chinese EMRs. In this paper, we collected a Chinese EMR corpus, namely, ACEMR, with human annotations for Chinese word segmentation and EMR-related tags. On the ACEMR corpus, we run well-known models (i.e., BiLSTM, BERT, and ZEN) and existing state-of-the-art systems (e.g., WMSeg and TwASP) for CWS and medical concept recognition. Experimental results demonstrate the necessity of building a dedicated medical dataset and show that models that leverage extra resources achieve the best performance for both tasks, which provides certain guidance for future studies on model selection in the medical domain.

Relation Extraction with Type-aware Map Memories of Word Dependencies
Guimin Chen | Yuanhe Tian | Yan Song | Xiang Wan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Word Graph Guided Summarization for Radiology Findings
Jinpeng Hu | Jianling Li | Zhihong Chen | Yaling Shen | Yan Song | Xiang Wan | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Field Embedding: A Unified Grain-Based Framework for Word Representation
Junjie Luo | Xi Chen | Jichao Sun | Yuejia Xiang | Ningyu Zhang | Xiang Wan
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Word representations empowered with additional linguistic information have been widely studied and proved to outperform traditional embeddings. Current methods mainly focus on learning embeddings for words while embeddings of linguistic information (referred to as grain embeddings) are discarded after the learning. This work proposes a framework field embedding to jointly learn both word and grain embeddings by incorporating morphological, phonetic, and syntactical linguistic fields. The framework leverages an innovative fine-grained pipeline that integrates multiple linguistic fields and produces high-quality grain sequences for learning supreme word representations. A novel algorithm is also designed to learn embeddings for words and grains by capturing information that is contained within each field and that is shared across them. Experimental results of lexical tasks and downstream natural language processing tasks illustrate that our framework can learn better word embeddings and grain embeddings. Qualitative evaluations show grain embeddings effectively capture the semantic information.

2020

Named Entity Recognition for Social Media Texts with Semantic Augmentation
Yuyang Nie | Yuanhe Tian | Xiang Wan | Yan Song | Bo Dai
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts, especially user-generated social media content. Semantic augmentation is a potential way to alleviate this problem. Given that rich semantic information is implicitly preserved in pre-trained word embeddings, they are potential ideal resources for semantic augmentation. In this paper, we propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account. In particular, we obtain the augmented semantic information from a large-scale corpus, and propose an attentive semantic augmentation module and a gate module to encode and aggregate such information, respectively. Extensive experiments are performed on three benchmark datasets collected from English and Chinese social media platforms, where the results demonstrate the superiority of our approach to previous studies across all three datasets.

Generating Radiology Reports via Memory-driven Transformer
Zhihong Chen | Yan Song | Tsung-Hui Chang | Xiang Wan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Medical imaging is frequently used in clinical practice and trials for diagnosis and treatment. Writing imaging reports is time-consuming and can be error-prone for inexperienced radiologists. Therefore, automatically generating radiology reports is highly desired to lighten the workload of radiologists and accordingly promote clinical automation, which is an essential task to apply artificial intelligence to the medical domain. In this paper, we propose to generate radiology reports with memory-driven Transformer, where a relational memory is designed to record key information of the generation process and a memory-driven conditional layer normalization is applied to incorporating the memory into the decoder of Transformer. Experimental results on two prevailing radiology report datasets, IU X-Ray and MIMIC-CXR, show that our proposed approach outperforms previous models with respect to both language generation metrics and clinical evaluations. Particularly, this is the first work reporting the generation results on MIMIC-CXR to the best of our knowledge. Further analyses also demonstrate that our approach is able to generate long reports with necessary medical terms as well as meaningful image-text attention mappings.

Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information
Yuyang Nie | Yuanhe Tian | Yan Song | Xiang Ao | Xiang Wan
Findings of the Association for Computational Linguistics: EMNLP 2020

Named entity recognition (NER) is highly sensitive to sentential syntactic and semantic properties where entities may be extracted according to how they are used and placed in the running text. To model such properties, one could rely on existing resources to providing helpful knowledge to the NER task; some existing studies proved the effectiveness of doing so, and yet are limited in appropriately leveraging the knowledge such as distinguishing the important ones for particular context. In this paper, we improve NER by leveraging different types of syntactic information through attentive ensemble, which functionalizes by the proposed key-value memory networks, syntax attention, and the gate mechanism for encoding, weighting and aggregating such syntactic information, respectively. Experimental results on six English and Chinese benchmark datasets suggest the effectiveness of the proposed model and show that it outperforms previous studies on all experiment datasets.

Co-authors

Feng Jiang (蒋峰) 3

Qingying Xiao 2

Mosen Alharthi 1

Lidia S. Chao 1

Guiming Hardy Chen 1

Jean-Benoit Delbrouck 1

Steven Y. Guo 1

Curtis Langlotz 1

Derek F. Wong (黄辉) 1

Venues