Xuan Wang


pdf bib
Text Augmented Open Knowledge Graph Completion via Pre-Trained Language Models
Pengcheng Jiang | Shivam Agarwal | Bowen Jin | Xuan Wang | Jimeng Sun | Jiawei Han
Findings of the Association for Computational Linguistics: ACL 2023

The mission of open knowledge graph (KG) completion is to draw new findings from known facts. Existing works that augment KG completion require either (1) factual triples to enlarge the graph reasoning space or (2) manually designed prompts to extract knowledge from a pre-trained language model (PLM), exhibiting limited performance and requiring expensive efforts from experts. To this end, we propose TagReal that automatically generates quality query prompts and retrieves support information from large text corpora to probe knowledge from PLM for KG completion. The results show that TagReal achieves state-of-the-art performance on two benchmark datasets. We find that TagReal has superb performance even with limited training data, outperforming existing embedding-based, graph-based, and PLM-based methods.

pdf bib
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision
Ming Zhong | Siru Ouyang | Minhao Jiang | Vivian Hu | Yizhu Jiao | Xuan Wang | Jiawei Han
Findings of the Association for Computational Linguistics: ACL 2023

Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. Consequently, the scarcity of sufficient training data poses an obstacle to the progress of related models in this domain. In this paper, we propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that ReactIE achieves substantial improvements and outperforms all existing baselines.

pdf bib
MEGClass: Extremely Weakly Supervised Text Classification via Mutually-Enhancing Text Granularities
Priyanka Kargupta | Tanay Komarlu | Susik Yoon | Xuan Wang | Jiawei Han
Findings of the Association for Computational Linguistics: EMNLP 2023

Text classification is essential for organizing unstructured text. Traditional methods rely on human annotations or, more recently, a set of class seed words for supervision, which can be costly, particularly for specialized or emerging domains. To address this, using class surface names alone as extremely weak supervision has been proposed. However, existing approaches treat different levels of text granularity (documents, sentences, or words) independently, disregarding inter-granularity class disagreements and the context identifiable exclusively through joint extraction. In order to tackle these issues, we introduce MEGClass, an extremely weakly-supervised text classification method that leverages Mutually-Enhancing Text Granularities. MEGClass utilizes coarse- and fine-grained context signals obtained by jointly considering a document’s most class-indicative words and sentences. This approach enables the learning of a contextualized document representation that captures the most discriminative class indicators. By preserving the heterogeneity of potential classes, MEGClass can select the most informative class-indicative documents as iterative feedback to enhance the initial word-based class representations and ultimately fine-tune a pre-trained text classifier. Extensive experiments on seven benchmark datasets demonstrate that MEGClass outperforms other weakly and extremely weakly supervised methods.

pdf bib
DRGCoder: Explainable Clinical Coding for the Early Prediction of Diagnostic-Related Groups
Daniel Hajialigol | Derek Kaknes | Tanner Barbour | Daphne Yao | Chris North | Jimeng Sun | David Liem | Xuan Wang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Medical claim coding is the process of transforming medical records, usually presented as free texts written by clinicians, or discharge summaries, into structured codes in a classification system such as ICD-10 (International Classification of Diseases, Tenth Revision) or DRG (Diagnosis-Related Group) codes. This process is essential for medical billing and transitional care; however, manual coding is time-consuming, error-prone, and expensive. To solve these issues, we propose DRGCoder, an explainability-enhanced clinical claim coding system for the early prediction of medical severity DRGs (MS-DRGs), a classification system that categorizes patients’ hospital stays into various DRG groups based on the severity of illness and mortality risk. The DRGCoder framework introduces a novel multi-task Transformer model for MS-DRG prediction, modeling both the DRG labels of the discharge summaries and the important, or salient words within he discharge summaries. We allow users to inspect DRGCoder’s reasoning by visualizing the weights for each word of the input. Additionally, DRGCoder allows users to identify diseases within discharge summaries and compare across multiple discharge summaries. Our demo is available at https://huggingface.co/spaces/danielhajialigol/DRGCoder. A video demonstrating the demo can be found at https://www.youtube.com/watch?v=pcdiG6VwqlA

pdf bib
Reaction Miner: An Integrated System for Chemical Reaction Extraction from Textual Data
Ming Zhong | Siru Ouyang | Yizhu Jiao | Priyanka Kargupta | Leo Luo | Yanzhen Shen | Bobby Zhou | Xianrui Zhong | Xuan Liu | Hongxiang Li | Jinfeng Xiao | Minhao Jiang | Vivian Hu | Xuan Wang | Heng Ji | Martin Burke | Huimin Zhao | Jiawei Han
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Chemical reactions, as a core entity in the realm of chemistry, hold crucial implications in diverse areas ranging from hands-on laboratory research to advanced computational drug design. Despite a burgeoning interest in employing NLP techniques to extract these reactions, aligning this task with the real-world requirements of chemistry practitioners remains an ongoing challenge. In this paper, we present Reaction Miner, a system specifically designed to interact with raw scientific literature, delivering precise and more informative chemical reactions. Going beyond mere extraction, Reaction Miner integrates a holistic workflow: it accepts PDF files as input, bypassing the need for pre-processing and bolstering user accessibility. Subsequently, a text segmentation module ensures that the refined text encapsulates complete chemical reactions, augmenting the accuracy of extraction. Moreover, Reaction Miner broadens the scope of existing pre-defined reaction roles, including vital attributes previously neglected, thereby offering a more comprehensive depiction of chemical reactions. Evaluations conducted by chemistry domain users highlight the efficacy of each module in our system, demonstrating Reaction Miner as a powerful tool in this field.


pdf bib
Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds
Yu Zhang | Yu Meng | Xuan Wang | Sheng Wang | Jiawei Han
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Discovering latent topics from text corpora has been studied for decades. Many existing topic models adopt a fully unsupervised setting, and their discovered topics may not cater to users’ particular interests due to their inability of leveraging user guidance. Although there exist seed-guided topic discovery approaches that leverage user-provided seeds to discover topic-representative terms, they are less concerned with two factors: (1) the existence of out-of-vocabulary seeds and (2) the power of pre-trained language models (PLMs). In this paper, we generalize the task of seed-guided topic discovery to allow out-of-vocabulary seeds. We propose a novel framework, named SeeTopic, wherein the general knowledge of PLMs and the local semantics learned from the input corpus can mutually benefit each other. Experiments on three real datasets from different domains demonstrate the effectiveness of SeeTopic in terms of topic coherence, accuracy, and diversity.


pdf bib
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
Qingyun Wang | Manling Li | Xuan Wang | Nikolaus Parulian | Guangxing Han | Jiawei Ma | Jingxuan Tu | Ying Lin | Ranran Haoran Zhang | Weili Liu | Aabhas Chauhan | Yingjun Guan | Bangzheng Li | Ruisong Li | Xiangchen Song | Yi Fung | Heng Ji | Jiawei Han | Shih-Fu Chang | James Pustejovsky | Jasmine Rah | David Liem | Ahmed ELsayed | Martha Palmer | Clare Voss | Cynthia Schneider | Boyan Onyshkevych
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, reports.

pdf bib
Noise Robust Named Entity Understanding for Voice Assistants
Deepak Muralidharan | Joel Ruben Antony Moniz | Sida Gao | Xiao Yang | Justine Kao | Stephen Pulman | Atish Kothari | Ray Shen | Yinying Pan | Vivek Kaul | Mubarak Seyed Ibrahim | Gang Xiang | Nan Dun | Yidan Zhou | Andy O | Yuan Zhang | Pooja Chitkara | Xuan Wang | Alkesh Patel | Kushal Tayal | Roger Zheng | Peter Grasch | Jason D Williams | Lin Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.

pdf bib
ChemNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided Distant Supervision
Xuan Wang | Vivian Hu | Xiangchen Song | Shweta Garg | Jinfeng Xiao | Jiawei Han
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the knowledge bases (KBs). However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. It significantly improves the distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled, chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, outperforming substantially the state-of-the-art NER methods (with .25 absolute F1 score improvement).

pdf bib
Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training
Yu Meng | Yunyi Zhang | Jiaxin Huang | Xuan Wang | Yu Zhang | Heng Ji | Jiawei Han
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.


pdf bib
EVIDENCEMINER: Textual Evidence Discovery for Life Sciences
Xuan Wang | Yingjun Guan | Weili Liu | Aabhas Chauhan | Enyi Jiang | Qi Li | David Liem | Dibakar Sigdel | John Caufield | Peipei Ping | Jiawei Han
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Traditional search engines for life sciences (e.g., PubMed) are designed for document retrieval and do not allow direct retrieval of specific statements. Some of these statements may serve as textual evidence that is key to tasks such as hypothesis generation and new finding validation. We present EVIDENCEMINER, a web-based system that lets users query a natural language statement and automatically retrieves textual evidence from a background corpora for life sciences. EVIDENCEMINER is constructed in a completely automated way without any human effort for training data annotation. It is supported by novel data-driven methods for distantly supervised named entity recognition and open information extraction. The entities and patterns are pre-computed and indexed offline to support fast online evidence retrieval. The annotation results are also highlighted in the original document for better visualization. EVIDENCEMINER also includes analytic functionalities such as the most frequent entity and relation summarization. EVIDENCEMINER can help scientists uncover important research issues, leading to more effective research and more in-depth quantitative analysis. The system of EVIDENCEMINER is available at https://evidenceminer.firebaseapp.com/.


pdf bib
Variational Autoregressive Decoder for Neural Response Generation
Jiachen Du | Wenjie Li | Yulan He | Ruifeng Xu | Lidong Bing | Xuan Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Combining the virtues of probability graphic models and neural networks, Conditional Variational Auto-encoder (CVAE) has shown promising performance in applications such as response generation. However, existing CVAE-based models often generate responses from a single latent variable which may not be sufficient to model high variability in responses. To solve this problem, we propose a novel model that sequentially introduces a series of latent variables to condition the generation of each word in the response sequence. In addition, the approximate posteriors of these latent variables are augmented with a backward Recurrent Neural Network (RNN), which allows the latent variables to capture long-term dependencies of future tokens in generation. To facilitate training, we supplement our model with an auxiliary objective that predicts the subsequent bag of words. Empirical experiments conducted on Opensubtitle and Reddit datasets show that the proposed model leads to significant improvement on both relevance and diversity over state-of-the-art baselines.


pdf bib
XJNLP at SemEval-2017 Task 12: Clinical temporal information ex-traction with a Hybrid Model
Yu Long | Zhijing Li | Xuan Wang | Chen Li
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Temporality is crucial in understanding the course of clinical events from a patient’s electronic health recordsand temporal processing is becoming more and more important for improving access to content. SemEval 2017 Task 12 (Clinical TempEval) addressed this challenge using the THYME corpus, a corpus of clinical narratives annotated with a schema based on TimeML2 guidelines. We developed and evaluated approaches for: extraction of temporal expressions (TIMEX3) and EVENTs; EVENT attributes; document-time relations. Our approach is a hybrid model which is based on rule based methods, semi-supervised learning, and semantic features with addition of manually crafted rules.

pdf bib
Life-iNet: A Structured Network-Based Knowledge Exploration and Analytics System for Life Sciences
Xiang Ren | Jiaming Shen | Meng Qu | Xuan Wang | Zeqiu Wu | Qi Zhu | Meng Jiang | Fangbo Tao | Saurabh Sinha | David Liem | Peipei Ping | Richard Weinshilboum | Jiawei Han
Proceedings of ACL 2017, System Demonstrations


pdf bib
Improving Distributed Representation of Word Sense via WordNet Gloss Composition and Context Clustering
Tao Chen | Ruifeng Xu | Yulan He | Xuan Wang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)


pdf bib
Simple Maximum Entropy Models for Multilingual Coreference Resolution
Xinxin Li | Xuan Wang | Xingwei Liao
Joint Conference on EMNLP and CoNLL - Shared Task

pdf bib
A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language
Sajjad Ahmad Khan | Waqas Anwar | Usama Ijaz Bajwa | Xuan Wang
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language
Faryal Jahangir | Waqas Anwar | Usama Ijaz Bajwa | Xuan Wang
Proceedings of the 10th Workshop on Asian Language Resources

pdf bib
Zhijun Wu: Chinese Semantic Dependency Parsing with Third-Order Features
Zhijun Wu | Xuan Wang | Xinxin Li
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)


pdf bib
Coreference Resolution with Loose Transitivity Constraints
Xinxin Li | Xuan Wang | Shuhan Qi
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Diversifying Information Needs in Results of Question Retrieval
Yaoyun Zhang | Xiaolong Wang | Xuan Wang | Ruifeng Xu | Jun Xu | ShiXi Fan
Proceedings of 5th International Joint Conference on Natural Language Processing


pdf bib
A Cascade Method for Detecting Hedges and their Scope in Natural Language Text
Buzhou Tang | Xiaolong Wang | Xuan Wang | Bo Yuan | Shixi Fan
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

pdf bib
Exploiting Rich Features for Detecting Hedges and their Scope
Xinxin Li | Jianping Shen | Xiang Gao | Xuan Wang
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

pdf bib
Chinese Word Segmentation based on Mixing Multiple Preprocessor and CRF
Jianping Shen | Xuan Wang | Hainan Zhao | Wenxiao Zhang
CIPS-SIGHAN Joint Conference on Chinese Language Processing


pdf bib
A Joint Syntactic and Semantic Dependency Parsing System based on Maximum Entropy Models
Buzhou Tang | Lu Li | Xinxin Li | Xuan Wang | Xiaolong Wang
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task


pdf bib
Chunking with Max-Margin Markov Networks
Buzhou Tang | Xuan Wang | Xiaolong Wang
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Semantic Chunk Annotation for complex questions using Conditional Random Field
Shixi Fan | Yaoyun Zhang | Wing W. Y. Ng | Xuan Wang | Xiaolong Wang
Coling 2008: Proceedings of the workshop on Knowledge and Reasoning for Answering Questions

pdf bib
Discriminative Learning of Syntactic and Semantic Dependencies
Lu Li | Shixi Fan | Xuan Wang | Xiaolong Wang
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning