Jian Wu


2024

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications
Weize Liu | Yinlong Xu | Hongxia Xu | Jintai Chen | Xuming Hu | Jian Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recently, large language models (LLMs) have achieved tremendous breakthroughs in NLP, but their internal neuron activity when processing different languages remains poorly understood. We designed a method to convert dense LLMs into fine-grained MoE architectures and then visually studied the multilingual activation patterns of LLMs through expert activation frequency heatmaps. Through comprehensive experiments across different model families, model sizes, and variants, we analyzed the similarities and differences in the internal neuron activation patterns of LLMs when processing different languages. Specifically, we investigated the distribution of frequently activated experts, multilingual shared experts, whether multilingual activation patterns are related to language families, and the impact of instruction tuning on activation patterns. We further explored leveraging the discovered differences in expert activation frequencies to guide sparse activation and pruning. Experimental results demonstrated that our method significantly outperformed random expert pruning and even exceeded the performance of unpruned models in some languages. Additionally, we found that configuring different pruning rates for different layers based on differences in activation levels achieved better results. Our findings reveal the multilingual processing mechanisms within LLMs and use these insights to offer new perspectives for applications such as sparse activation and model pruning.
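
To make the pruning idea above concrete, the following is a minimal sketch of frequency-guided expert pruning: count how often each expert fires on a language's tokens, then keep only the most frequently activated experts, with keep ratios that may differ per layer. The shapes, threshold, keep ratios, and simulated activations are illustrative assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
num_layers, num_experts = 4, 32

def activation_frequencies(batches, threshold=0.0):
    """Fraction of tokens on which each expert's activation exceeds a threshold."""
    counts = np.zeros((num_layers, num_experts))
    total_tokens = 0
    for acts in batches:                          # acts: (tokens, layers, experts)
        counts += (acts > threshold).sum(axis=0)
        total_tokens += acts.shape[0]
    return counts / total_tokens

def pruning_mask(freq, keep_ratio_per_layer):
    """Keep the most frequently activated experts; keep ratios can differ per layer."""
    mask = np.zeros_like(freq, dtype=bool)
    for layer, ratio in enumerate(keep_ratio_per_layer):
        k = max(1, int(round(ratio * freq.shape[1])))
        mask[layer, np.argsort(freq[layer])[-k:]] = True
    return mask

# Simulated per-language activations stand in for real forward passes over a corpus.
freq_en = activation_frequencies(rng.normal(size=(8, 128, num_layers, num_experts)))
mask = pruning_mask(freq_en, keep_ratio_per_layer=[0.75, 0.5, 0.5, 0.75])
print("experts kept per layer:", mask.sum(axis=1))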

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis
Jiaxiang Liu | Tianxiang Hu | Huimin Xiong | Jiawei Du | Yang Feng | Jian Wu | Joey Tianyi Zhou | Zuozhu Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

Vision-language models like CLIP, using class proxies derived from class-name text features, have shown a notable capability in zero-shot medical image diagnosis, which is vital in scenarios with limited disease databases or labeled samples. However, imprecise medical texts and the modal disparity between the text and vision spaces pose challenges for this paradigm. We show analytically and experimentally that enriching medical texts with detailed descriptions can markedly enhance diagnosis performance, with the granularity and phrasing of these enhancements having a crucial impact on CLIP’s understanding of medical images, and that learning proxies within the vision domain can effectively circumvent the modal gap. Based on this analysis, we propose a medical visual proxy learning framework comprising two key components: a text refinement module that creates high-quality medical text descriptions, and a stable Sinkhorn algorithm for efficient generation of pseudo labels, which further guide the visual proxy learning. Our method improves on vanilla CLIP inference by supplying meticulously crafted clues that leverage CLIP’s existing interpretive power and by using the features of the refined texts to bridge the vision-text gap. The effectiveness and robustness of our method are demonstrated through extensive experiments. Notably, our method outperforms state-of-the-art zero-shot medical image diagnosis by a significant margin, ranging from 1.69% to 15.31% on five datasets covering various diseases, confirming its potential for zero-shot diagnosis across diverse medical applications.
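
As an illustration of the pseudo-labeling step, below is a generic Sinkhorn normalization that turns image-to-proxy similarity scores into balanced soft pseudo labels. The epsilon, iteration count, and random features are assumptions, and this is a textbook iterative-scaling sketch rather than the paper's exact stable Sinkhorn variant.

import numpy as np

def sinkhorn_pseudo_labels(sim, n_iter=50, eps=0.05):
    """sim: (n_images, n_classes) similarities; returns soft assignments whose
    rows sum to 1 and whose columns are approximately balanced."""
    q = np.exp((sim - sim.max()) / eps)
    n, k = q.shape
    for _ in range(n_iter):
        q /= q.sum(axis=0, keepdims=True) * k     # balance classes (columns)
        q /= q.sum(axis=1, keepdims=True) * n     # normalize images (rows)
    return q * n                                  # each row becomes a distribution over classes

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(6, 512))
proxy_feats = rng.normal(size=(3, 512))           # stand-ins for refined-text class proxies
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)
proxy_feats /= np.linalg.norm(proxy_feats, axis=1, keepdims=True)

pseudo = sinkhorn_pseudo_labels(image_feats @ proxy_feats.T)
print(pseudo.round(2), pseudo.sum(axis=1))        # rows sum to 1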

Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models
Weize Liu | Guocong Li | Kai Zhang | Bang Du | Qiyuan Chen | Xuming Hu | Hongxia Xu | Jintai Chen | Jian Wu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments.
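
To illustrate the kind of distillation data this twofold recipe suggests, the sketch below assembles training records from multiple sampled CoTs plus a self-evaluation of each one. Here teacher_generate is a hypothetical stand-in for a teacher-LLM call, and the prompt wording and record format are assumptions rather than the paper's protocol.

from typing import Callable, Dict, List

def build_distillation_records(question: str,
                               teacher_generate: Callable[[str], str],
                               num_cots: int = 3) -> List[Dict[str, str]]:
    """Collect several CoTs and a self-evaluation of each, as SLM training pairs."""
    records = []
    for i in range(num_cots):
        cot = teacher_generate(
            f"Q: {question}\nLet's think step by step. (sample {i})")
        critique = teacher_generate(
            f"Question: {question}\nProposed reasoning: {cot}\n"
            "Evaluate whether this reasoning is correct and point out any flaws.")
        records.append({"input": question, "target": cot})                       # reasoning
        records.append({"input": f"{question}\n{cot}\nIs this reasoning correct?",
                        "target": critique})                                     # self-evaluation
    return records

# Toy usage with a dummy teacher so the sketch runs end to end.
dummy = lambda prompt: "stub teacher output"
print(len(build_distillation_records("What is 17 + 25?", dummy)))   # 6 records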

Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences
Sai Koneru | Jian Wu | Sarah Rajtmajer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Hypothesis formulation and testing are central to empirical research. A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. However, with the exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. Our work explores the ability of current large language models (LLMs) to discern evidence in support or refutation of specific hypotheses based on the text of scientific abstracts. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences. We compare the performance of LLMs to several state-of-the-art methods and highlight opportunities for future research in this area. Our dataset is shared with the research community: https://github.com/Sai90000/ScientificHypothesisEvidencing.git
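
A hedged sketch of how the task itself can be posed: three-way classification over (hypothesis, abstract) pairs. The call_llm callable is a hypothetical stand-in for any LLM client, and the prompt and label set are assumptions, not the paper's setup.

from typing import Callable

LABELS = ("support", "refute", "inconclusive")

def classify_evidence(hypothesis: str, abstract: str, call_llm: Callable[[str], str]) -> str:
    """Ask an LLM whether an abstract supports, refutes, or is inconclusive for a hypothesis."""
    prompt = (
        "Hypothesis: " + hypothesis + "\n"
        "Abstract: " + abstract + "\n"
        "Does the abstract support, refute, or provide inconclusive evidence "
        "for the hypothesis? Answer with one word."
    )
    reply = call_llm(prompt).strip().lower()
    return next((lab for lab in LABELS if lab in reply), "inconclusive")

# Toy usage with a dummy model so the sketch runs.
print(classify_evidence("Social media use reduces well-being.",
                        "We find a small negative association between daily use and mood.",
                        lambda p: "Support"))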

2023

TACR: A Table Alignment-based Cell Selection Method for HybridQA
Jian Wu | Yicheng Xu | Yan Gao | Jian-Guang Lou | Börje Karlsson | Manabu Okumura
Findings of the Association for Computational Linguistics: ACL 2023

Hybrid Question-Answering (HQA), which targets reasoning over tables and passages linked from table cells, has witnessed significant research in recent years. A common challenge in HQA and other passage-table QA datasets is that it is generally unrealistic to iterate over all table rows, columns, and linked passages to retrieve evidence. This challenge has made it difficult for previous studies to demonstrate their reasoning ability in retrieving answers. To bridge this gap, we propose a novel Table-alignment-based Cell-selection and Reasoning model (TACR) for hybrid text and table QA, evaluated on the HybridQA and WikiTableQuestions datasets. In evidence retrieval, we design a table-question-alignment enhanced cell-selection method to retrieve fine-grained evidence. In answer reasoning, we incorporate a QA module that treats the row containing the selected cells as context. Experimental results on the HybridQA and WikiTableQuestions (WTQ) datasets show that TACR achieves state-of-the-art results on cell selection and outperforms fine-grained evidence retrieval baselines on HybridQA, while achieving competitive performance on WTQ. A detailed analysis further demonstrates that aligning questions to tables in the cell-selection stage yields important gains, with over 90% table row and column selection accuracy, while also improving output explainability.
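
A toy illustration of the two-stage idea, aligning the question to a row and then to a column and handing the selected row to a reader as context, is sketched below. The overlap scoring and example table are assumptions for illustration; TACR's alignment is learned, not lexical.

def select_cell(question, headers, rows):
    """Pick (row, column) by simple token overlap, then return the row as QA context."""
    q = set(question.lower().split())
    overlap = lambda tokens: len(q & set(tokens))
    # 1) align the question to the best-matching row
    row_idx = max(range(len(rows)), key=lambda i: overlap(" ".join(rows[i]).lower().split()))
    # 2) within that row, pick the column whose header best matches the question
    col_idx = max(range(len(headers)), key=lambda j: overlap(headers[j].lower().split()))
    context = " | ".join(f"{h}: {c}" for h, c in zip(headers, rows[row_idx]))
    return rows[row_idx][col_idx], context

headers = ["Player", "Team", "Year"]
rows = [["Jordan", "Bulls", "1996"], ["Bryant", "Lakers", "2009"]]
print(select_cell("Which team did Bryant play for?", headers, rows))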

Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification
Jiahuan Yan | Haojun Gao | Zhang Kai | Weize Liu | Danny Chen | Jian Wu | Jintai Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Deep learning approaches exhibit promising performance on various text tasks. However, they still struggle with medical text classification, where samples are often extremely imbalanced and scarce. Different from existing mainstream approaches that focus on supplementary semantics from external medical information, this paper rethinks the data challenges in medical texts and presents a novel framework-agnostic algorithm called Text2Tree that only utilizes the internal label hierarchy when training deep learning models. We embed the ICD code tree structure of labels into cascade attention modules to learn hierarchy-aware label representations. Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are devised to boost text classification by reusing and distinguishing samples of other labels following the label representation hierarchy, respectively. Experiments on authoritative public datasets and real-world medical records show that our approach consistently achieves superior performance over classical and advanced imbalanced classification methods. Our code is available at https://github.com/jyansir/Text2Tree.
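
As a hedged sketch of the hierarchy signal involved, the snippet below derives a label-to-label similarity matrix from ICD-like codes via shared prefix depth, the kind of structure a scheme such as Similarity Surrogate Learning could reuse to weight samples of related labels. The codes and the similarity definition are illustrative assumptions, not the paper's.

import numpy as np

codes = ["J10.0", "J10.1", "J45.0", "E11.9"]   # toy ICD-style labels

def shared_depth(a: str, b: str) -> int:
    """Depth of the deepest shared ancestor: chapter letter, category, subcategory."""
    levels_a, levels_b = (a[0], a[:3], a), (b[0], b[:3], b)
    return sum(x == y for x, y in zip(levels_a, levels_b))

sim = np.array([[shared_depth(a, b) / 3 for b in codes] for a in codes])
print(np.round(sim, 2))
# A surrogate target for a sample with true label i could then mix label
# representations in proportion to sim[i], instead of using a one-hot target.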

2022

DialMed: A Dataset for Dialogue-based Medication Recommendation
Zhenfeng He | Yuqiang Han | Zhenqiu Ouyang | Wei Gao | Hongxu Chen | Guandong Xu | Jian Wu
Proceedings of the 29th International Conference on Computational Linguistics

Medication recommendation is a crucial task for intelligent healthcare systems. Previous studies mainly recommend medications from electronic health records (EHRs). However, some details of the interactions between doctors and patients may be ignored or omitted in EHRs, and these details are essential for automatic medication recommendation. Therefore, we make the first attempt to recommend medications from the conversations between doctors and patients. In this work, we construct DIALMED, the first high-quality dataset for the medical dialogue-based medication recommendation task. It contains 11,996 medical dialogues related to 16 common diseases from 3 departments and 70 corresponding common medications. Furthermore, we propose a Dialogue structure and Disease knowledge aware Network (DDN), in which a QA Dialogue Graph mechanism models the dialogue structure and a knowledge graph introduces external disease knowledge. Extensive experimental results demonstrate that the proposed method is a promising solution for recommending medications from medical dialogues. The dataset and code are available at https://github.com/f-window/DialMed.
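
An illustrative sketch of one way such a QA-style dialogue graph could be built: link consecutive turns, and additionally link each doctor question to the patient turn that answers it. The edge rules and data format are assumptions, not DDN's actual graph construction.

turns = [
    ("doctor", "How long have you had the cough?"),
    ("patient", "About two weeks, and it is worse at night."),
    ("doctor", "Any fever or chest pain?"),
    ("patient", "No fever, just some tightness."),
]

edges = []
for i, (speaker, text) in enumerate(turns):
    if i + 1 < len(turns):
        edges.append((i, i + 1, "next-turn"))            # sequential dialogue structure
        if speaker == "doctor" and text.endswith("?"):
            edges.append((i, i + 1, "question-answer"))  # QA structure for graph message passing
print(edges)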

2021

Extractive Research Slide Generation Using Windowed Labeling Ranking
Athar Sefid | Prasenjit Mitra | Jian Wu | C Lee Giles
Proceedings of the Second Workshop on Scholarly Document Processing

Presentation slides generated from original research papers provide an efficient way to present research innovations, but generating them manually is labor-intensive. We propose a method that automatically generates slides for scientific articles, based on a corpus of 5,000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences by semantic similarity over the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods, including SummaRuNNer, by a significant margin in terms of ROUGE score.
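
A rough sketch of window-based sentence scoring follows: each sentence is compared only against its neighbours inside a fixed window, combining a bag-of-words cosine for importance with a novelty term against earlier sentences. The feature choices and weights are illustrative assumptions, not the paper's trained SummaRuNNer variant.

from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def windowed_scores(sentences, window=2, alpha=0.5):
    """Score each sentence by local importance plus novelty w.r.t. preceding sentences."""
    bows = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for i, bow in enumerate(bows):
        neigh = bows[max(0, i - window):i] + bows[i + 1:i + 1 + window]
        importance = sum(cosine(bow, n) for n in neigh) / (len(neigh) or 1)
        novelty = 1.0 - max((cosine(bow, n) for n in bows[:i]), default=0.0)
        scores.append(alpha * importance + (1 - alpha) * novelty)
    return scores

sents = ["We propose a new slide generation model.",
         "The model labels sentences inside a sliding window.",
         "Experiments show gains over strong baselines."]
print([round(s, 2) for s in windowed_scores(sents)])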

2020

Acknowledgement Entity Recognition in CORD-19 Papers
Jian Wu | Pei Wang | Xin Wei | Sarah Rajtmajer | C. Lee Giles | Christopher Griffin
Proceedings of the First Workshop on Scholarly Document Processing

Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and merely named entities by analyzing sentence structure. We develop an acknowledgement extraction system, AckExtract, based on open-source text mining software and evaluate it using manually labeled data. AckExtract takes the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F_1=0.92. We built a supplementary database by linking CORD-19 papers with the acknowledgement entities (persons and organizations) extracted by AckExtract and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract.
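
A toy sketch of the distinction the paper draws, namely that not every named entity in an acknowledgement section is actually acknowledged, is given below. A simple cue-word rule stands in for AckExtract's sentence-structure analysis; the cue list and the regex-based entity spotting are assumptions for illustration only.

import re

CUES = re.compile(r"\b(thank|thanks|acknowledge|grateful|supported by|funded by)\b", re.I)
CANDIDATE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+|NSF|NIH)\b")

def acknowledged_entities(sentences):
    """Split candidate entities into acknowledged vs. merely named, sentence by sentence."""
    acknowledged, merely_named = set(), set()
    for sent in sentences:
        names = set(CANDIDATE.findall(sent))
        (acknowledged if CUES.search(sent) else merely_named).update(names)
    return acknowledged, merely_named - acknowledged

sents = ["We thank Jane Doe for helpful discussions.",
         "This extends the dataset of John Smith released in 2019."]
print(acknowledged_entities(sents))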

SmartCiteCon: Implicit Citation Context Extraction from Academic Literature Using Supervised Learning
Chenrui Guo | Haoran Cui | Li Zhang | Jiamin Wang | Wei Lu | Jian Wu
Proceedings of the 8th International Workshop on Mining Scientific Publications

We introduce SmartCiteCon (SCC), a Java API for extracting both explicit and implicit citation context from academic literature in English. The tool is built on a Support Vector Machine (SVM) model trained on a set of 7,058 manually annotated citation context sentences curated from 34,000 papers in the ACL Anthology. The model, with 19 features, achieves F1=85.6%. SCC supports PDF, XML, and JSON files out of the box, provided they conform to certain schemas. The API supports single-document processing and parallel batch processing. It takes about 12–45 seconds on average, depending on the format, to process a document on a dedicated server with 6 multithreaded cores. Using SCC, we extracted 11.8 million citation context sentences from ~33.3k PMC papers in the CORD-19 dataset released on June 13, 2020. We will continue to contribute supplementary data to CORD-19 and other datasets. The source code is released at https://gitee.com/irlab/SmartCiteCon.
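
A minimal sketch of the underlying setup, an SVM deciding whether a sentence near an explicit citation is implicit citation context, is shown below. The toy sentences and TF-IDF features stand in for the 19 hand-crafted features and 7,058 annotated sentences described above; requires scikit-learn and is not the tool's Java implementation.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = [
    "Their approach improves BLEU by two points.",          # continues discussing the cited work
    "We now describe our own model architecture.",          # not citation context
    "This method was later extended to multilingual data.",
    "Table 2 lists the hyperparameters of our system.",
]
labels = [1, 0, 1, 0]   # 1 = implicit citation context

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["The technique was further refined in follow-up work."]))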

2015

Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation
Minghua Nuo | Huidan Liu | Congjun Long | Jian Wu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Zipf’s Law and Statistical Data on Modern Tibetan
Huidan Liu | Minghua Nuo | Jian Wu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2012

Building Large Scale Text Corpus for Tibetan Natural Language Processing by Extracting Text from Web Pages
Huidan Liu | Minghua Nuo | Jian Wu | Yeping He
Proceedings of the 10th Workshop on Asian Language Resources

Tibetan Base Noun Phrase Identification Framework Based on Chinese-Tibetan Sentence Aligned Corpus
Ming Hua Nuo | Hui Dan Liu | Wei Na Zhao | Long Long Ma | Jian Wu | Zhi Ming Ding
Proceedings of COLING 2012

2011

Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field
Huidan Liu | Minghua Nuo | Longlong Ma | Jian Wu | Yeping He
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

Compression Methods by Code Mapping and Code Dividing for Chinese Dictionary Stored in a Double-Array Trie
Huidan Liu | Minghua Nuo | Longlong Ma | Jian Wu | Yeping He
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation
Huidan Liu | Weina Zhao | Minghua Nuo | Li Jiang | Jian Wu | Yeping He
Coling 2010: Posters