Zhen Wang


2024

pdf bib
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
Somanshu Singla | Zhen Wang | Tianyang Liu | Abdullah Ashfaq | Zhiting Hu | Eric P. Xing
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Aligning Large Language Models (LLMs) traditionally relies on complex and costly training processes like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). To address the challenge of achieving alignment without these extensive tuning costs and expensive annotations, we present a novel, tuning-free approach for self-alignment called Dynamic Rewarding with Prompt Optimization (DRPO). Our approach enables self-alignment through a search-based prompt optimization framework, allowing the model to self-improve and generate optimized prompts without additional training or human supervision. The core of DRPO leverages a dynamic rewarding mechanism to identify and rectify model-specific alignment weaknesses, enabling LLMs to adapt quickly to various alignment challenges. Empirical evaluations on eight recent LLMs, including both open- and closed-source, reveal that DRPO significantly enhances alignment performance, enabling base models to outperform their SFT/RLHF-tuned counterparts. Moreover, DRPO’s automatically optimized prompts surpass those curated by human experts, demonstrating its superior alignment capabilities. Our findings envision a highly cost-effective and adaptable solution for future alignment research to be further explored.

pdf bib
Active Learning for Abstractive Text Summarization via LLM-Determined Curriculum and Certainty Gain Maximization
Dongyuan Li | Ying Zhang | Zhen Wang | Shiyin Tan | Satoshi Kosugi | Manabu Okumura
Findings of the Association for Computational Linguistics: EMNLP 2024

For abstractive text summarization, laborious data annotation and time-consuming model training become two high walls, hindering its further progress. Active Learning, selecting a few informative instances for annotation and model training, sheds light on solving these issues. However, only few active learning-based studies focus on abstractive text summarization and suffer from low stability, effectiveness, and efficiency. To solve the problems, we propose a novel LLM-determined curriculum active learning framework. Firstly, we design a prompt to ask large language models to rate the difficulty of instances, which guides the model to train on from easier to harder instances. Secondly, we design a novel active learning strategy, i.e., Certainty Gain Maximization, enabling to select instances whose distribution aligns well with the overall distribution. Experiments show our method can improve stability, effectiveness, and efficiency of abstractive text summarization backbones.

pdf bib
ControversialQA: Exploring Controversy in Question Answering
Zhen Wang | Peide Zhu | Jie Yang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Controversy is widespread online. Previous studies mainly define controversy based on vague assumptions of its relation to sentiment such as hate speech and offensive words. This paper introduces the first question-answering dataset that defines content controversy by user perception, i.e., votes from plenty of users. It contains nearly 10K questions, and each question has a best answer and a most controversial answer. Experimental results reveal that controversy detection in question answering is essential and challenging, and there is no strong correlation between controversy and sentiment tasks. We also show that controversial answers and most acceptable answers cannot be distinguished by retrieval-based QA models, which may cause controversy issues. With these insights, we believe ControversialQA can inspire future research on controversy in QA systems.

2023

pdf bib
ThinkSum: Probabilistic reasoning over sets using large language models
Batu Ozturkler | Nikolay Malkin | Zhen Wang | Nebojsa Jojic
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the more advanced LLMs fail in scenarios that require reasoning over multiple objects or facts and making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner. In the first stage (Think – retrieval of associations), a LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (Sum – probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks, achieving improvements over the state of the art using GPT-family models on thirteen difficult tasks, often with far smaller model variants. We also compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. Our results suggest that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs. Overall, our proposed paradigm represents a promising approach for enhancing the reasoning capabilities of LLMs.

pdf bib
Entity Tracking via Effective Use of Multi-Task Learning Model and Mention-guided Decoding
Janvijay Singh | Fan Bai | Zhen Wang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Cross-task knowledge transfer via multi-task learning has recently made remarkable progress in general NLP tasks. However, entity tracking on the procedural text has not benefited from such knowledge transfer because of its distinct formulation, i.e., tracking the event flow while following structural constraints. State-of-the-art entity tracking approaches either design complicated model architectures or rely on task-specific pre-training to achieve good results. To this end, we propose MeeT, a Multi-task learning-enabled entity Tracking approach, which utilizes knowledge gained from general domain tasks to improve entity tracking. Specifically, MeeT first fine-tunes T5, a pre-trained multi-task learning model, with entity tracking-specialized QA formats, and then employs our customized decoding strategy to satisfy the structural constraints. MeeT achieves state-of-the-art performances on two popular entity tracking datasets, even though it does not require any task-specific architecture design or pre-training.

pdf bib
Reasoning with Language Model is Planning with World Model
Shibo Hao | Yi Gu | Haodi Ma | Joshua Hong | Zhen Wang | Daisy Wang | Zhiting Hu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have shown remarkable reasoning capabilities, particularly with Chain-of-Thought-style prompts. However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks or performing complex math or logical reasoning. This is due to LLMs’ absence of an internal world model for predicting world states (e.g., environment status, variable values) and simulating long-term action outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, Reasoning via Planning (RAP). RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, properly balancing exploration v.s. exploitation to achieve a high-reward reasoning path efficiently. We apply RAP to a variety of challenging reasoning problems, such as plan generation, math reasoning, and logical inference. Empirical results demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency, e.g., RAP on LLaMA-33B surpasses CoT on GPT-4 with 33% relative improvement in plan generation.

pdf bib
Roll Up Your Sleeves: Working with a Collaborative and Engaging Task-Oriented Dialogue System
Lingbo Mo | Shijie Chen | Ziru Chen | Xiang Deng | Ashley Lewis | Sunit Singh | Samuel Stevens | Chang-You Tai | Zhen Wang | Xiang Yue | Tianshu Zhang | Yu Su | Huan Sun
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We introduce TacoBot, a user-centered task-oriented digital assistant designed to guide users through complex real-world tasks with multiple steps. Covering a wide range of cooking and how-to tasks, we aim to deliver a collaborative and engaging dialogue experience. Equipped with language understanding, dialogue management, and response generation components supported by a robust search engine, TacoBot ensures efficient task assistance. To enhance the dialogue experience, we explore a series of data augmentation strategies using LLMs to train advanced neural models continuously. TacoBot builds upon our successful participation in the inaugural Alexa Prize TaskBot Challenge, where our team secured third place among ten competing teams. We offer TacoBot as an open-source framework that serves as a practical example for deploying task-oriented dialogue systems.

2022

pdf bib
Coherence boosting: When your pretrained language model is not paying enough attention
Nikolay Malkin | Zhen Wang | Nebojsa Jojic
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Long-range semantic coherence remains a challenge in automatic language generation and understanding. We demonstrate that large language models have insufficiently learned the effect of distant words on next-token prediction. We present coherence boosting, an inference procedure that increases a LM’s focus on a long context. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. It is also found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.

pdf bib
D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat
Binwei Yao | Chao Shi | Likai Zou | Lingfeng Dai | Mengyue Wu | Lu Chen | Zhen Wang | Kai Yu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In a depression-diagnosis-directed clinical session, doctors initiate a conversation with ample emotional support that guides the patients to expose their symptoms based on clinical diagnosis criteria. Such a dialogue system is distinguished from existing single-purpose human-machine dialog systems, as it combines task-oriented and chit-chats with uniqueness in dialogue topics and procedures. However, due to the social stigma associated with mental illness, the dialogue data related to depression consultation and diagnosis are rarely disclosed. Based on clinical depression diagnostic criteria ICD-11 and DSM-5, we designed a 3-phase procedure to construct D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat, which simulates the dialogue between doctors and patients during the diagnosis of depression, including diagnosis results and symptom summary given by professional psychiatrists for each conversation. Upon the newly-constructed dataset, four tasks mirroring the depression diagnosis process are established: response generation, topic prediction, dialog summary, and severity classification of depressive episode and suicide risk. Multi-scale evaluation results demonstrate that a more empathy-driven and diagnostic-accurate consultation dialogue system trained on our dataset can be achieved compared to rule-based bots.

pdf bib
ASCM: An Answer Space Clustered Prompting Method without Answer Engineering
Zhen Wang | Yating Yang | Zhou Xi | Bo Ma | Lei Wang | Rui Dong | Azmat Anwar
Findings of the Association for Computational Linguistics: ACL 2022

Prompt-based learning, which exploits knowledge from pre-trained language models by providing textual prompts and designing appropriate answer-category mapping methods, has achieved impressive successes on few-shot text classification and natural language inference (NLI). Because of the diverse linguistic expression, there exist many answer tokens for the same category. However, both manual answer design and automatic answer search constrain answer space and therefore hardly achieve ideal performance. To address this issue, we propose an answer space clustered prompting model (ASCM) together with a synonym initialization method (SI) which automatically categorizes all answer tokens in a semantic-clustered embedding space. We also propose a stable semi-supervised method named stair learning (SL) that orderly distills knowledge from better models to weaker models. Extensive experiments demonstrate that our ASCM+SL significantly outperforms existing state-of-the-art techniques in few-shot settings.

pdf bib
Answer Quality Aware Aggregation for Extractive QA Crowdsourcing
Peide Zhu | Zhen Wang | Claudia Hauff | Jie Yang | Avishek Anand
Findings of the Association for Computational Linguistics: EMNLP 2022

Quality control is essential for creating extractive question answering (EQA) datasets via crowdsourcing. Aggregation across answers, i.e. word spans within passages annotated, by different crowd workers is one major focus for ensuring its quality. However, crowd workers cannot reach a consensus on a considerable portion of questions. We introduce a simple yet effective answer aggregation method that takes into account the relations among the answer, question, and context passage. We evaluate answer quality from both the view of question answering model to determine how confident the QA model is about each answer and the view of the answer verification model to determine whether the answer is correct. Then we compute aggregation scores with each answer’s quality and its contextual embedding produced by pre-trained language models. The experiments on a large real crowdsourced EQA dataset show that our framework outperforms baselines by around 16% on precision and effectively conduct answer aggregation for extractive QA task.

pdf bib
Search to Pass Messages for Temporal Knowledge Graph Completion
Zhen Wang | Haotong Du | Quanming Yao | Xuelong Li
Findings of the Association for Computational Linguistics: EMNLP 2022

Completing missing facts is a fundamental task for temporal knowledge graphs (TKGs).Recently, graph neural network (GNN) based methods, which can simultaneously explore topological and temporal information, have become the state-of-the-art (SOTA) to complete TKGs. However, these studies are based on hand-designed architectures and fail to explore the diverse topological and temporal properties of TKG.To address this issue, we propose to use neural architecture search (NAS) to design data-specific message passing architecture for TKG completion.In particular, we develop a generalized framework to explore topological and temporal information in TKGs.Based on this framework, we design an expressive search space to fully capture various properties of different TKGs. Meanwhile, we adopt a search algorithm, which trains a supernet structure by sampling single path for efficient search with less cost.We further conduct extensive experiments on three benchmark datasets. The results show that the searched architectures by our method achieve the SOTA performances.Besides, the searched models can also implicitly reveal diverse properties in different TKGs.Our code is released in https://github.com/striderdu/SPA.

pdf bib
N24News: A New Dataset for Multimodal News Classification
Zhen Wang | Xu Shan | Xiangxie Zhang | Jie Yang
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Current news datasets merely focus on text features on the news and rarely leverage the feature of images, excluding numerous essential features for news classification. In this paper, we propose a new dataset, N24News, which is generated from New York Times with 24 categories and contains both text and image information in each news. We use a multitask multimodal method and the experimental results show multimodal news classification performs better than text-only news classification. Depending on the length of the text, the classification accuracy can be increased by up to 8.11%. Our research reveals the relationship between the performance of a multimodal classifier and its sub-classifiers, and also the possible improvements when applying multimodal in news classification. N24News is shown to have great potential to prompt the multimodal news studies.

pdf bib
Knowledge Transfer between Structured and Unstructured Sources for Complex Question Answering
Lingbo Mo | Zhen Wang | Jie Zhao | Huan Sun
Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI)

Multi-hop question answering (QA) combines multiple pieces of evidence to search for the correct answer. Reasoning over a text corpus (TextQA) and/or a knowledge base (KBQA) has been extensively studied and led to distinct system architectures. However, knowledge transfer between such two QA systems has been under-explored. Research questions like what knowledge is transferred or whether the transferred knowledge can help answer over one source using another one, are yet to be answered. In this paper, therefore, we study the knowledge transfer of multi-hop reasoning between structured and unstructured sources. We first propose a unified QA framework named SimultQA to enable knowledge transfer and bridge the distinct supervisions from KB and text sources. Then, we conduct extensive analyses to explore how knowledge is transferred by leveraging the pre-training and fine-tuning paradigm. We focus on the low-resource fine-tuning to show that pre-training SimultQA on one source can substantially improve its performance on the other source. More fine-grained analyses on transfer behaviors reveal the types of transferred knowledge and transfer patterns. We conclude with insights into how to construct better QA datasets and systems to exploit knowledge transfer for future work.

2021

pdf bib
MedAI at SemEval-2021 Task 5: Start-to-end Tagging Framework for Toxic Spans Detection
Zhen Wang | Hongjie Fan | Junfei Liu
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes the system submitted to SemEval 2021 Task 5: Toxic Spans Detection. The task concerns evaluating systems that detect the spans that make a text toxic when detecting such spans are possible. To address the possibly multi-span detection problem, we develop a start-to-end tagging framework on top of RoBERTa based language model. Besides, we design a custom loss function that takes distance into account. In comparison to other participating teams, our system has achieved 69.03% F1 score, which is slightly lower (-1.8 and -1.73) than the top 1(70.83%) and top 2 (70.77%), respectively.

2020

pdf bib
Rationalizing Medical Relation Prediction from Corpus-level Statistics
Zhen Wang | Jennifer Lee | Simon Lin | Huan Sun
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Nowadays, the interpretability of machine learning models is becoming increasingly important, especially in the medical domain. Aiming to shed some light on how to rationalize medical relation prediction, we present a new interpretable framework inspired by existing theories on how human memory works, e.g., theories of recall and recognition. Given the corpus-level statistics, i.e., a global co-occurrence graph of a clinical text corpus, to predict the relations between two entities, we first recall rich contexts associated with the target entities, and then recognize relational interactions between these contexts to form model rationales, which will contribute to the final prediction. We conduct experiments on a real-world public clinical dataset and show that our framework can not only achieve competitive predictive performance against a comprehensive list of neural baseline models, but also present rationales to justify its prediction. We further collaborate with medical experts deeply to verify the usefulness of our model rationales for clinical decision making.

pdf bib
Diversify Question Generation with Continuous Content Selectors and Question Type Modeling
Zhen Wang | Siwei Rao | Jie Zhang | Zhen Qin | Guangjian Tian | Jun Wang
Findings of the Association for Computational Linguistics: EMNLP 2020

Generating questions based on answers and relevant contexts is a challenging task. Recent work mainly pays attention to the quality of a single generated question. However, question generation is actually a one-to-many problem, as it is possible to raise questions with different focuses on contexts and various means of expression. In this paper, we explore the diversity of question generation and come up with methods from these two aspects. Specifically, we relate contextual focuses with content selectors, which are modeled by a continuous latent variable with the technique of conditional variational auto-encoder (CVAE). In the realization of CVAE, a multimodal prior distribution is adopted to allow for more diverse content selectors. To take into account various means of expression, question types are explicitly modeled and a diversity-promoting algorithm is proposed further. Experimental results on public datasets show that our proposed method can significantly improve the diversity of generated questions, especially from the perspective of using different question types. Overall, our proposed method achieves a better trade-off between generation quality and diversity compared with existing approaches.

2018

pdf bib
Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension
Zhen Wang | Jiachen Liu | Xinyan Xiao | Yajuan Lyu | Tian Wu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While sophisticated neural-based techniques have been developed in reading comprehension, most approaches model the answer in an independent manner, ignoring its relations with other answer candidates. This problem can be even worse in open-domain scenarios, where candidates from multiple passages should be combined to answer a single question. In this paper, we formulate reading comprehension as an extract-then-select two-stage procedure. We first extract answer candidates from passages, then select the final answer by combining information from all the candidates. Furthermore, we regard candidate extraction as a latent variable and train the two-stage process jointly with reinforcement learning. As a result, our approach has improved the state-of-the-art performance significantly on two challenging open-domain reading comprehension datasets. Further analysis demonstrates the effectiveness of our model components, especially the information fusion of all the candidates and the joint training of the extract-then-select procedure.

2015

pdf bib
Aligning Knowledge and Text Embeddings by Entity Descriptions
Huaping Zhong | Jianwen Zhang | Zhen Wang | Hai Wan | Zheng Chen
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Chinese Semantic Role Labeling with Bidirectional Recurrent Neural Networks
Zhen Wang | Tingsong Jiang | Baobao Chang | Zhifang Sui
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
Knowledge Graph and Text Jointly Embedding
Zhen Wang | Jianwen Zhang | Jianlin Feng | Zheng Chen
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
A mixed approach for Chinese word segmentation
Zhen Wang
Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf bib
Extraction system for Personal Attributes Extraction of CLP2014
Zhen Wang
Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing

2013

pdf bib
A Mixed Morpho-Syntactic and Statistical Approach to Chinese Named Entity Recognition (Une approche mixte morpho-syntaxique et statistique pour la reconnaissance d’entités nommées en langue chinoise) [in French]
Zhen Wang
Proceedings of RECITAL 2013

2004

pdf bib
Aligning Bilingual Corpora Using Sentences Location Information
Weigang Li | Ting Liu | Zhen Wang | Sheng Li
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing