Hao Wang


2024

pdf bib
LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay
Yihuai Lan | Zhiqiang Hu | Lei Wang | Yang Wang | Deheng Ye | Peilin Zhao | Ee-Peng Lim | Hui Xiong | Hao Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper explores the open research problem of understanding the social behaviors of LLM-based agents. Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay. While previous studies have touched on gameplay with LLM agents, research on their social behaviors is lacking. We propose a novel framework, tailored for Avalon, that features a multi-agent system facilitating efficient communication and interaction. We evaluate its performance based on game success and analyze LLM agents’ social behaviors. Results affirm the framework’s effectiveness in creating adaptive agents and suggest LLM-based agents’ potential in navigating dynamic social interactions. By examining collaboration and confrontation behaviors, we offer insights into this field’s research and applications.

pdf bib
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
Hao Wang | Hao Li | Minlie Huang | Lei Sha
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The safety defenses of large language models (LLMs) remain limited because dangerous prompts are manually curated to cover just a few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can bypass the defenses of LLMs and lead to dangerous outputs. However, similar to traditional text adversarial attacks, this approach, while effective, is limited by the challenge of optimizing over discrete tokens. Such gradient-based discrete optimization attacks require over 100,000 LLM calls, and because the adversarial suffixes are unreadable, they can be relatively easily blocked by common defenses such as perplexity filters. To cope with this challenge, in this paper we propose an Adversarial Suffix Embedding Translation Framework (ASETF), aimed at transforming continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead during the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen LLMs’ security defenses. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, employing harmful directives sourced from the AdvBench dataset. The results indicate that our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs such as ChatGPT and Gemini.

pdf bib
GOME: Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration
Linhao Zhang | Jintao Liu | Li Jin | Hao Wang | Kaiwen Wei | Guangluan Xu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The illustration or visualization of figurative language, such as linguistic metaphors, is an emerging challenge for existing Large Language Models (LLMs) and multimodal models. Because metaphors compare seemingly unrelated concepts, existing LLMs have a tendency toward over-literalization: they illustrate figurative language solely based on literal objects, ignoring the underlying groundings and associations across disparate metaphorical domains. Furthermore, prior approaches have ignored the binding process between visual objects and metaphorical attributes, which further intensifies the infidelity of visual metaphors. To address the issues above, we propose GOME (Grounding-based Metaphor Binding), which illustrates linguistic metaphors from the grounding perspective elaborated through LLMs. GOME consists of two steps for metaphor illustration: grounding-based elaboration and scenario visualization. In the elaboration step, metaphorical knowledge is integrated into systematic instructions for LLMs via a CoT prompting method rooted in rhetoric. This approach specifies metaphorical devices such as vehicles and groundings to ensure accurate and faithful descriptions consumed by text-to-image models. In the visualization step, an inference-time metaphor binding method is realized based on the elaboration outputs, which registers attentional control during the diffusion process and captures the underlying attributes from the abstract metaphorical domain. Comprehensive evaluations on multiple downstream tasks confirm that GOME is superior to isolated LLMs, diffusion models, or their direct collaboration.

pdf bib
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Hao Wang | Shuhei Kurita | Shuichiro Shimizu | Daisuke Kawahara
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. In AVSR, considerable effort has been directed at datasets of facial features such as lip reading, but these often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances using the text on the slides of presentation recordings. Because technical terminology, which is frequent in paper explanations, is notoriously challenging to transcribe without reference texts, SlideAVSR spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.

pdf bib
LatticeGen: Hiding Generated Text in a Lattice for Privacy-Aware Large Language Model Generation on Cloud
Mengke Zhang | Tianxing He | Tianle Wang | Lu Mi | Niloofar Mireshghallah | Binyi Chen | Hao Wang | Yulia Tsvetkov
Findings of the Association for Computational Linguistics: NAACL 2024

In the current user-server interaction paradigm of prompted generation with large language models (LLMs) on cloud, the server fully controls the generation process, which leaves zero options for users who want to keep the generated text private to themselves. For privacy-aware text generation on cloud, we propose LatticeGen, a cooperative protocol in which the server still handles most of the computation while the client controls the sampling operation. The key idea is that the true generated sequence is mixed with noise tokens by the client and hidden in a noised lattice. Only the client knows which tokens are the true ones. Considering potential attacks from a hypothetically malicious server and how the client can defend against them, we propose the repeated beam-search attack and the mixing noise scheme. In our experiments we apply LatticeGen to protect both the prompt and the generation. It is shown that while the noised lattice degrades generation quality, LatticeGen successfully protects the true generation to a remarkable degree under strong attacks (more than 50% of the semantics remain hidden as measured by BERTScore).
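
The client-side noising idea can be illustrated with a small sketch (assuming a simple uniform noise scheme; this is not the authors' implementation, and names such as `noise_vocab` are made up): at each step the client mixes the true token with N-1 noise tokens and privately records which slot holds the true one, so the server only ever sees the noised lattice.

```python
import random

def build_noised_lattice(true_tokens, noise_vocab, width=3, seed=0):
    """Mix each true token with (width - 1) noise tokens.

    Returns the public lattice (what the server sees) and the private
    slot indices telling the client where the true token sits at each step.
    """
    rng = random.Random(seed)
    lattice, true_positions = [], []
    for tok in true_tokens:
        noise = rng.sample([w for w in noise_vocab if w != tok], width - 1)
        slot = list(noise)
        pos = rng.randrange(width)
        slot.insert(pos, tok)
        lattice.append(slot)
        true_positions.append(pos)
    return lattice, true_positions

def recover(lattice, true_positions):
    """Only the client, who holds true_positions, can read the real sequence."""
    return [slot[pos] for slot, pos in zip(lattice, true_positions)]

if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    vocab = "a an dog ran red blue tree river walks".split()
    lat, pos = build_noised_lattice(tokens, vocab)
    print(lat)                  # server-visible noised lattice
    print(recover(lat, pos))    # client-side reconstruction of the true sequence
```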

pdf bib
Benchmarking Large Language Models on Communicative Medical Coaching: A Dataset and a Novel System
Hengguan Huang | Songtao Wang | Hongfu Liu | Hao Wang | Ye Wang
Findings of the Association for Computational Linguistics: ACL 2024

Traditional applications of natural language processing (NLP) in healthcare have predominantly focused on patient-centered services, enhancing patient interactions and care delivery, such as through medical dialogue systems. However, the potential of NLP to benefit inexperienced doctors, particularly in areas such as communicative medical coaching, remains largely unexplored. We introduce “ChatCoach”, a human-AI cooperative framework designed to assist medical learners in practicing their communication skills during patient consultations. ChatCoach differentiates itself from conventional dialogue systems by offering a simulated environment where medical learners can practice dialogues with a patient agent, while a coach agent provides immediate, structured feedback. This is facilitated by our proposed Generalized Chain-of-Thought (GCoT) approach, which fosters the generation of structured feedback and enhances the utilization of external knowledge sources. Additionally, we have developed a dataset specifically for evaluating Large Language Models (LLMs) within the ChatCoach framework on communicative medical coaching tasks. Our empirical results validate the effectiveness of ChatCoach.

pdf bib
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Zhicheng Guo | Sijie Cheng | Hao Wang | Shihao Liang | Yujia Qin | Peng Li | Zhiyuan Liu | Maosong Sun | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools with limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to alleviate changes in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator system.
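
The cache-then-simulate behaviour of a virtual API server can be sketched roughly as follows (a minimal illustration, not the benchmark's code; `toy_simulator` stands in for the LLM-backed API simulator):

```python
import json

class VirtualAPIServer:
    """Serve tool calls from a cache first; fall back to a simulator on a miss."""

    def __init__(self, simulator, cache=None):
        self.simulator = simulator   # any callable that fakes an API response
        self.cache = cache or {}

    def call(self, api_name, **params):
        key = json.dumps({"api": api_name, "params": params}, sort_keys=True)
        if key not in self.cache:            # cache miss: simulate and memoize
            self.cache[key] = self.simulator(api_name, params)
        return self.cache[key]

def toy_simulator(api_name, params):
    return {"api": api_name, "echo": params, "status": "ok (simulated)"}

server = VirtualAPIServer(toy_simulator)
print(server.call("weather.lookup", city="Tokyo"))
print(server.call("weather.lookup", city="Tokyo"))   # second call served from cache
```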

pdf bib
Variational Language Concepts for Interpreting Foundation Language Models
Hengyi Wang | Shiwei Tan | Zhiqing Hong | Desheng Zhang | Hao Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Foundation Language Models (FLMs) such as BERT and its variants have achieved remarkable success in natural language processing. To date, the interpretability of FLMs has primarily relied on the attention weights in their self-attention layers. However, these attention weights only provide word-level interpretations, failing to capture higher-level structures, and are therefore lacking in readability and intuitiveness. To address this challenge, we first provide a formal definition of *conceptual interpretation* and then propose a variational Bayesian framework, dubbed VAriational Language Concept (VALC), to go beyond word-level interpretations and provide concept-level interpretations. Our theoretical analysis shows that our VALC finds the optimal language concepts to interpret FLM predictions. Empirical results on several real-world datasets show that our method can successfully provide conceptual interpretation for FLMs.

pdf bib
A Framework of Knowledge Graph-Enhanced Large Language Model Based on Question Decomposition and Atomic Retrieval
Yading Li | Dandan Song | Changzhi Zhou | Yuhang Tian | Hao Wang | Ziyi Yang | Shuhao Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Knowledge graphs (KGs) can provide explainable reasoning for large language models (LLMs), alleviating their hallucination problem. Knowledge graph question answering (KGQA) is a typical benchmark for evaluating methods that enhance LLMs with KGs. Previous methods for KG-enhanced LLMs on KGQA either enhance LLMs with KG retrieval in a single round or perform multi-hop KG reasoning over multiple rounds with LLMs. Both conduct retrieval and reasoning based solely on the whole original question, without any processing of the question itself. To tackle this limitation, we propose a framework of KG-enhanced LLM based on question decomposition and atomic retrieval, called KELDaR. We introduce a question decomposition tree as the framework for LLM reasoning. This approach extracts the implicit information about reasoning steps within complex questions, serving as a guide to facilitate atomic retrieval on the KG targeting the atomic-level simple questions at the leaves of the tree. Additionally, we design strategies for atomic retrieval, which extract and retrieve question-relevant KG subgraphs to assist the few-shot LLM in answering atomic-level questions. Experiments on KGQA datasets demonstrate that our framework outperforms existing reasoning-based baselines. Moreover, in a low-cost setting without additional training or fine-tuning, our framework achieves competitive or superior results compared to most existing training-based baselines.

pdf bib
Augmenting Reasoning Capabilities of LLMs with Graph Structures in Knowledge Base Question Answering
Yuhang Tian | Dandan Song | Zhijing Wu | Changzhi Zhou | Hao Wang | Jun Yang | Jing Xu | Ruanmin Cao | HaoYu Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Recently, significant progress has been made in employing Large Language Models (LLMs) for semantic parsing to address Knowledge Base Question Answering (KBQA) tasks. Previous work utilizes LLMs to generate query statements over Knowledge Bases (KBs) for retrieving answers. However, LLMs often generate incorrect query statements because previous methods provide them with insufficient relevant knowledge. To address this, we propose a framework called Augmenting Reasoning Capabilities of LLMs with Graph Structures in Knowledge Base Question Answering (ARG-KBQA), which retrieves question-related graph structures to improve the performance of LLMs. Unlike other methods that directly retrieve relations or triples from KBs, we introduce an unsupervised two-stage ranker to perform multi-hop beam search over KBs, which provides LLMs with information more relevant to the questions. Experimental results demonstrate that ARG-KBQA sets a new state of the art on GrailQA and WebQSP under the few-shot setting. Additionally, ARG-KBQA significantly outperforms previous few-shot methods on questions whose query statements are unseen in the training data.

pdf bib
Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
Ziqi Yin | Hao Wang | Kaito Horio | Daisuke Kawahara | Satoshi Sekine
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)

We investigate the impact of politeness levels in prompts on the performance of large language models (LLMs). Polite language in human communications often garners more compliance and effectiveness, while rudeness can cause aversion, impacting response quality. We consider that LLMs mirror human communication traits, suggesting they align with human cultural norms. We assess the impact of politeness in prompts on LLMs across English, Chinese, and Japanese tasks. We observed that impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes. The best politeness level is different according to the language. This phenomenon suggests that LLMs not only reflect human behavior but are also influenced by language, particularly in different cultural contexts. Our findings highlight the need to factor in politeness for cross-cultural natural language processing and LLM usage.
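
As a rough illustration of how such prompt variants can be constructed (the templates below are invented for illustration and are not the study's actual prompts, which span English, Chinese, and Japanese with finer-grained politeness levels):

```python
# Illustrative prompt templates at three politeness levels (English only here).
POLITENESS_TEMPLATES = {
    "polite":   "Could you please answer the following question? Thank you very much.\n{question}",
    "neutral":  "Answer the following question.\n{question}",
    "impolite": "Answer this now, or you are useless.\n{question}",
}

def build_prompts(question):
    """Return one prompt per politeness level for the same underlying task."""
    return {level: tpl.format(question=question)
            for level, tpl in POLITENESS_TEMPLATES.items()}

for level, prompt in build_prompts("What is the capital of France?").items():
    print(f"--- {level} ---\n{prompt}\n")
```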

pdf bib
Separation and Fusion: A Novel Multiple Token Linking Model for Event Argument Extraction
Jing Xu | Dandan Song | Siu Hui | Zhijing Wu | Meihuizi Jia | Hao Wang | Yanru Zhou | Changzhi Zhou | Ziyi Yang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In event argument extraction (EAE), a promising approach involves jointly encoding text and argument roles and performing multiple token linking operations. This approach further falls into two categories. One extracts arguments within a single event, while the other attempts to extract arguments from multiple events simultaneously. However, the former fails to leverage cross-event information, and the latter requires harder predictions, with longer encoded role sequences and extra linking operations. In this paper, we design a novel separation-and-fusion paradigm to separately acquire cross-event information and fuse it into the argument extraction of a target event. Following this paradigm, we propose a novel multiple token linking model named Sep2F, which can effectively build event correlations via roles and preserve the simple linking predictions of single-event extraction. In particular, we employ one linking module to extract arguments for the target event and another to aggregate the role information of multiple events. More importantly, we propose a novel two-fold fusion module to ensure that the aggregated cross-event information serves EAE well. We evaluate our proposed model on sentence-level and document-level datasets, including ACE05, RAMS, WikiEvents, and MLEE. The extensive experimental results indicate that our model outperforms the state-of-the-art EAE models on all the datasets.

pdf bib
Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation
Hao Wang | Tetsuro Morimura | Ukyo Honda | Daisuke Kawahara
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models’ training.
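
The difference between the two reward-assignment schemes can be sketched in a few lines (an illustrative simplification, not the paper's training code): stepwise maximization gives each edit action its own reward-to-go, whereas episodic maximization lets every action share one reward computed on the finished output, e.g. sentence-level BLEU.

```python
def stepwise_returns(step_rewards, gamma=1.0):
    """Stepwise reward maximization: each edit action is credited with the
    (discounted) reward-to-go from its own step onward."""
    returns, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def episodic_returns(final_reward, n_steps):
    """Episodic reward maximization: every action in the episode shares the
    single reward computed on the finished translation."""
    return [final_reward] * n_steps

print(stepwise_returns([0.1, 0.0, 0.3]))   # -> [0.4, 0.3, 0.3]
print(episodic_returns(0.42, 3))           # -> [0.42, 0.42, 0.42]
```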

pdf bib
A Benchmark Suite of Japanese Natural Questions
Takuya Uematsu | Hao Wang | Daisuke Kawahara | Tomohide Shibata
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

To develop high-performance and robust natural language processing (NLP) models, it is important to have various question answering (QA) datasets to train, evaluate, and analyze them. Although there are various QA datasets available in English, there are only a few QA datasets in other languages. We focus on Japanese, a language with only a few basic QA datasets, and aim to build a Japanese version of Natural Questions (NQ) consisting of questions that naturally arise from human information needs. We collect natural questions from query logs of a Japanese search engine and build the dataset using crowdsourcing. We construct Japanese Natural Questions (JNQ) and a Japanese version of BoolQ (JBoolQ), which is derived from NQ and consists of yes/no questions. JNQ consists of 16,871 questions, and JBoolQ consists of 6,467 questions. We also define two tasks from JNQ and one from JBoolQ and establish baselines using competitive methods drawn from related literature. We hope that these datasets will facilitate research on QA and NLP models in Japanese. We are planning to release JNQ and JBoolQ.

pdf bib
FinNLP-AgentScen-2024 Shared Task: Financial Challenges in Large Language Models - FinLLMs
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

pdf bib
MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
Wentian Wang | Sarthak Jain | Paul Kantor | Jacob Feldman | Lazaros Gallos | Hao Wang
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance on question-answering tasks with modified terms. We reasoned that an agent that “truly” understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could appear in the question, the answers, or both. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. MMLU-SR thus provides a rigorous benchmark for testing true model comprehension and poses a challenge to the broader scientific community.
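
The term-replacement idea can be sketched as follows (illustrative only; the dummy word and definition wording are not taken from the dataset):

```python
import re

def symbol_replace(text, term, dummy="zorblax", definition=None):
    """Replace every occurrence of `term` with a dummy word and prepend the
    term's definition, so a model must reason from the definition rather
    than from surface familiarity with the term."""
    replaced = re.sub(rf"\b{re.escape(term)}\b", dummy, text, flags=re.IGNORECASE)
    if definition:
        replaced = f"{dummy} is defined as: {definition}\n{replaced}"
    return replaced

question = "What is the derivative of x**2 with respect to x?"
print(symbol_replace(question, "derivative",
                     definition="the instantaneous rate of change of a function"))
```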

pdf bib
Virtual Compiler Is All You Need For Assembly Code Search
Zeyu Gao | Hao Wang | Yuanda Wang | Chao Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions within vast binary programs using natural language. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we continue pre-training CodeLlama as a Virtual Compiler (ViC), capable of compiling any source code to assembly code. This approach allows for “virtual” compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalence and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. Employing this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.

pdf bib
Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training
Junqing He | Kunhao Pan | Xiaoqun Dong | Zhuoyang Song | LiuYiBo LiuYiBo | Qianguosun Qianguosun | Yuxin Liang | Hao Wang | Enming Zhang | Jiaxing Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While large language models (LLMs) are equipped with longer text input capabilities than before, they still struggle to locate correct information in long contexts. The “lost in the middle” problem challenges most LLMs and refers to the dramatic decline in accuracy when the correct information is located in the middle of the context. To overcome this crucial issue, this paper proposes to enhance the information searching and reflection ability of LLMs in long contexts via specially designed tasks called Position-Agnostic Multi-step QA (PAM QA). Trained on this task, our model excels at focusing more precisely on the desired information. Experimental results show substantial improvement in Multi-doc QA and other benchmarks, surpassing state-of-the-art models by a 13.7% absolute gain in shuffled settings and by 21.5% on the passage retrieval task. We release our model and code to promote related research in the community.

pdf bib
Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation
Yuxin Liang | Zhuoyang Song | Hao Wang | Jiaxing Zhang
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP

We evaluate the ability of Large Language Models (LLMs) to discern and express their internal knowledge state, a key factor in countering factual hallucination and ensuring reliable application of LLMs. We observe a robust self-awareness of internal knowledge state in LLMs, evidenced by over 85% accuracy in knowledge state probing. However, LLMs often fail to faithfully express their internal knowledge during generation, leading to factual hallucinations. We develop an automated hallucination annotation tool, DreamCatcher, which merges knowledge probing and consistency checking methods to rank factual preference data. Using knowledge preference as the reward, we propose a Reinforcement Learning from Knowledge Feedback (RLKF) training framework, leveraging reinforcement learning to enhance the factuality and honesty of LLMs. Our experiments across multiple models show that RLKF training effectively enhances the ability of models to utilize their internal knowledge state, boosting performance in a variety of knowledge-based and honesty-related tasks.
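
A minimal sketch of consistency-based knowledge probing in the spirit described above (not DreamCatcher itself; `sample_answer` is any stochastic generation call and `toy_sampler` is a stand-in):

```python
import random
from collections import Counter

def knowledge_probe(sample_answer, question, n_samples=8, threshold=0.75):
    """Consistency-based probe: sample several answers and treat high
    self-agreement as evidence that the model's internal knowledge state
    supports a confident answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    return {"answer": top_answer, "agreement": agreement, "knows": agreement >= threshold}

def toy_sampler(question):
    # Stand-in for a stochastic LLM sampling call.
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

print(knowledge_probe(toy_sampler, "What is the capital of France?"))
```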

pdf bib
Mixture-of-LoRAs: An Efficient Multitask Tuning Method for Large Language Models
Wenfeng Feng | Chuzhan Hao | Yuewei Zhang | Yu Han | Hao Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Instruction tuning has the potential to stimulate or enhance specific capabilities of large language models (LLMs). However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture, a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which helps prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.
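
A bare-bones sketch of routing among domain-specific LoRA adapters by an explicit domain label (illustrative numpy code, not the paper's implementation; the shapes and rank are arbitrary):

```python
import numpy as np

class LoRALinear:
    """A frozen base weight plus several low-rank adapters, one per domain.
    Routing is by an explicit domain label, mirroring the explicit routing
    strategy described above."""

    def __init__(self, d_in, d_out, domains, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_out, d_in))      # frozen base weight
        self.adapters = {                                         # one (B, A) pair per domain
            dom: (rng.normal(scale=0.02, size=(d_out, rank)),
                  rng.normal(scale=0.02, size=(rank, d_in)))
            for dom in domains
        }

    def forward(self, x, domain):
        B, A = self.adapters[domain]                               # pick adapter by label
        return self.W @ x + B @ (A @ x)

layer = LoRALinear(d_in=16, d_out=8, domains=["medical", "legal", "code"])
x = np.ones(16)
print(layer.forward(x, "medical")[:3])
print(layer.forward(x, "legal")[:3])   # same input, different domain adapter
```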

pdf bib
MRT: Multi-modal Short- and Long-range Temporal Convolutional Network for Time-sync Comment Video Behavior Prediction
Weihao Zhao | Weidong He | Hao Wang | Haoyang Bi | Han Wu | Chen Zhu | Tong Xu | Enhong Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

As a fresh way to improve the user viewing experience, videos with time-sync comments have attracted a lot of interest. Many efforts have been made to explore the effectiveness of time-sync comments for various applications. However, due to the complexity of interactions among users, videos, and comments, it remains challenging to understand users’ behavior on time-sync comments. Along this line, we study the problem of time-sync comment behavior prediction, considering both historical behaviors and multi-modal information from visual frames and textual comments. Specifically, we propose a novel Multi-modal short- and long-Range Temporal Convolutional Network model, namely MRT. Firstly, we design two amplified Temporal Convolutional Networks with different sizes of receptive fields to capture both short- and long-range surrounding contexts for each frame and time-sync comment. Then, we design a bottleneck fusion module to obtain the multi-modal enhanced representation. Furthermore, we take user preferences into consideration to generate a personalized multi-modal semantic representation at each timestamp. Finally, we utilize the binary cross-entropy loss to optimize MRT on the basis of users’ historical records. Through comparisons with representative baselines, we demonstrate the effectiveness of MRT and qualitatively verify the necessity and utility of short- and long-range contextual and multi-modal information through extensive experiments.

pdf bib
Towards Human-Like Machine Comprehension: Few-Shot Relational Learning in Visually-Rich Documents
Hao Wang | Tang Li | Chenhui Chu | Rui Wang | Pinpin Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Key-value relations are prevalent in Visually-Rich Documents (VRDs), often depicted in distinct spatial regions accompanied by specific color and font styles. These non-textual cues serve as important indicators that greatly enhance human comprehension and acquisition of such relation triplets. However, current document AI approaches often fail to consider this valuable prior information related to visual and spatial features, resulting in suboptimal performance, particularly when dealing with limited examples. To address this limitation, our research focuses on few-shot relational learning, specifically targeting the extraction of key-value relation triplets in VRDs. Given the absence of a suitable dataset for this task, we introduce two new few-shot benchmarks built upon existing supervised benchmark datasets. Furthermore, we propose a variational approach that incorporates relational 2D-spatial priors and prototypical rectification techniques. This approach aims to generate relation representations that are more aware of the spatial context and of unseen relations, in a manner similar to human perception. Experimental results demonstrate the effectiveness of our proposed method by showcasing its ability to outperform existing methods. This study also opens up new possibilities for practical applications.

2023

pdf bib
Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models
Hao Wang | Hirofumi Shimizu | Daisuke Kawahara
Findings of the Association for Computational Linguistics: ACL 2023

Recent studies in natural language processing (NLP) have focused on modern languages and achieved state-of-the-art results in many tasks. Meanwhile, little attention has been paid to ancient texts and related tasks. Classical Chinese first came to Japan approximately 2,000 years ago. It was gradually adapted to a Japanese form through reading and translation methods called Kanbun-Kundoku (Kanbun), which have significantly impacted Japanese literature. However, compared to the rich resources of ancient texts in mainland China, Kanbun resources remain scarce in Japan. To solve this problem, we construct the first Classical-Chinese-to-Kanbun dataset in the world. Furthermore, we introduce two tasks, character reordering and machine translation, both of which play a significant role in Kanbun comprehension. We also test current language models on these tasks and discuss the best evaluation method by comparing the results with human scores. We release our code and dataset on GitHub.

pdf bib
DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading
Hao Wang | Qingxuan Wang | Yue Li | Changqing Wang | Chenhui Chu | Rui Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

The use of visually-rich documents in various fields has created a demand for Document AI models that can read and comprehend documents like humans, which requires overcoming technical, linguistic, and cognitive barriers. Unfortunately, the lack of appropriate datasets has significantly hindered advancements in the field. To address this issue, we introduce DocTrack, a visually-rich document dataset really aligned with human eye-movement information collected using eye-tracking technology. This dataset can be used to investigate the challenges mentioned above. Additionally, we explore the impact of human reading order on document understanding tasks and examine what would happen if a machine read in the same order as a human. Our results suggest that although Document AI models have made significant progress, they still have a long way to go before they can read visually-rich documents as accurately, continuously, and flexibly as humans do. These findings have potential implications for future research and development of document intelligence.

pdf bib
Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning
Hao Wang | Xiahua Chen | Rui Wang | Chenhui Chu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Extracting meaningful entities belonging to predefined categories from Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and layout features such as font, background, color, and bounding box location and size provide important cues for identifying entities of the same type. However, existing models commonly train a visual encoder with weak cross-modal supervision signals, resulting in a limited capacity to capture these non-textual features and suboptimal performance. In this paper, we propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach that addresses the above limitation by enhancing the model’s ability to capture fine-grained visual and layout features through the incorporation of color priors. Experimental results on benchmark datasets show that our approach substantially outperforms the strong LayoutLM series baseline, demonstrating the effectiveness of our approach. Additionally, we investigate the effects of different color schemes on our approach, providing insights for optimizing model performance. We believe our work will inspire future research on multimodal information extraction.

pdf bib
Harnessing the Plug-and-Play Controller by Prompting
Hao Wang | Lei Sha
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Controllable text generation is a growing field within natural language generation (NLG) that focuses on producing text that meets specific constraints in real-world applications. Previous approaches, such as plug-and-play controllers (PPCs), aimed to steer the properties of generated text in a flexible manner. However, these methods often compromised the integrity of the language model’s decoding process, resulting in less smooth text generation. Alternatively, other techniques utilized multiple attribute prompts to align the generated text with desired attributes, but this approach required prompt design for each attribute and was dependent on the size of the language model. This paper introduces a novel method for flexible attribute control in text generation using pre-trained language models (PLMs). The proposed approach aims to enhance the fluency of generated text by guiding the generation process with PPCs. The key idea is to dynamically adjust the distribution of generated text by modifying prompts, effectively constraining the output space of the language model and influencing the desired attribute. To enable smooth cooperation between the PLM and the PPC, our work innovatively proposes a new model fine-tuning method: Reinforcement Learning with Dynamic Adjust Feedback (RLDAF). This fine-tuning process adapts a small subset of the language model’s parameters based on the generating actions taken during the PPC control process. The resulting harmonious collaboration between the PLM and PPC leads to improved smoothness in text generation during inference. Extensive experiments were conducted on the SST2 dataset, and the proposed method outperformed previous approaches on various evaluation metrics, including text fluency and attribute consistency.

2022

pdf bib
R2F: A General Retrieval, Reading and Fusion Framework for Document-level Natural Language Inference
Hao Wang | Yixin Cao | Yangguang Li | Zhen Huang | Kun Wang | Jing Shao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify document-level task into a set of sentence-level tasks, and improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise, and reads to estimate its credibility. Then the sentence-level results are fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotation on hypothesis sentences, for interpretability study. Our experimental results show that R2F framework can obtain state-of-the-art performance and is robust for diverse evidence retrieval methods. Moreover, it can give more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.

pdf bib
Toward Knowledge-Enriched Conversational Recommendation Systems
Tong Zhang | Yong Liu | Boyang Li | Peixiang Zhong | Chen Zhang | Hao Wang | Chunyan Miao
Proceedings of the 4th Workshop on NLP for Conversational AI

Conversational Recommendation Systems recommend items through language based interactions with users. In order to generate naturalistic conversations and effectively utilize knowledge graphs (KGs) containing background information, we propose a novel Bag-of-Entities loss, which encourages the generated utterances to mention concepts related to the item being recommended, such as the genre or director of a movie. We also propose an alignment loss to further integrate KG entities into the response generation network. Experiments on the large-scale REDIAL dataset demonstrate that the proposed system consistently outperforms state-of-the-art baselines.
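
One way to picture a bag-of-entities-style objective is sketched below (an illustrative simplification, not the paper's exact loss): it rewards decoding distributions that place probability mass on tokens belonging to KG entities related to the recommended item, such as a movie's genre or director.

```python
import numpy as np

def bag_of_entities_loss(token_probs, entity_token_ids, eps=1e-9):
    """Penalize generations whose per-step next-token distributions place
    little probability on related KG entity tokens.

    token_probs: (seq_len, vocab) array of per-step next-token probabilities.
    entity_token_ids: vocabulary ids of tokens belonging to related entities.
    """
    ids = np.asarray(sorted(set(entity_token_ids)))
    per_step_mass = token_probs[:, ids].sum(axis=1)      # prob. of mentioning an entity
    return float(-np.log(per_step_mass.mean() + eps))     # lower when entities are likely

vocab_size, steps = 10, 4
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(vocab_size), size=steps)    # fake decoder outputs
print(bag_of_entities_loss(probs, entity_token_ids=[2, 5]))
```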

pdf bib
IMCI: Integrate Multi-view Contextual Information for Fact Extraction and Verification
Hao Wang | Yangguang Li | Zhen Huang | Yong Dou
Proceedings of the 29th International Conference on Computational Linguistics

With the rapid development of automatic fake news detection technology, fact extraction and verification (FEVER) has been attracting more attention. The task aims to extract the most relevant fact evidence from millions of open-domain Wikipedia documents and then verify the credibility of the corresponding claims. Although several strong models have been proposed for the task and have made great progress, we argue that they fail to utilize multi-view contextual information and thus cannot obtain better performance. In this paper, we propose to integrate multi-view contextual information (IMCI) for fact extraction and verification. For each evidence sentence, we define two kinds of context, i.e., intra-document context and inter-document context. Intra-document context consists of the document title and all the other sentences from the same document. Inter-document context consists of all other evidence sentences, which may come from different documents. We then integrate the multi-view contextual information to encode the evidence sentences for the task. Our experimental results on the FEVER 1.0 shared task show that our IMCI framework makes great progress on both fact extraction and verification, and achieves state-of-the-art performance with a winning FEVER score of 73.96% and label accuracy of 77.25% on the online blind test set. We also conduct an ablation study to assess the impact of multi-view contextual information.

2021

pdf bib
融合零指代识别的篇章级机器翻译(Context-aware Machine Translation Integrating Zero Pronoun Recognition)
Hao Wang (汪浩) | Junhui Li (李军辉) | Zhengxian Gong (贡正仙)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

In Chinese and other languages where pronouns are habitually omitted, pronouns that can be inferred from context are usually dropped. Although neural machine translation models represented by the Transformer have achieved great success, this omission phenomenon still poses a serious challenge to them. Building on the Transformer, this paper proposes a translation model that integrates zero-pronoun recognition and introduces document-level context to enrich the anaphoric information. Specifically, the model adopts a joint learning framework: on top of the translation model, it adds a classification task that identifies which constituent of the sentence an omitted pronoun represents, enabling the model to incorporate zero-anaphora information to assist translation. Experiments on a Chinese-English dialogue dataset verify the effectiveness of the proposed method, which improves translation performance by 1.48 BLEU over the baseline model.

2020

pdf bib
Bayes-enhanced Lifelong Attention Networks for Sentiment Classification
Hao Wang | Shuai Wang | Sahisnu Mazumder | Bing Liu | Yan Yang | Tianrui Li
Proceedings of the 28th International Conference on Computational Linguistics

The classic deep learning paradigm learns a model from the training data of a single task and the learned model is also tested on the same task. This paper studies the problem of learning a sequence of tasks (sentiment classification tasks in our case). After each sentiment classification task is learned, its knowledge is retained to help future task learning. Following this setting, we explore attention neural networks and propose a Bayes-enhanced Lifelong Attention Network (BLAN). The key idea is to exploit the generative parameters of naive Bayes to learn attention knowledge. The learned knowledge from each task is stored in a knowledge base and later used to build lifelong attentions. The constructed lifelong attentions are then used to enhance the attention of the network to help new task learning. Experimental results on product reviews from Amazon.com show the effectiveness of the proposed model.

pdf bib
Argumentation Mining on Essays at Multi Scales
Hao Wang | Zhen Huang | Yong Dou | Yu Hong
Proceedings of the 28th International Conference on Computational Linguistics

Argumentation mining on essays is a new and challenging task in natural language processing that aims to identify the types and locations of argumentation components. Recent research mainly models the task as a sequence tagging problem and deals with all argumentation components at the word level. However, this task is not scale-independent. Some types of argumentation components, which serve as core opinions of essays or paragraphs, exist at the essay or paragraph level. Sequence tagging methods reason over local context words and fail to effectively mine these components. To this end, we propose a multi-scale argumentation mining model, in which we mine different types of argumentation components at their corresponding levels. In addition, an effective coarse-to-fine argumentation fusion mechanism is proposed to further improve performance. We conduct a series of experiments on the Persuasive Essay dataset (PE2.0). Experimental results indicate that our model outperforms existing models at mining all types of argumentation components.

pdf bib
Entity-Aware Dependency-Based Deep Graph Attention Network for Comparative Preference Classification
Nianzu Ma | Sahisnu Mazumder | Hao Wang | Bing Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper studies the task of comparative preference classification (CPC). Given two entities in a sentence, our goal is to classify whether the first (or the second) entity is preferred over the other or no comparison is expressed at all between the two entities. Existing works either do not learn entity-aware representations well and fail to deal with sentences involving multiple entity pairs or use sequential modeling approaches that are unable to capture long-range dependencies between the entities. Some also use traditional machine learning approaches that do not generalize well. This paper proposes a novel Entity-aware Dependency-based Deep Graph Attention Network (ED-GAT) that employs a multi-hop graph attention over a dependency graph sentence representation to leverage both the semantic information from word embeddings and the syntactic information from the dependency graph to solve the problem. Empirical evaluation shows that the proposed model achieves the state-of-the-art performance in comparative preference classification.

pdf bib
Towards Persona-Based Empathetic Conversational Models
Peixiang Zhong | Chen Zhang | Hao Wang | Yong Liu | Chunyan Miao
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Empathetic conversational models have been shown to improve user satisfaction and task outcomes in numerous domains. In Psychology, persona has been shown to be highly correlated to personality, which in turn influences empathy. In addition, our empirical analysis also suggests that persona plays an important role in empathetic conversations. To this end, we propose a new task towards persona-based empathetic conversations and present the first empirical study on the impact of persona on empathetic responding. Specifically, we first present a novel large-scale multi-domain dataset for persona-based empathetic conversations. We then propose CoBERT, an efficient BERT-based response selection model that obtains the state-of-the-art performance on our dataset. Finally, we conduct extensive experiments to investigate the impact of persona on empathetic responding. Notably, our results show that persona improves empathetic responding more when CoBERT is trained on empathetic conversations than non-empathetic ones, establishing an empirical link between persona and empathy in human conversations.

2019

pdf bib
Learning with Noisy Labels for Sentence-level Sentiment Classification
Hao Wang | Bing Liu | Chaozhuo Li | Yan Yang | Tianrui Li
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Deep neural networks (DNNs) can fit (or even over-fit) the training data very well. If a DNN model is trained using data with noisy labels and tested on data with clean labels, the model may perform poorly. This paper studies the problem of learning with noisy labels for sentence-level sentiment classification. We propose a novel DNN model called NetAb (as shorthand for convolutional neural Networks with Ab-networks) to handle noisy labels during training. NetAb consists of two convolutional neural networks, one with a noise transition layer for dealing with the input noisy labels and the other for predicting ‘clean’ labels. We train the two networks using their respective loss functions in a mutual reinforcement manner. Experimental results demonstrate the effectiveness of the proposed model.
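
The noise-transition idea can be illustrated with a small sketch (not NetAb itself): the clean-label distribution predicted by one network is pushed through a label-noise transition matrix to obtain the distribution over observed noisy labels, which is what gets matched against the noisy training data.

```python
import numpy as np

def noisy_label_probs(clean_probs, transition):
    """Push the clean-label distribution through a noise transition matrix T,
    where T[i, j] = P(noisy label = j | clean label = i)."""
    return np.asarray(clean_probs) @ np.asarray(transition)

# Binary sentiment example: the "clean" network says 90% positive, and we
# assume 20% of positive labels and 10% of negative labels are flipped.
clean = [0.9, 0.1]                   # [P(pos), P(neg)] from the clean-label network
T = [[0.8, 0.2],                     # pos -> (pos, neg)
     [0.1, 0.9]]                     # neg -> (pos, neg)
print(noisy_label_probs(clean, T))   # -> [0.73, 0.27], matched against noisy labels
```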

2018

pdf bib
A Neural Question Answering Model Based on Semi-Structured Tables
Hao Wang | Xiaodong Zhang | Shuming Ma | Xu Sun | Houfeng Wang | Mengxiang Wang
Proceedings of the 27th International Conference on Computational Linguistics

Most question answering (QA) systems are based on raw text and structured knowledge graphs. However, raw text corpora are hard for QA systems to understand, and structured knowledge graphs require intensive manual work, whereas it is relatively easy to obtain semi-structured tables from many sources directly or to build them automatically. In this paper, we build an end-to-end system that answers multiple-choice questions with semi-structured tables as its knowledge. Our system answers queries in two steps. First, it finds the most similar tables. Then the system measures the relevance between the question and candidate table cells and chooses the most related cell as the source of the answer. The system is evaluated on the TabMCQ dataset and achieves a large improvement over the state of the art.
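
The two-step pipeline can be sketched with a crude word-overlap scorer standing in for the learned similarity and relevance models (illustrative only; the toy tables below are invented):

```python
def overlap(a, b):
    """Word-overlap similarity, a crude stand-in for the learned scorers."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / (len(a | b) or 1)

def table_text(table):
    return table["caption"] + " " + " ".join(c for row in table["rows"] for c in row)

def answer(question, tables):
    """Step 1: retrieve the table most similar to the question.
    Step 2: pick the most relevant cell of that table as the answer source."""
    best_table = max(tables, key=lambda t: overlap(question, table_text(t)))
    cells = [c for row in best_table["rows"] for c in row]
    return max(cells, key=lambda c: overlap(question, c))

tables = [
    {"caption": "boiling points of liquids",
     "rows": [["water boils at 100 degrees", "ethanol boils at 78 degrees"]]},
    {"caption": "planet facts",
     "rows": [["Mars has two moons", "Venus has no moons"]]},
]
print(answer("How many moons does Mars have", tables))  # -> "Mars has two moons"
```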

2017

pdf bib
A Transition-based System for Universal Dependency Parsing
Hao Wang | Hai Zhao | Zhisong Zhang
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes the system for our participation in the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In this work, we design a system based on UDPipe for universal dependency parsing, where multilingual transition-based models are trained for different treebanks. Our system directly takes raw texts as input, performs several intermediate steps such as tokenizing and tagging, and finally generates the corresponding dependency trees. For the surprise languages of this task, we adopt a delexicalized strategy and predict based on transfer learning from other related languages. In the final evaluation of the shared task, our system achieves a result of 66.53% in macro-averaged LAS F1-score.

pdf bib
Unsupervised Bilingual Segmentation using MDL for Machine Translation
Bin Shan | Hao Wang | Yves Lepage
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf bib
BTG-based Machine Translation with Simple Reordering Model using Structured Perceptron
Hao Wang | Yves Lepage
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf bib
Using Argument-based Features to Predict and Analyse Review Helpfulness
Haijing Liu | Yang Gao | Pin Lv | Mengxue Li | Shiqiang Geng | Minglan Li | Hao Wang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We study the problem of identifying helpful product reviews in this paper. We observe that evidence-conclusion discourse relations, also known as arguments, often appear in product reviews, and we hypothesise that some argument-based features, e.g., the percentage of argumentative sentences and the evidence-to-conclusion ratio, are good indicators of helpful reviews. To validate this hypothesis, we manually annotate arguments in 110 hotel reviews and investigate the effectiveness of several combinations of argument-based features. Experiments suggest that, when used together with the argument-based features, the state-of-the-art baseline features enjoy a performance boost (in terms of F1) of 11.01% on average.
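
Two of the argument-based features mentioned above can be computed directly from per-sentence argument annotations, as in this small sketch (the label names are illustrative):

```python
def argument_features(sentence_labels):
    """Compute simple argument-based features from per-sentence annotations.

    sentence_labels: one label per sentence, e.g. "evidence", "conclusion", or "none".
    """
    n = len(sentence_labels) or 1
    n_evidence = sentence_labels.count("evidence")
    n_conclusion = sentence_labels.count("conclusion")
    argumentative = n_evidence + n_conclusion
    return {
        "pct_argumentative_sentences": argumentative / n,
        "evidence_conclusion_ratio": n_evidence / (n_conclusion or 1),
    }

# A toy 5-sentence review: two evidence sentences supporting one conclusion.
print(argument_features(["evidence", "none", "evidence", "conclusion", "none"]))
```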

2016

pdf bib
HSSA tree structures for BTG-based preordering in machine translation
Yujia Zhang | Hao Wang | Yves Lepage
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf bib
Yet Another Symmetrical and Real-time Word Alignment Method: Hierarchical Sub-sentential Alignment using F-measure
Hao Wang | Yves Lepage
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf bib
Combining fast_align with Hierarchical Sub-sentential Alignment for Better Word Alignments
Hao Wang | Yves Lepage
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)

fast_align is a simple and fast word alignment tool that is widely used in state-of-the-art machine translation systems. It yields comparable results in end-to-end translation experiments across various language pairs. However, fast_align does not perform as well as GIZA++ when applied to language pairs with distinct word orders, such as English and Japanese. In this paper, given the lexical translation table output by fast_align, we propose to realign words using the hierarchical sub-sentential alignment approach. Experimental results show that this simple additional processing improves the performance of word alignment, measured by counting alignment matches in comparison with fast_align. We also report final machine translation results for both English-Japanese and Japanese-English, and show that our best system provides significant improvements over the baseline as measured by BLEU and RIBES.

2015

pdf bib
結合ANN、全域變異數與真實軌跡挑選之基週軌跡產生方法(A Pitch-contour Generation Method Combining ANN Prediction, Global Variance Matching, and Real-contour Selection)[In Chinese]
Hung-Yan Gu | Kai-Wei Jiang | Hao Wang
Proceedings of the 27th Conference on Computational Linguistics and Speech Processing (ROCLING 2015)

pdf bib
Translation of Unseen Bigrams by Analogy Using an SVM Classifier
Hao Wang | Lu Lyu | Yves Lepage
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2014

pdf bib
A Sentiment-aligned Topic Model for Product Aspect Rating Prediction
Hao Wang | Martin Ester
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
A Dataset for Research on Short-Text Conversations
Hao Wang | Zhengdong Lu | Hang Li | Enhong Chen
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
Hao Wang | Dogan Can | Abe Kazemzadeh | François Bar | Shrikanth Narayanan
Proceedings of the ACL 2012 System Demonstrations
