Xu Liu


2024

pdf bib
Hallucination Diversity-Aware Active Learning for Text Summarization
Yu Xia | Xu Liu | Tong Yu | Sungchul Kim | Ryan Rossi | Anup Rao | Tung Mai | Shuai Li
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have shown propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported. Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs. Moreover, most of these methods focus on a specific type of hallucination, e.g., entity or token errors, which limits their effectiveness in addressing various types of hallucinations exhibited in LLM outputs. To our best knowledge, in this paper we propose the first active learning framework to alleviate LLM hallucinations, reducing costly human annotations of hallucination needed. By measuring fine-grained hallucinations from errors in semantic frame, discourse and content verifiability in text summarization, we propose HAllucination Diversity-Aware Sampling (HADAS) to select diverse hallucinations for annotations in active learning for LLM finetuning. Extensive experiments on three datasets and different backbone models demonstrate advantages of our method in effectively and efficiently mitigating LLM hallucinations.

2023

pdf bib
Fine-grained Artificial Neurons in Audio-transformers for Disentangling Neural Auditory Encoding
Mengyue Zhou | Xu Liu | David Liu | Zihao Wu | Zhengliang Liu | Lin Zhao | Dajiang Zhu | Lei Guo | Junwei Han | Tianming Liu | Xintao Hu
Findings of the Association for Computational Linguistics: ACL 2023

The Wav2Vec and its variants have achieved unprecedented success in computational auditory and speech processing. Meanwhile, neural encoding studies that integrate the superb representation capability of Wav2Vec and link those representations to brain activities have provided novel insights into a fundamental question of how auditory and speech processing unfold in the human brain. Without an explicit definition, most existing studies treat each transformer encoding layer in Wav2Vec as a single artificial neuron (AN). That is, the layer-level embeddings are used to predict neural responses. However, the comprehensive layer-level embedding aggregates multiple types of contextual attention captured by multi-head self-attention (MSA) modules. Thus, the layer-level ANs lack fine-granularity for neural encoding. To address this limitation, we define the elementary units, i.e., each hidden dimension, as neuron-level ANs in Wav2Vec2.0, quantify their temporal responses, and couple those ANs with their biological-neuron (BN) counterparts in the human brain. Our experimental results demonstrated that: 1) The proposed neuron-level ANs carry meaningful neurolinguistic information; 2) Those ANs anchor to their BN signatures; 3) The AN-BN anchoring patterns are interpretable from a neurolinguistic perspective. More importantly, our results suggest an intermediate stage in both the computational representation in Wav2Vec2.0 and the cortical representation in the brain. Our study validates the fine-grained ANs in Wav2Vec2.0, which may serve as a novel and general strategy to link transformer-based deep learning models to neural responses for probing the sensory processing in the brain.

2021

pdf bib
Multi-Scale Progressive Attention Network for Video Question Answering
Zhicheng Guo | Jiaxuan Zhao | Licheng Jiao | Xu Liu | Lingling Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Understanding the multi-scale visual information in a video is essential for Video Question Answering (VideoQA). Therefore, we propose a novel Multi-Scale Progressive Attention Network (MSPAN) to achieve relational reasoning between cross-scale video information. We construct clips of different lengths to represent different scales of the video. Then, the clip-level features are aggregated into node features by using max-pool, and a graph is generated for each scale of clips. For cross-scale feature interaction, we design a message passing strategy between adjacent scale graphs, i.e., top-down scale interaction and bottom-up scale interaction. Under the question’s guidance of progressive attention, we realize the fusion of all-scale video features. Experimental evaluations on three benchmarks: TGIF-QA, MSVD-QA and MSRVTT-QA show our method has achieved state-of-the-art performance.