2024
Advancing Large Language Model Attribution through Self-Improving
Lei Huang | Xiaocheng Feng | Weitao Ma | Liang Zhao | Yuchun Fan | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive to produce. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model’s attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations or more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.
Learning Fine-Grained Grounded Citations for Attributed Large Language Models
Lei Huang | Xiaocheng Feng | Weitao Ma | Yuxuan Gu | Weihong Zhong | Xiachong Feng | Weijiang Yu | Weihua Peng | Duyu Tang | Dandan Tu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2024
Despite their impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, demonstrate potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of merely citing document identifiers makes it harder for users to pinpoint specific supporting evidence. In this work, we introduce FRONT, a training framework that teaches LLMs to generate Fine-grained grounded citations. FRONT first grounds fine-grained supporting quotes, which then guide the generation process. These quotes not only provide supervision signals to improve citation quality but also serve as fine-grained attributions. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT.
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Weihong Zhong | Xiaocheng Feng | Liang Zhao | Qiming Li | Lei Huang | Yuxuan Gu | Weitao Ma | Yuan Xu | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this Multimodal Hallucination Snowballing. To mitigate this issue, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.
2020
The Medical Scribe: Corpus Development and Model Performance Analyses
Izhak Shafran | Nan Du | Linh Tran | Amanda Perry | Lauren Keyes | Mark Knichel | Ashley Domin | Lei Huang | Yu-hui Chen | Gang Li | Mingqiu Wang | Laurent El Shafey | Hagen Soltau | Justin Stuart Paul
Proceedings of the Twelfth Language Resources and Evaluation Conference
There is a growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters. This was used to train a state-of-the-art tagging model. We report ontologies, labeling results, model performances, and detailed analyses of the results. Our results show that the entities related to medications can be extracted with a relatively high accuracy of 0.90 F-score, followed by symptoms at 0.72 F-score, and conditions at 0.57 F-score. In our task, we not only identify where the symptoms are mentioned but also map them to canonical forms as they appear in the clinical notes. Of the different types of errors, in about 19-38% of the cases, we find that the model output was correct, and about 17-32% of the errors do not impact the clinical note. Taken together, the models developed in this work are more useful than the F-scores reflect, making it a promising approach for practical applications.
2015
Chinese Spelling Check System Based on N-gram Model
Weijian Xie | Peijie Huang | Xinrui Zhang | Kaiduo Hong | Qiang Huang | Bingzhou Chen | Lei Huang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing
Sentence-level Emotion Classification with Label and Context Dependence
Shoushan Li | Lei Huang | Rong Wang | Guodong Zhou
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Semi-Stacking for Semi-supervised Sentiment Classification
Shoushan Li | Lei Huang | Jingjing Wang | Guodong Zhou
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
2014
Chinese Spelling Check System Based on Tri-gram Model
Qiang Huang | Peijie Huang | Xinrui Zhang | Weijian Xie | Kaiduo Hong | Bingzhou Chen | Lei Huang
Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing