Xiao Shi


2023

Hallucination Mitigation in Natural Language Generation from Large-Scale Open-Domain Knowledge Graphs
Xiao Shi | Zhengyuan Zhu | Zeyu Zhang | Chengkai Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In generating natural language descriptions for knowledge graph triples, prior works used either small-scale, human-annotated datasets or datasets with a limited variety of graph shapes, e.g., those having mostly star graphs. Graph-to-text models trained and evaluated on such datasets are largely not assessed for more realistic large-scale, open-domain settings. We introduce a new dataset, GraphNarrative, to fill this gap. Fine-tuning transformer-based pre-trained language models has achieved state-of-the-art performance among graph-to-text models. However, this method suffers from information hallucination: the generated text may contain fabricated facts not present in input graphs. We propose a novel approach that, given a graph-sentence pair in GraphNarrative, trims the sentence to eliminate portions that are not present in the corresponding graph, by utilizing the sentence’s dependency parse tree. Our experimental results verify this approach using models trained on GraphNarrative and existing datasets. The dataset, source code, and trained models are released at https://github.com/idirlab/graphnarrator.

2021

A Dashboard for Mitigating the COVID-19 Misinfodemic
Zhengyuan Zhu | Kevin Meng | Josue Caraballo | Israa Jaradat | Xiao Shi | Zeyu Zhang | Farahnaz Akrami | Haojin Liao | Fatma Arslan | Damian Jimenez | Mohanmmed Samiul Saeef | Paras Pathak | Chengkai Li
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This paper describes the current milestones achieved in our ongoing project that aims to understand the surveillance of, impact of, and intervention on the COVID-19 misinfodemic on Twitter. Specifically, it introduces a public dashboard which, in addition to displaying case counts in an interactive map and a navigational panel, also provides some unique features not found in other places. In particular, the dashboard uses a curated catalog of COVID-19-related facts and debunks of misinformation, and it displays the most prevalent information from the catalog among Twitter users in user-selected U.S. geographic regions. The paper explains how to use BERT models to match tweets with the facts and misinformation and to detect their stance towards such information. The paper also discusses the results of preliminary experiments on analyzing the spatio-temporal spread of misinformation.