Zifei Shan


2024

pdf bib
TruthReader: Towards Trustworthy Document Assistant Chatbot with Reliable Attribution
Dongfang Li | Xinshuo Hu | Zetian Sun | Baotian Hu | Shaolin Ye | Zifei Shan | Qian Chen | Min Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Document assistant chatbots are empowered with extensive capabilities by Large Language Models (LLMs) and have exhibited significant advancements. However, these systems may suffer from hallucinations that are difficult to verify in the context of given documents.Moreover, despite the emergence of products for document assistants, they either heavily rely on commercial LLM APIs or lack transparency in their technical implementations, leading to expensive usage costs and data privacy concerns. In this work, we introduce a fully open-source document assistant chatbot with reliable attribution, named TruthReader, utilizing adapted conversational retriever and LLMs. Our system enables the LLMs to generate answers with detailed inline citations, which can be attributed to the original document paragraphs, facilitating the verification of the factual consistency of the generated text. To further adapt the generative model, we develop a comprehensive pipeline consisting of data construction and model optimization processes.This pipeline equips the LLMs with the necessary capabilities to generate accurate answers, produce reliable citations, and refuse unanswerable questions. Our codebase, data and models are released, and the video demonstration of our system is available at https://youtu.be/RYVt3itzUQM.

2020

pdf bib
Entity Linking in 100 Languages
Jan A. Botha | Zifei Shan | Daniel Gillick
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large-scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.