Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer

Sourav Das, Sanjay Chatterji, Imon Mukherjee


Abstract
Large language models (LLMs) have seen a meteoric rise recently. With their prominence, hallucination and misinformation generation have become severe problems as well. To combat this issue, we propose a contextual topic modeling approach called Co-LDA for generative transformers. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information on any designated topic. In our experiments, this approach addresses micro-hallucination and incorrect-information issues observed in LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, offering a measure of accuracy and authenticity for generated text. Evaluation on benchmark datasets, including question answering, language understanding, and language similarity, demonstrates the effectiveness of our text generation method, which surpasses some state-of-the-art transformer models.
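The abstract does not give the formula for the Perplexity-Similarity Score, so the following is only an illustrative sketch, under the assumption that such a score combines two standard ingredients: language-model perplexity over the generated text and a lexical similarity (here, bag-of-words cosine similarity) between generated and original documents. The function names and the combination are hypothetical, not the authors' definitions.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """Perplexity from per-token probabilities assigned by a language model.
    Lower perplexity means the model found the text more predictable."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two documents (toy lexical proxy
    for the semantic similarity the paper measures with its own method)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical combination: reward similarity to the source, penalize
# high perplexity (this weighting is an assumption, not from the paper).
def perplexity_similarity_score(token_probs, generated, original):
    return cosine_similarity(generated, original) / perplexity(token_probs)
```

For example, a uniform probability of 0.25 over four tokens yields a perplexity of exactly 4, and identical documents yield a cosine similarity of 1.0, so the combined score in that case would be 0.25 under this hypothetical weighting.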
Anthology ID:
2023.nlp4dh-1.18
Volume:
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2023
Address:
Tokyo, Japan
Editors:
Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Venues:
NLP4DH | IWCLUL
Publisher:
Association for Computational Linguistics
Pages:
143–152
URL:
https://aclanthology.org/2023.nlp4dh-1.18
Cite (ACL):
Sourav Das, Sanjay Chatterji, and Imon Mukherjee. 2023. Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 143–152, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer (Das et al., NLP4DH-IWCLUL 2023)
PDF:
https://aclanthology.org/2023.nlp4dh-1.18.pdf