Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians

Jan Hoffbauer, Sylwester Sawicki, Marc Ulrich, Tolga Buz, Konstantin Dobler, Moritz Schneider, Gerard De Melo


Abstract
Powerful LLMs like ChatGPT are adopted rapidly for a wide array of tasks, but their limitations in domain-specific areas become apparent, particularly when prompted to recite facts. This is especially critical for knowledge workers, who are adopting LLM-based tools rapidly. While there are various techniques that can help ingest knowledge into LLMs, such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different model sizes from 1.3B to 7B parameters and of other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining. In addition, we measure linguistic metrics and human and LLM-based preference. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining, unless the model is overfit to the tested knowledge. Fine-tuning on our Reddit dataset introduces less complex but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with a company’s datasets.
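The abstract refers to continued pretraining with LoRA adapters and quantization on a curated domain corpus. The following is a minimal sketch of such a setup, assuming the Hugging Face Transformers, PEFT, Datasets, and bitsandbytes libraries; the base model name, corpus path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumed 7B-range base model; swap in whichever checkpoint is being studied.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 4-bit quantization keeps the frozen base weights small in memory.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapter matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Hypothetical path to a curated domain corpus, e.g. r/AskHistorians threads as JSONL.
dataset = load_dataset("json", data_files="askhistorians_corpus.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-lora", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

Training for more epochs on the same corpus would move the setup toward the overfitting regime the paper evaluates, where knowledge accuracy on the tested facts rises at the cost of generalization.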
Anthology ID: 2024.knowllm-1.9
Volume: Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Sha Li, Manling Li, Michael JQ Zhang, Eunsol Choi, Mor Geva, Peter Hase, Heng Ji
Venues: KnowLLM | WS
Publisher: Association for Computational Linguistics
Pages: 96–108
URL: https://aclanthology.org/2024.knowllm-1.9
Cite (ACL): Jan Hoffbauer, Sylwester Sawicki, Marc Ulrich, Tolga Buz, Konstantin Dobler, Moritz Schneider, and Gerard De Melo. 2024. Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians. In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), pages 96–108, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians (Hoffbauer et al., KnowLLM-WS 2024)
PDF: https://aclanthology.org/2024.knowllm-1.9.pdf