Konstantin Dobler
2024
Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians
Jan Hoffbauer | Sylwester Sawicki | Marc Ulrich | Tolga Buz | Konstantin Dobler | Moritz Schneider | Gerard de Melo
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
Powerful LLMs like ChatGPT are adopted rapidly for a wide array of tasks, but their limitations in domain-specific areas become apparent, particularly when prompted to recite facts. This is especially critical for knowledge workers, who are adopting LLM-based tools rapidly. While there are various techniques that can help ingest knowledge into LLMs, such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different model sizes from 1.3B to 7B parameters and other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining. In addition, we measure linguistic metrics and human and LLM-based preference. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining – unless the model is overfit to the tested knowledge. Fine-tuning on our Reddit dataset introduces less complex but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with a company’s datasets.
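For context, a continued-pretraining setup of the kind compared in this study might look roughly like the following sketch using Hugging Face transformers and peft; the base model name, the LoRA hyperparameters, and the `askhistorians.jsonl` file are illustrative assumptions, not the authors’ exact configuration.

```python
# Minimal sketch (assumed setup, not the authors' code): continued pretraining of a
# causal LM with a LoRA adapter, one of the configurations examined in the paper.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1.4b"  # illustrative stand-in for a 1.3B-7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train only low-rank adapter weights instead of the full model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query_key_value"], task_type="CAUSAL_LM"))

# Hypothetical local JSONL export of the curated r/AskHistorians data.
data = load_dataset("json", data_files="askhistorians.jsonl", split="train")
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-lora", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A quantized variant would additionally load the base model in 8-bit or 4-bit precision before attaching the adapter; full continued pretraining would simply skip the LoRA wrapping and update all weights.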
2023
FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models
Konstantin Dobler | Gerard de Melo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model’s embedding matrix. In this paper, we propose FOCUS - **F**ast **O**verlapping Token **C**ombinations **U**sing **S**parsemax, a novel embedding initialization method that effectively initializes the embedding matrix for a new tokenizer based on information in the source model’s embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work on language modeling and on a range of downstream tasks (NLI, QA, and NER). We publish our model checkpoints and code on GitHub.
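As a rough illustration of the initialization described above (not the released FOCUS implementation, which is available on the authors’ GitHub), new target-vocabulary tokens can be built as sparse convex combinations of the source embeddings of overlapping tokens, with similarities taken from an auxiliary static embedding space. The function name, the entmax package’s sparsemax, and the `static_emb` lookup are assumptions made for this sketch.

```python
# Illustrative sketch of the overlap-based initialization idea (assumed details).
import torch
from entmax import sparsemax  # sparsemax implementation; assumed available

def focus_style_init(source_emb, src_vocab, tgt_vocab, static_emb):
    """source_emb: [|V_src|, d] tensor; src_vocab/tgt_vocab: token -> id dicts;
    static_emb: token -> auxiliary static embedding (torch vector)."""
    d = source_emb.size(1)
    overlap = [t for t in tgt_vocab if t in src_vocab and t in static_emb]
    overlap_static = torch.stack([static_emb[t] for t in overlap])              # [n, d_s]
    overlap_source = torch.stack([source_emb[src_vocab[t]] for t in overlap])   # [n, d]

    target_emb = torch.empty(len(tgt_vocab), d)
    for tok, idx in tgt_vocab.items():
        if tok in src_vocab:
            # Overlapping tokens are copied directly from the source embedding matrix.
            target_emb[idx] = source_emb[src_vocab[tok]]
        elif tok in static_emb:
            # New tokens: sparse convex combination of overlapping tokens,
            # weighted by similarity in the static embedding space.
            sims = torch.nn.functional.cosine_similarity(
                static_emb[tok].unsqueeze(0), overlap_static)
            weights = sparsemax(sims.unsqueeze(0), dim=-1).squeeze(0)  # mostly zeros
            target_emb[idx] = weights @ overlap_source
        else:
            # Fallback for tokens without a static embedding (e.g. small random init).
            target_emb[idx] = torch.randn(d) * 0.02
    return target_emb
```

Because sparsemax assigns exactly zero weight to dissimilar tokens, each new embedding is composed from only a handful of semantically related overlapping tokens rather than a dense average over the whole overlap.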