Lukas Hilgert
2024
Evaluating and Training Long-Context Large Language Models for Question Answering on Scientific Papers
Lukas Hilgert
|
Danni Liu
|
Jan Niehues
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
With the number of scientific papers published every year growing and current large language models (LLMs) showing state-of-the-art performance on natural language processing (NLP) tasks, we ask the question if LLMs could be utilized to answer questions on scientific papers.We investigate how well state-of-the-art large language models (LLMs) can answer questions on scientific paper by experimenting with long-context versions of the LLaMA 2 model and evaluating and training on the Qasper dataset.We analyze how well the LLMs handle longer papers and questions that can only be answered by accessing information from far out paragraphs. During our experiments, we see that the performance of these LLMs drops with growing length and position of relevant information.We employ different measures from simple prompts to chain-of-thought prompts and zero-shot usage to fine-tuning with QLoRA.While we still observe a performance loss with increased context length, our measures reduce the effects of this flaw, and we can achieve F1 scores similar to bigger models like GPT-4.
Search