2024
Systematic Evaluation of Long-Context LLMs on Financial Concepts
Lavanya Gupta | Saket Sharma | Yiyun Zhao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Long-context large language models (LC LLMs) promise to increase the reliability of LLMs in real-world tasks that require processing and understanding long input documents. However, the ability of LC LLMs to reliably utilize their growing context windows remains under investigation. In this work, we create a real-world financial news dataset and evaluate the performance of the state-of-the-art GPT-4 suite of LC LLMs on a series of progressively challenging tasks, as a function of factors such as context length, task difficulty, and position of key information. Our findings indicate that LC LLMs exhibit brittleness at longer context lengths even for simple tasks, with performance deteriorating sharply as task complexity increases. At longer context lengths, these state-of-the-art models experience catastrophic failures in instruction following, resulting in degenerate outputs. Our prompt ablations also reveal continued sensitivity both to the placement of the task instruction in the context window and to minor markdown formatting. Finally, we advocate for more rigorous evaluation of LC LLMs, employing holistic metrics such as F1 (rather than recall) and reporting confidence intervals, thereby ensuring robust and conclusive findings.
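The abstract's closing recommendation, reporting F1 together with confidence intervals rather than recall alone, can be sketched concretely. The snippet below is a minimal illustration, not the paper's evaluation code: the function names, the per-example (tp, fp, fn) representation, and the bootstrap settings are all assumptions.

```python
import random

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def bootstrap_f1_ci(examples, n_resamples=1000, alpha=0.05, seed=0):
    """Point F1 plus a percentile-bootstrap CI over per-example counts.

    `examples` is a list of (tp, fp, fn) tuples, one per evaluated item.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        # Resample whole examples with replacement, then re-aggregate.
        sample = [rng.choice(examples) for _ in examples]
        scores.append(f1_score(sum(e[0] for e in sample),
                               sum(e[1] for e in sample),
                               sum(e[2] for e in sample)))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    point = f1_score(sum(e[0] for e in examples),
                     sum(e[1] for e in examples),
                     sum(e[2] for e in examples))
    return point, (lo, hi)
```

Resampling whole examples, rather than individual counts, preserves the within-item correlation between true positives and errors, which is one common way to attach honest uncertainty to an aggregate metric.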
Optimizing LLM Based Retrieval Augmented Generation Pipelines in the Financial Domain
Yiyun Zhao | Prateek Singh | Hanoz Bhathena | Bernardo Ramos | Aviral Joshi | Swaroop Gadiyaram | Saket Sharma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Retrieval Augmented Generation (RAG) is a prominent approach in real-world applications for grounding large language model (LLM) generations in up-to-date and domain-specific knowledge. However, systematic investigation of the impact of each component (retrieval quality, prompts, generation models) on the generation quality of a RAG pipeline in real-world scenarios is lacking. In this study, we benchmark 6 LLMs across 15 retrieval scenarios, exploring 9 prompts over 2 real-world financial-domain datasets. We thoroughly discuss the impact of each component of the RAG pipeline on answer generation quality and formulate specific recommendations for the design of RAG systems.
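As a rough illustration of the three axes the study varies (retrieval scenario, prompt template, generation model), here is a minimal RAG loop with each component injected as a swappable callable. This is a sketch under assumed interfaces, not the benchmarked pipeline; `bm25_search`, `TEMPLATE`, and `call_llm` in the usage comment are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    """Minimal RAG loop: retrieve passages, assemble a prompt, generate.

    Each field is swappable, so retrievers, prompt templates, and
    generation models can be compared independently.
    """
    retrieve: Callable[[str, int], List[str]]  # (query, k) -> passages
    prompt_template: str                       # expects {context} and {question}
    generate: Callable[[str], str]             # prompt -> answer

    def answer(self, question: str, k: int = 5) -> str:
        passages = self.retrieve(question, k)
        context = "\n\n".join(passages)
        prompt = self.prompt_template.format(context=context, question=question)
        return self.generate(prompt)

# Hypothetical usage: plug in any retriever/LLM pair to compare scenarios.
# pipeline = RAGPipeline(retrieve=bm25_search, prompt_template=TEMPLATE,
#                        generate=call_llm)
# print(pipeline.answer("What was Q3 revenue?"))
```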
2023
When and how to paraphrase for named entity recognition?
Saket Sharma | Aviral Joshi | Yiyun Zhao | Namrata Mukhija | Hanoz Bhathena | Prateek Singh | Sashank Santhanam
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While paraphrasing is a promising approach for data augmentation in classification tasks, its effect on named entity recognition (NER) has not been investigated systematically due to the difficulty of span-level label preservation. In this paper, we utilize simple strategies to annotate entity spans in generations and compare established and novel paraphrasing methods in NLP, such as back-translation, specialized encoder-decoder models such as Pegasus, and GPT-3 variants, for their effectiveness in improving downstream NER performance across different levels of gold annotations and paraphrasing strength on 5 datasets. We thoroughly explore the influence of paraphrasers, and the dynamics between paraphrasing strength and gold dataset size, on NER performance with visualizations and statistical testing. We find that the choice of paraphraser greatly impacts NER performance: one of the larger GPT-3 variants is exceedingly capable of generating high-quality paraphrases, yielding statistically significant improvements in NER performance with increasing paraphrasing strength, while other paraphrasers show more mixed results. Additionally, inline auto-annotations generated by the larger GPT-3 variant are strictly better than heuristic-based annotations. We also find diminishing benefits of paraphrasing as gold annotations increase for most datasets. Furthermore, while most paraphrasers promote entity memorization in NER, the proposed GPT-3 configuration performs most favorably among the compared paraphrasers when tested on unseen entities, with memorization decreasing further as paraphrasing strength increases. Finally, we explore mention replacement using GPT-3, which provides additional benefits over base paraphrasing for specific datasets.
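The "simple strategies to annotate entity spans in generations" mentioned above can be pictured with a small decoder for inline annotations. The bracketed `[TYPE text]` scheme below is an illustrative assumption, not the paper's exact format; the sketch only shows how span labels can survive a paraphrase when the generator emits inline markers.

```python
import re

# Assumed inline scheme (for illustration only): entities are wrapped as
# [TYPE text], e.g. "[PER Ada Lovelace] worked in [LOC London]".
INLINE = re.compile(r"\[([A-Z]+) ([^\]]+)\]")

def decode_inline_annotations(text):
    """Strip inline markers; return (plain_text, [(start, end, type), ...])."""
    plain, spans = [], []
    cursor = 0   # position in the annotated input
    offset = 0   # length of plain text emitted so far
    for m in INLINE.finditer(text):
        plain.append(text[cursor:m.start()])
        offset += m.start() - cursor
        entity = m.group(2)
        spans.append((offset, offset + len(entity), m.group(1)))
        plain.append(entity)
        offset += len(entity)
        cursor = m.end()
    plain.append(text[cursor:])
    return "".join(plain), spans

text, spans = decode_inline_annotations("[PER Ada Lovelace] worked in [LOC London].")
# text  -> "Ada Lovelace worked in London."
# spans -> [(0, 12, "PER"), (23, 29, "LOC")]
```

Decoding markers back into character offsets like this is what makes paraphrased sentences usable as NER training data, since span labels are recovered directly from the generation rather than re-derived heuristically.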