Lee Quartey
2024
Towards an Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries
Shao Min Tan | Quentin Grail | Lee Quartey
Proceedings of the Natural Legal Language Processing Workshop 2024
Long-form abstractive summarization is a task that has particular importance in the legal domain. Automated evaluation metrics are important for the development of text generation models, but existing research on the evaluation of generated summaries has focused mainly on short summaries. We introduce an automated evaluation methodology for generated long-form legal summaries, which involves breaking each summary into individual points, comparing the points in a human-written summary with those in a machine-generated summary, and calculating a recall and precision score for the latter. The method is designed to be particularly suited to the complexities of legal text, and is also fully interpretable. We also create and release a small meta-dataset for the benchmarking of evaluation methods, focusing on long-form legal summarization. Our evaluation metric corresponds better with human evaluation than existing metrics, which were not developed for legal data.
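The pointwise scoring described in this abstract can be illustrated with a minimal sketch. It assumes each summary has already been broken into a list of points, and it uses a simple token-overlap test as a stand-in for the point comparison step; the abstract does not say how points are extracted or compared, so `token_overlap_match`, its `threshold` parameter, and `pointwise_scores` are hypothetical names introduced here for illustration only.

```python
from typing import Callable, List, Tuple


def token_overlap_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical matcher: two points match if their Jaccard token overlap
    reaches the threshold. A stand-in for whatever comparison the paper uses."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return False
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) >= threshold


def pointwise_scores(
    reference_points: List[str],
    generated_points: List[str],
    matches: Callable[[str, str], bool] = token_overlap_match,
) -> Tuple[float, float]:
    """Return (precision, recall) of the generated summary's points.

    Recall: share of reference points matched by at least one generated point.
    Precision: share of generated points matched by at least one reference point.
    """
    covered = sum(
        any(matches(ref, gen) for gen in generated_points)
        for ref in reference_points
    )
    supported = sum(
        any(matches(ref, gen) for ref in reference_points)
        for gen in generated_points
    )
    recall = covered / len(reference_points) if reference_points else 0.0
    precision = supported / len(generated_points) if generated_points else 0.0
    return precision, recall
```

Because every score traces back to specific matched or unmatched points, a sketch of this shape stays inspectable point by point, which is in the spirit of the interpretability the abstract mentions. The matcher is passed in as a parameter so a stronger comparison can replace the token-overlap placeholder.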
2023
Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks
Fangyi Yu | Lee Quartey | Frank Schilder
Findings of the Association for Computational Linguistics: ACL 2023
The use of large language models (LLMs) for zero- or few-shot prompting in natural language processing has given rise to a new research area known as prompt engineering. Recent studies have demonstrated that Chain-of-Thought (CoT) prompts can lead to significant improvements in tasks such as arithmetic and common-sense reasoning. This paper explores the use of such approaches in legal reasoning tasks by conducting experiments on the COLIEE entailment task, which is based on the Japanese Bar exam. We evaluate zero-shot/few-shot and fine-tuning approaches with and without explanations, as well as various prompting strategies. Our results indicate that while CoT prompting and fine-tuning with explanations can improve performance, the best results are achieved with prompts derived from specific legal reasoning techniques, such as IRAC (Issue, Rule, Application, Conclusion). In addition, we observe that few-shot learning in which the demonstrations are derived from clustering past training data consistently yields high performance on the COLIEE entailment task for both years of data that we tested. Through our experiments, we improve the previous best result on the 2021 COLIEE task from 0.7037 to 0.8025 and surpass the best system from 2022 with an accuracy of 0.789.