CDD: A Large Scale Dataset for Legal Intelligence Research

Changzhen Ji, Yating Zhang, Adam Jatowt, Haipang Wu


Abstract
As an important application of Artificial Intelligence, legal intelligence has recently attracted the attention of many researchers. Previous works investigated diverse issues such as predicting crimes, predicting outcomes of judicial debates, or extracting information and knowledge from various kinds of legal documents. Although many advances have been made, research on supporting the prediction of court judgments remains relatively scarce, and the lack of large-scale data resources limits its development. In this paper, we present a novel, large-scale Court Debate Dataset (CDD), which includes 30,481 court cases totaling 1,144,425 utterances. CDD contains real-world conversations involving judges, plaintiffs and defendants in court trials. To construct this dataset, we invited experienced judges to design appropriate labels for data records. We then asked law school students to provide annotations based on the defined labels. The dataset can be applied to several downstream tasks, such as text summarization, dialogue generation, and text classification. We introduce the details of the different tasks in the rapidly developing field of legal intelligence whose research our dataset can foster, and we provide the corresponding benchmark performance.
Anthology ID:
2023.emnlp-industry.7
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2023
Address:
Singapore
Editors:
Mingxuan Wang, Imed Zitouni
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
66–73
URL:
https://aclanthology.org/2023.emnlp-industry.7
DOI:
10.18653/v1/2023.emnlp-industry.7
Cite (ACL):
Changzhen Ji, Yating Zhang, Adam Jatowt, and Haipang Wu. 2023. CDD: A Large Scale Dataset for Legal Intelligence Research. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 66–73, Singapore. Association for Computational Linguistics.
Cite (Informal):
CDD: A Large Scale Dataset for Legal Intelligence Research (Ji et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-industry.7.pdf
Video:
https://aclanthology.org/2023.emnlp-industry.7.mp4