LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts

Don Tuggener, Pius von Däniken, Thomas Peetz, Mark Cieliebak


Abstract
We present LEDGAR, a multi-label corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Given its large label set of over 12,000 labels, annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies. We discuss several methods to sample subcorpora from the corpus, and we implement and evaluate different automatic classification approaches. Finally, we perform transfer experiments to evaluate how well the classifiers perform on contracts stemming from outside the corpus.
Anthology ID:
2020.lrec-1.155
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1235–1241
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.155
Cite (ACL):
Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1235–1241, Marseille, France. European Language Resources Association.
Cite (Informal):
LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts (Tuggener et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.155.pdf