GerDaLIR: A German Dataset for Legal Information Retrieval

Marco Wrzalik, Dirk Krechel


Abstract
We present GerDaLIR, a German Dataset for Legal Information Retrieval based on case documents from the open legal information platform Open Legal Data. The dataset consists of 123K queries, each labelled with at least one relevant document in a collection of 131K case documents. We conduct several baseline experiments including BM25 and a state-of-the-art neural re-ranker. With our dataset, we aim to provide a standardized benchmark for German LIR and promote open research in this area. Beyond that, our dataset comprises sufficient training data to be used as a downstream task for German or multilingual language models.
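
The abstract names BM25 as one of the baselines. As a rough illustration of what such a lexical baseline computes, the following is a minimal sketch of textbook BM25 scoring (with the Lucene-style smoothed IDF) in plain Python. It is not the authors' experimental setup: the whitespace tokenizer, the parameter values k1=1.2 and b=0.75, the function names, and the three toy German sentences are all assumptions made for illustration and are not taken from the dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization; a real German
    # pipeline would likely add stemming or compound splitting.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score.

    Assumes `docs` is a non-empty list of strings.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for idx, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection, not drawn from GerDaLIR.
docs = [
    "der kläger verlangt schadensersatz wegen mangelhafter ware",
    "die beklagte beantragt die abweisung der klage",
    "das gericht weist die berufung zurück",
]
ranking = bm25_rank("schadensersatz mangelhafter ware", docs)
print(ranking)
```

In this toy run only the first document shares terms with the query, so it is ranked first and the other two receive a score of zero. A neural re-ranker, as mentioned in the abstract, would typically re-score the top candidates returned by a first stage like this one.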
Anthology ID:
2021.nllp-1.13
Volume:
Proceedings of the Natural Legal Language Processing Workshop 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Nikolaos Aletras, Ion Androutsopoulos, Leslie Barrett, Catalina Goanta, Daniel Preotiuc-Pietro
Venue:
NLLP
Publisher:
Association for Computational Linguistics
Pages:
123–128
URL:
https://aclanthology.org/2021.nllp-1.13
DOI:
10.18653/v1/2021.nllp-1.13
Cite (ACL):
Marco Wrzalik and Dirk Krechel. 2021. GerDaLIR: A German Dataset for Legal Information Retrieval. In Proceedings of the Natural Legal Language Processing Workshop 2021, pages 123–128, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
GerDaLIR: A German Dataset for Legal Information Retrieval (Wrzalik & Krechel, NLLP 2021)
PDF:
https://aclanthology.org/2021.nllp-1.13.pdf