Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing

Doaa Samy, Jerónimo Arenas-García, David Pérez-Fernández


Abstract
Legal-ES is an open-source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and several kinds of language models, including word embeddings and topic models. The corpus contains over 1000 million words from a collection of open-access legislative and administrative documents in Spanish, drawn from sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was applied to the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models, which provide a convenient tool for understanding the main topics in the corpus and for navigating the documents by exploiting the semantic similarity among them. We will analyse the temporal structure of a dynamic topic model to infer changes in the legal production of the Spanish jurisdiction that have occurred over the analysed time frame.
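The idea of navigating documents by semantic similarity can be illustrated with a minimal sketch. It assumes each document is already represented by a vector of topic proportions, as a topic model would produce. The document names, the three-topic vectors, the choice of cosine similarity, and the helper names `cosine_similarity` and `most_similar` are all hypothetical and are not taken from the Legal-ES resources themselves.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, doc_topics, top_n=3):
    """Rank the other documents by similarity to doc_id, most similar first."""
    query = doc_topics[doc_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in doc_topics.items()
        if other_id != doc_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical topic proportions for three documents (assumed values,
# not real output of any Legal-ES model).
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.20, 0.10],
    "doc_c": [0.05, 0.10, 0.85],
}

ranking = most_similar("doc_a", doc_topics, top_n=2)
for other_id, score in ranking:
    print(f"{other_id}: {score:.3f}")
```

Other measures suited to probability distributions, such as Hellinger distance, could be substituted for cosine similarity without changing the structure of the ranking function.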
Anthology ID:
2020.lt4gov-1.6
Volume:
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
LREC | LT4Gov | WS
Publisher:
European Language Resources Association
Pages:
32–36
Language:
English
URL:
https://aclanthology.org/2020.lt4gov-1.6
Cite (ACL):
Doaa Samy, Jerónimo Arenas-García, and David Pérez-Fernández. 2020. Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing. In Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov), pages 32–36, Marseille, France. European Language Resources Association.
Cite (Informal):
Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing (Samy et al., LT4Gov 2020)
PDF:
https://aclanthology.org/2020.lt4gov-1.6.pdf