2024
KEVLAR: The Complete Resource for EuroVoc Classification of Legal Documents
Lorenzo Bocchi | Camilla Casula | Alessio Palmero Aprosio
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
The use of Machine Learning and Artificial Intelligence in the Public Administration (PA) has increased in recent years. In particular, recent guidelines proposed by various governments for the classification of documents released by the PA suggest using the EuroVoc thesaurus. In this paper, we present KEVLAR, an all-in-one solution for performing this task on acts belonging to the Public Administration. First, we create a collection of 8 million documents in 24 languages, tagged with EuroVoc labels, taken from EUR-Lex, the web portal of the European Union legislation. Then, we train different pre-trained BERT-based models, comparing the performance of base models with domain-specific and multilingual ones. We release the corpus, the best-performing models, and a Docker image containing the source code of the trainer, the REST API, and the web interface. This image can be employed out-of-the-box for document classification.
Title Is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws
Lorenzo Bocchi | Alessio Palmero Aprosio
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore some methods to perform document classification in the legal domain, in order to mitigate the length limitation for input texts in BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences included in the text, with the aim of bringing the most informative part of the document to the beginning of the text. Results show that the title and the context are both important, although the order of the text may not be. Finally, we release on GitHub both the dataset and the source code used for the experiments.
2021
Erase and Rewind: Manual Correction of NLP Output through a Web Interface
Valentino Frasnelli | Lorenzo Bocchi | Alessio Palmero Aprosio
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations
In this paper, we present Tintful, an NLP annotation software that can be used both to manually annotate texts and to fix mistakes in NLP pipelines, such as Stanford CoreNLP. Using a paradigm similar to wiki-like systems, a user who notices a wrong annotation can easily fix it and submit the corrected entry back to the tool developers. Moreover, Tintful can be used to easily annotate data from scratch. The input documents do not need to be in a particular format: starting from the plain text, the sentences are first annotated with CoreNLP, then the user can edit the annotations and submit everything back through a user-friendly interface.
EasyTurk: A User-Friendly Interface for High-Quality Linguistic Annotation with Amazon Mechanical Turk
Lorenzo Bocchi | Valentino Frasnelli | Alessio Palmero Aprosio
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Amazon Mechanical Turk (AMT) has recently become one of the most popular crowd-sourcing platforms, allowing researchers from all over the world to create linguistic datasets quickly and at a relatively low cost. Amazon provides both a web interface and an API for AMT, but they are not very user-friendly and lack some features that can be useful for NLP researchers. In this paper, we present EasyTurk, a free tool that extends Amazon Mechanical Turk with some new features. The tool is released under an open source license.