Gergana Lazarova
2021
A Comparative Study on Abstractive and Extractive Approaches in Summarization of European Legislation Documents
Valentin Zmiycharov, Milen Chechev, Gergana Lazarova, Todor Tsonkov, Ivan Koychev
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Extracting the most important part of legislation documents has great business value because the texts are usually very long and hard to understand. The aim of this article is to evaluate different text summarization algorithms on EU legislation documents, whose content contains domain-specific vocabulary. We collected a text summarization dataset of 1563 EU legal documents, in which the mean summary length is 424 words. Experiments were conducted with different algorithms using the new dataset. A simple extractive algorithm was selected as a baseline. Advanced extractive algorithms that use encoders show better results than the baseline. The best result, measured by ROUGE scores, was achieved by a fine-tuned abstractive T5 model that was adapted to work with long texts.
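For readers unfamiliar with the metric, the following is a minimal sketch of ROUGE-1 (unigram overlap between a generated summary and a reference summary). It illustrates the general idea only and is not the evaluation code used in the paper. The function names `tokenize` and `rouge_1`, the lowercase alphanumeric tokenization, and the example sentences are assumptions made for this sketch. Published results are normally computed with an established ROUGE package, which may also apply stemming and report ROUGE-2 and ROUGE-L.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; real ROUGE tools may also stem.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall and F1 for one candidate/reference pair."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the dataset.
    reference = "The directive sets strict emission limits for vehicles"
    candidate = "The directive sets emission limits"
    print(rouge_1(candidate, reference))
```

Recall rewards a summary for covering the reference, while precision penalizes padding. Comparing systems on F1 balances the two.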