Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller; Michael Gertz


Abstract

Traditionally, Text Simplification is treated as a monolingual translation task in which sentences from source texts are aligned with their simplified counterparts for training. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, which is currently not reflected in existing datasets. At the same time, resources for non-English languages are generally scarce, which makes training new solutions prohibitively difficult. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint Text Simplification and Summarization based on German Wikipedia and the German children’s encyclopedia “Klexikon”, consisting of almost 2,900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and provide statistical evidence that this resource is well suited to simplification as well. Code and data are available on GitHub: https://github.com/dennlinger/klexikon

Anthology ID: 2022.lrec-1.288
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 2693–2701
URL: https://aclanthology.org/2022.lrec-1.288/
Cite (ACL): Dennis Aumiller and Michael Gertz. 2022. Klexikon: A German Dataset for Joint Summarization and Simplification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2693–2701, Marseille, France. European Language Resources Association.
Cite (Informal): Klexikon: A German Dataset for Joint Summarization and Simplification (Aumiller & Gertz, LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.288.pdf
