Yoshinari Nagai


2024

A Document-Level Text Simplification Dataset for Japanese
Yoshinari Nagai | Teruaki Oka | Mamoru Komachi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Document-level text simplification, a task that combines single-document summarization and intra-sentence simplification, has garnered significant attention. However, studies have primarily focused on languages such as English and German, leaving Japanese and similar languages underexplored because of a scarcity of linguistic resources. In this study, we devised JADOS, the first Japanese document-level text simplification dataset based on newspaper articles and Wikipedia. Our dataset focuses on simplification that enhances readability by reducing the number of sentences and tokens in a document. We conducted several investigations using our dataset. First, we analyzed the characteristics of Japanese simplification by comparing it across different domains and with English counterparts. Second, we experimentally evaluated the performance of text summarization methods, transformer-based text simplification models, and large language models. In terms of D-SARI scores, the transformer-based models performed best across all domains. Finally, we manually evaluated several model outputs and target articles, demonstrating the need for document-level text simplification models in Japanese.