Using Bibliodata LODification to Create Metadata-Enriched Literary Corpora in Line with FAIR Principles

Agnieszka Karlinska, Cezary Rosiński, Marek Kubis, Patryk Hubar, Jan Wieczorek


Abstract
This paper discusses the design principles and procedures for creating a balanced corpus for research in computational literary studies, building on the experience of computational linguistics but adapting it to the specificities of the digital humanities. It showcases the development of the Metadata-enriched Polish Novel Corpus from the 19th and 20th centuries (19/20MetaPNC), consisting of 1,000 novels from 1854–1939, as an illustrative case and proposes a comprehensive workflow for the creation and reuse of literary corpora. What sets 19/20MetaPNC apart is its approach to balance, which considers the spatial dimension, the inclusion of non-canonical texts previously overlooked by other corpora, and the use of a complex, multi-stage metadata enrichment and verification process. Emphasis is placed on research-oriented metadata design, efficient data collection and data sharing according to the FAIR principles as well as 5- and 7-star data standards to increase the visibility and reusability of the corpus. A knowledge graph-based solution for the creation of exchangeable and machine-readable metadata describing corpora has been developed. For this purpose, metadata from bibliographic catalogs and other sources were transformed into Linked Data following the bibliodata LODification approach.
Anthology ID:
2024.lrec-main.1500
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
17271–17284
URL:
https://aclanthology.org/2024.lrec-main.1500
Cite (ACL):
Agnieszka Karlinska, Cezary Rosiński, Marek Kubis, Patryk Hubar, and Jan Wieczorek. 2024. Using Bibliodata LODification to Create Metadata-Enriched Literary Corpora in Line with FAIR Principles. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17271–17284, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Using Bibliodata LODification to Create Metadata-Enriched Literary Corpora in Line with FAIR Principles (Karlinska et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1500.pdf