StoryDB: Broad Multi-language Narrative Dataset

Alexey Tikhonov; Igor Samenko; Ivan P. Yamshchikov

doi:10.18653/v1/2021.eval4nlp-1.4

Abstract

This paper presents StoryDB — a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.

Anthology ID: 2021.eval4nlp-1.4
Volume: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, Marina Fomicheva
Venue: Eval4NLP
Publisher: Association for Computational Linguistics
Pages: 32–39
URL: https://aclanthology.org/2021.eval4nlp-1.4/
DOI: 10.18653/v1/2021.eval4nlp-1.4
Bibkey: tikhonov-etal-2021-storydb
Cite (ACL): Alexey Tikhonov, Igor Samenko, and Ivan P. Yamshchikov. 2021. StoryDB: Broad Multi-language Narrative Dataset. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 32–39, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): StoryDB: Broad Multi-language Narrative Dataset (Tikhonov et al., Eval4NLP 2021)
PDF: https://aclanthology.org/2021.eval4nlp-1.4.pdf

Export citation

BibTeX

@inproceedings{tikhonov-etal-2021-storydb,
    title = "{S}tory{DB}: Broad Multi-language Narrative Dataset",
    author = "Tikhonov, Alexey  and
      Samenko, Igor  and
      Yamshchikov, Ivan P.",
    editor = "Gao, Yang  and
      Eger, Steffen  and
      Zhao, Wei  and
      Lertvittayakumjorn, Piyawat  and
      Fomicheva, Marina",
    booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eval4nlp-1.4/",
    doi = "10.18653/v1/2021.eval4nlp-1.4",
    pages = "32--39",
    abstract = "This paper presents StoryDB {---} a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tikhonov-etal-2021-storydb">
    <titleInfo>
        <title>StoryDB: Broad Multi-language Narrative Dataset</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Alexey</namePart>
        <namePart type="family">Tikhonov</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Igor</namePart>
        <namePart type="family">Samenko</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ivan</namePart>
        <namePart type="given">P</namePart>
        <namePart type="family">Yamshchikov</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2021-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Yang</namePart>
            <namePart type="family">Gao</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Steffen</namePart>
            <namePart type="family">Eger</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Wei</namePart>
            <namePart type="family">Zhao</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Piyawat</namePart>
            <namePart type="family">Lertvittayakumjorn</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marina</namePart>
            <namePart type="family">Fomicheva</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Punta Cana, Dominican Republic</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This paper presents StoryDB — a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.</abstract>
    <identifier type="citekey">tikhonov-etal-2021-storydb</identifier>
    <identifier type="doi">10.18653/v1/2021.eval4nlp-1.4</identifier>
    <location>
        <url>https://aclanthology.org/2021.eval4nlp-1.4/</url>
    </location>
    <part>
        <date>2021-11</date>
        <extent unit="page">
            <start>32</start>
            <end>39</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T StoryDB: Broad Multi-language Narrative Dataset
%A Tikhonov, Alexey
%A Samenko, Igor
%A Yamshchikov, Ivan P.
%Y Gao, Yang
%Y Eger, Steffen
%Y Zhao, Wei
%Y Lertvittayakumjorn, Piyawat
%Y Fomicheva, Marina
%S Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
%D 2021
%8 November
%I Association for Computational Linguistics
%C Punta Cana, Dominican Republic
%F tikhonov-etal-2021-storydb
%X This paper presents StoryDB — a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
%R 10.18653/v1/2021.eval4nlp-1.4
%U https://aclanthology.org/2021.eval4nlp-1.4/
%U https://doi.org/10.18653/v1/2021.eval4nlp-1.4
%P 32-39

Markdown (Informal)

[StoryDB: Broad Multi-language Narrative Dataset](https://aclanthology.org/2021.eval4nlp-1.4/) (Tikhonov et al., Eval4NLP 2021)
