StoryDB: Broad Multi-language Narrative Dataset

This paper presents StoryDB, a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories, and some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages, including low-resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely mDistilBERT, mBERT, and XLM-RoBERTa.


Introduction
Stories are central to human culture and communication. However, it seems that stories are easier said than generated. Despite incredible recent progress in natural language processing, the generation of longer texts is still a challenge (van Stegeren and Theune, 2019; Rashkin et al., 2020). Ostermann et al. (2019) present a machine comprehension corpus for the end-to-end evaluation of script knowledge, in which 50% of the questions require script knowledge for the correct answer. The authors demonstrate that, though the task is not challenging to humans, existing machine comprehension models fail to perform well on the data, even if they make use of a commonsense knowledge base.
Partially, this challenge could be attributed to the lack of adequate memory models. Longer texts demand better memory mechanisms, and possible ways to construct such mechanisms have been discussed in the literature for the last 25 years. Long short-term memory networks (Hochreiter and Schmidhuber, 1997), Neural Turing Machines (Graves et al., 2014), memory networks (Weston et al., 2014), and many other architectures try to tackle this problem. Attempts to introduce some form of memory in transformers, such as those of Guo et al. (2019) or Burtsev and Sapunov (2020), could be regarded as the next steps in this long line of work.
There are some interesting recent attempts to generate long texts in a form that makes such longer text feasible for a human reader. For example, Agafonova et al. (2020) generate a diary of a neural network. Yet the generation of a narrative is still challenging. For a detailed review of earlier approaches to narrative generation, we refer the reader to Kybartas and Bidarra (2016). Even modern models for narrative generation rely heavily on some form of expert knowledge or some type of hierarchical structure of the narrative. For example, Fan et al. (2019) first generate the predicate-argument structure of the text, then generate a surface realization of the predicate-argument structure, and finally replace the entity placeholders with context-sensitive names and references. Fan et al. (2019) and Ammanabrolu et al. (2020) propose a hierarchical generation framework that first plans a storyline and then generates a story based on it, and present a technique for preprocessing textual story data into event sequences. Xu et al. (2018) develop a model that first generates the most critical phrases, called a skeleton, and then expands the skeleton to a complete and fluent sentence. Similarly, Martin et al. (2018) provide a mid-level of abstraction between words and a sentence to minimize event sparsity and present a technique for automated story generation whereby the problem is decomposed into the generation of successive events and the generation of natural language sentences from events. Finally, Brahman et al. (2020) develop an approach in which the user provides the model with such mid-level sentence abstractions in the form of cue phrases during the generation process.
However, we should take into consideration that modern Natural Language Processing (NLP) is fundamentally an experimental discipline, so the lack of dedicated data could be another bottleneck for the development of narrative generation. This paper tries to address this problem.
Unfortunately, the majority of available narrative datasets deal with some constrained form of a short plot that is usually called a scenario. These scenarios are centered around common activities, e.g., going grocery shopping or taking a shower. The narrative datasets available in the literature are also extremely small and cannot be used with the most advanced modern NLP models. Regneri et al. (2010) collect 493 event sequence descriptions for 22 behavior scenarios. Modi et al. (2016) present the InScript dataset, which consists of 1,000 stories centered around 10 different scenarios. Wanzare et al. (2019) provide 200 scenarios and attempt to identify all references to them in a collection of narrative texts. Mostafazadeh et al. (2016) present a corpus of 50k five-sentence commonsense stories. Finally, there is the MPST dataset (Kar et al., 2018), which contains 14K movie plot synopses, and WikiPlots, which contains 112 936 story plots extracted from English-language Wikipedia. Recently, Malysheva et al. (2021) provided a dataset of TV series along with an instrument for narrative arc analysis. These datasets are useful, yet, like the vast majority of narrative datasets, they are available only in English.
This paper provides a large multi-language dataset of stories in natural language. The stories have a cross-language index, and every story and character is cross-linked if it occurs in different languages. Additionally, the texts have tags such as a genre or a topic. This is the first story dataset of such magnitude that we know of. We hope that a large dataset of long storylines could be used for various aspects of narrative research as well as to facilitate experiments with end-to-end narrative generation.

Data
StoryDB is motivated by several interesting experiments that used WikiPlots, one of the larger English datasets of narratives available for all-purpose narrative research, which we have mentioned earlier.
Given the various applications that the WikiPlots dataset has found in the NLP community, we believe that StoryDB will be even more useful thanks to its multiple languages, advanced filtering that guarantees higher quality of the obtained data, and genre tagging. To improve reproducibility and keep StoryDB usable as Wikipedia is further updated, we publish the data as well as the code for the filtering pipeline. The stories that form StoryDB are extracted from any Wikipedia article that contains a sub-header with the word "plot" (e.g., "Plot", "Plot Summary", etc.) in the corresponding language.

Dataset structure
The dataset consists of several index files and includes a directory plots. Every file in the directory has a similar structure. The first two letters of the filename stand for the ISO 639-1 code of the language of the texts presented in the file. For example, hy_plots.tsv contains 4 861 plots in the Armenian language. The file simple_plots.tsv contains stories in Simple English. Every entry in a plots file has a similar structure and includes the following fields:
• ID: the unique number of a plot that is the same across every language in the dataset;
• Lang: the language of this particular entry;
• Link: a link to the Wikipedia page containing the plot;
• Title: the title of the story;
• Text: the text of the story;
• Categories: the categories that Wikipedia assigns to this story.
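The field list above suggests a straightforward way to read a plots file. The following is a minimal sketch, assuming the files are tab-separated and carry a header row with exactly these six field names; the header layout, the helper name read_plots, and the sample row are illustrative assumptions rather than details confirmed by the dataset.

```python
import csv
import io

# Field names follow the list above; whether the real files carry a header
# row with exactly these names is an assumption.
FIELDS = ["ID", "Lang", "Link", "Title", "Text", "Categories"]


def read_plots(handle):
    """Parse a tab-separated plots file into a list of dicts keyed by field name."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for a file such as simple_plots.tsv (made-up row).
sample = io.StringIO(
    "ID\tLang\tLink\tTitle\tText\tCategories\n"
    "17\tsimple\thttps://example.org/wiki/A\tA story\tOnce upon a time.\tfantasy novels\n"
)
plots = read_plots(sample)
print(plots[0]["Title"])
```

For a real file, the same function could be called on a handle opened with open(path, encoding="utf-8", newline="").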
One can navigate across plot files using StoryDB's index file plot_matrix.tsv. The rows of the file stand for languages. If a given plot is available in a given language, then the title of this plot stands in the corresponding cell of plot_matrix.tsv. For example, if "Wee free men" is available in Simple English, it could be found by its title in the corresponding simple_plots.tsv. StoryDB also includes plot_rake.tsv, which contains keywords extracted with the RAKE algorithm (Rose et al., 2010) for every story.
Finally, the files ID_lang_tag.tsv and ID_tag_average.tsv include information about the tags that correspond to a given story. We discuss the tagging procedure in detail later.

Preprocessing
Our motivation is to provide a dataset of storylines for various languages, including the low-resource ones. Roughly speaking, we want to be sure that every story that ends up in StoryDB is a legitimate storyline description in the corresponding natural language. Thus we are more interested in the precision of the dataset than in the recall. To guarantee a higher quality of the obtained stories, we implemented several heuristic filters that we briefly describe here.
English Wikipedia is an order of magnitude bigger than any other Wikipedia, both in terms of users and in terms of admins. This makes the English list of storylines the most extensive one. We regard it as the least noisy one and use it as a reference source for the filtering procedure. We exclude every page that includes a plot yet has no plot section in English Wikipedia for the same entry.
If Wikipedia in language X has a page with title A and this page is also available in language Y under title B, we list such a pair of stories as [language_X, title_A, language_Y, title_B]. Every entry in this list is an edge in a graph of stories. Every vertex in this graph has a corresponding name [language, title]. Unlike connected stories from different languages, which usually contain similar storylines, the stories listed under the same name in the same language might differ significantly. Say, two stories in language X, [language_X, title_A] and [language_X, title_B], are both linked to one story in another language Y, [language_Y, title_C]. To avoid such ambiguities, we exclude fully connected components that contain more than one entry in the same language. The obtained list of stories ends up in the resulting matrix of stories used to navigate the dataset. We experimented with various filtering procedures and found this combination to produce a sufficiently rich dataset with a minimum amount of duplicates.
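The component filter described above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: it reads "fully connected components" as ordinary connected components of the undirected story graph, and the function name filter_components and the toy edges are hypothetical.

```python
from collections import defaultdict


def filter_components(edges):
    """Return connected components that hold at most one vertex per language.

    `edges` is an iterable of (lang_x, title_a, lang_y, title_b) tuples.
    Treating the graph as undirected is an assumption of this sketch.
    """
    graph = defaultdict(set)
    for lang_x, title_a, lang_y, title_b in edges:
        u, v = (lang_x, title_a), (lang_y, title_b)
        graph[u].add(v)
        graph[v].add(u)

    seen, kept = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        languages = [lang for lang, _ in component]
        # Keep the component only if no language appears twice.
        if len(languages) == len(set(languages)):
            kept.append(sorted(component))
    return kept


edges = [
    ("en", "A", "de", "A_de"),
    ("en", "A", "fr", "A_fr"),
    ("en", "B", "de", "C"),
    ("en", "B2", "de", "C"),  # two English pages linked to one German page
]
components = filter_components(edges)
print(components)
```

In the toy input, the component around the German page "C" contains two English entries and is therefore dropped, while the three-language component around "A" is kept.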
StoryDB is also equipped with a catalog of characters. If a given character that has an individual Wikipedia page is mentioned in a story, its description in the original language is saved into the corresponding tsv-file alongside the ID of the story and the language of the description.

Tagging
We annotate the resulting stories using meta-information on categories from the Wiki API. For every plot, we list all translated categories marked in every language in which this plot is available. Then we search these category lists for substrings that include tags from the manually created list of tags. This allows us to provide language-specific tags for every language, which are listed in ID_lang_tag.tsv. For example, the Czech version of Black Night has the tags action, crime, drama, superhero, comics, and thriller, while the same story in Persian has no tag comics but has the additional tags neo-noir, psychological, epic, and screenplays.
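The substring search over category names can be illustrated with a short sketch. The tag list and category strings below are made up for the example, and the case-insensitive comparison is an assumption; the actual manually created tag list is not reproduced here.

```python
def tags_from_categories(categories, tag_list):
    """Return the tags that occur as substrings of any category name."""
    lowered = [category.lower() for category in categories]  # assumed case-insensitive
    return sorted({tag for tag in tag_list if any(tag in c for c in lowered)})


tag_list = ["action", "crime", "drama", "comics"]  # illustrative tags only
categories = ["American crime drama films", "Films based on DC Comics"]
found = tags_from_categories(categories, tag_list)
print(found)
```

Plain substring matching is simple but can over-match (a short tag may appear inside an unrelated word), which is one reason a curated tag list matters for precision.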
The file ID_tag_average.tsv includes the scores of the tags available for every story. The scores are calculated as follows: we count the number of times that a given tag is associated with a given story, and then divide this number by the total number of languages in which the story is represented. The obtained space of tags could be useful for narrative exploration. Every story becomes a vector with every coordinate on the interval [0, 1]. Figure 1 shows a t-SNE visualisation of this space (Van der Maaten and Hinton, 2008) alongside the centroids of the more distinctive tags.
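As a worked example of the score described above, the sketch below computes, for one story, the share of languages in which each tag appears. The function name tag_scores and the per-language tag sets are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def tag_scores(tags_by_language):
    """Map each tag to the share of languages in which the story carries it."""
    n_languages = len(tags_by_language)
    # A tag is counted at most once per language.
    counts = Counter(tag for tags in tags_by_language.values() for tag in set(tags))
    return {tag: count / n_languages for tag, count in counts.items()}


# Hypothetical tags of one story in four languages.
story = {
    "cs": ["action", "crime", "comics"],
    "fa": ["action", "crime", "neo-noir"],
    "en": ["action"],
    "de": ["action", "crime"],
}
scores = tag_scores(story)
print(scores)
```

Here "action" appears in all four languages and scores 1.0, "crime" appears in three and scores 0.75, and the two language-specific tags score 0.25 each, so every coordinate stays within [0, 1].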

StoryDB
Figure 2 shows the relative size of the datasets in every language presented in StoryDB. English heavily dominates, followed by Italian, French, Russian, and German.
There are more than 20 languages that have three thousand or more stories available, including such languages as Finnish, Hungarian, or Persian. Table 1 summarises some of the resulting parameters of the obtained dataset.
StoryDB
Number of languages                   42
Median # of stories in a language     2 772
Maximal # of stories in a language    63 756
Minimum # of stories in a language    568
Table 1: Some resulting parameters of the StoryDB.

Evaluation
We have used three modern transformer-based architectures for the evaluation:
• mBERT (Devlin et al., 2018), a multilanguage version of BERT;
• mDistilBERT (Sanh et al., 2019), a distilled version of multi-language BERT;
• XLM-RoBERTa (Conneau et al., 2020), a model that is two times larger than BERT in terms of the number of parameters.
We evaluated three tasks:
• Task A. Multilabel classification for tags on a multilanguage corpus of plots;
• Task B. Multilabel classification for tags where a model is trained on one language and tested on another;
• Task C. Multilabel classification for tags in cross-lingual learning with a corpus of overlapping plots that occur in every language.
Let us now describe every task in detail.

Task A
We have sampled the ten most frequent tags out of StoryDB (the tag 'film' was the most frequent, yet it was excluded as a somewhat redundant one). These tags were: 'drama', 'comedy', 'television', 'fiction', 'series', 'action', 'thriller', 'black-and-white', 'science fiction', 'horror'. These ten tags form a vector, where every dimension corresponds to one particular tag: '1' encodes the presence of the tag and '0' stands for the absence of it.
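The encoding described above can be sketched in a few lines. The order of the tags in the vector follows the order in which they are listed in the text, which is an assumption; the helper name encode_tags is hypothetical.

```python
# The ten tags from the text; using this order for the vector dimensions
# is an assumption of the sketch.
TAGS = [
    "drama", "comedy", "television", "fiction", "series",
    "action", "thriller", "black-and-white", "science fiction", "horror",
]


def encode_tags(story_tags):
    """Return a 0/1 vector with one dimension per tag in TAGS."""
    present = set(story_tags)
    return [1 if tag in present else 0 for tag in TAGS]


# 'film' is not among the ten tags, so it is ignored.
vector = encode_tags(["comedy", "horror", "film"])
print(vector)
```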
For every language out of the top ten in StoryDB, we have sampled 2000 plots such that every plot has at least one tag out of the list of the ten most popular tags. In Task A the plots were sampled randomly for every language, so there is some overlap between languages. On average, 2% of the plots in one language reoccur in another one. It is important to note that the set of tags for a given plot might differ across languages and one plot could have several tags simultaneously. Thus, multilabel classification is a natural evaluation task under these circumstances.
Since the dataset is not balanced with respect to tags, we used the binary cross-entropy loss over the vector of tags. Table 2 and Table 3 sum up the results of the three models on a multilanguage dataset of plots. Further details across languages and tags are available online.
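For reference, binary cross-entropy over a tag vector treats every tag as an independent binary decision. The sketch below is a plain-Python illustration of that formula; averaging over the tags and clipping probabilities with eps are assumptions of the sketch, and the training code used for the experiments may rely on a library implementation with different reduction settings.

```python
import math


def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


# Hypothetical targets and predictions for four tags of one plot.
loss = binary_cross_entropy([1, 0, 0, 1], [0.9, 0.2, 0.1, 0.6])
print(round(loss, 4))
```

Each tag contributes its own term to the loss, so a plot with several tags is handled without forcing the tags to compete, as a softmax over tags would.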

Task B
We now use a similar setup, yet train every model on one language in StoryDB and test its accuracy on another language. The parameters of the training datasets and labels are the same as in Task A above, but every model is trained on one dataset and is then tested on other languages. Table 4 shows the performance of mBERT; mDistilBERT and XLM-RoBERTa demonstrate similar behavior. The detailed results could be found online. Table 4 demonstrates that if we train the model on one language and validate it on another, the quality of the multilabel tag classification drops. This drop varies across languages and tends to be smaller for languages that belong to the same language family.

Task C
The last validation is similar to Task B, yet now we sample plots that overlap in every language. This limits us to 1500 plots in six languages that we split into train and test. Now every plot occurs in every language. Table 5 shows that the model manages to recover certain tags in one language after the pre-training on the other. Table 5 shows the performance of XLM-RoBERTa; mDistilBERT and mBERT demonstrate similar behavior. The performance of the models tends to be better on overlapping plots if we compare it to Task B. The detailed results could be found online.
This series of evaluation tasks demonstrates two crucial properties of StoryDB:
• StoryDB could be used to work with narrative structures on the most abstract cross-lingual level;
• StoryDB allows controlling for various cross-lingual similarities of plots during ablation experiments with models of narrative.

Discussion
We believe that a broad multilanguage dataset of narratives can facilitate several areas of narrative research.
• Cross-cultural research of narrative structure. StoryDB provides possibilities to compare the structure of narrative in various languages. Since StoryDB includes every story in its original language and is equipped with a universal system of tags, it is a natural source for such cross-cultural research.
• Classification of narratives. StoryDB includes an extensive amount of narratives for various languages alongside their genre tags. This makes it possible to develop new methods for narrative classification as well as to extensively test the ones that already exist; see, for example, Reiter et al. (2014).
• Quantitative research of the narrative structure. Pérez y Pérez (2007) represents a story as a cluster of emotional links and tensions between characters that progress over storytime. StoryDB includes the description of the plots alongside the key characters. Such information could be insightful for a deeper quantitative understanding of narrative as a by-product of character interaction.
• Summarization of narrative. Parallel corpora in different languages contain similar descriptions of the narrative that could vary in terms of details and length. That makes StoryDB a useful tool for potential narrative summarization research such as (Barros et al., 2019).
• End-to-end narrative generation. StoryDB is the first dataset of narratives that we know of that contains narrative descriptions in various natural languages.

Conclusion
This paper presents StoryDB, a broad multilanguage dataset of narratives. We describe the construction of the dataset, provide the code for the whole pipeline, list the parameters of the resulting dataset, and briefly discuss several areas of natural language processing research where StoryDB could be useful for the community. We hope that StoryDB could be broadened as more plot descriptions are added to various languages. These considerations make StoryDB a flexible resource that would be relevant for the NLP community as the subfield of quantitative narrative research moves on.

Figure 1: t-SNE visualisation for plots in StoryDB clustered according to their tags. The figure shows centroids of the tags with higher variance across the dataset.

Figure 2: Number of stories in every language that has more than five hundred entries in StoryDB.

Table 2: Task A. Hamming score and multilabel accuracy for the vector of predicted tags on a validation set.

Table 3: Task A. AUC-ROC for binary tag classifiers on a validation set. Training data consists of sixteen thousand plots in ten languages, with one tenth of the dataset in every language.