AbstractCreativity is an essential element of human nature used for many activities, such as telling a story. Based on human creativity, researchers have attempted to teach a computer to generate stories automatically or support this creative process. In this study, we undertake the task of story ending generation. This is a relatively new task, in which the last sentence of a given incomplete story is automatically generated. This is challenging because, in order to predict an appropriate ending, the generation method should comprehend the context of events. Despite the importance of this task, no clear evaluation metric has been established thus far; hence, it has remained an open problem. Therefore, we study the various elements involved in evaluating an automatic method for generating story endings. First, we introduce a baseline hierarchical sequence-to-sequence method for story ending generation. Then, we conduct a pairwise comparison against human-written endings, in which annotators choose the preferable ending. In addition to a quantitative evaluation, we conduct a qualitative evaluation by asking annotators to specify the reason for their choice. From the collected reasons, we discuss what elements the evaluation should focus on, to thereby propose effective metrics for the task.