It has become increasingly common for people to share cooking recipes on the Internet. Along with the increase in the number of shared recipes, there have been corresponding increases in recipe-related studies and datasets. However, there are still few datasets that provide linguistic annotations for the recipe-related studies even though such annotations should form the basis of the studies. This paper introduces a novel recipe-related dataset, named Cookpad Parsed Corpus, which contains linguistic annotations for Japanese recipes. We randomly extracted 500 recipes from the largest recipe-related dataset, the Cookpad Recipe Dataset, and annotated 4,738 sentences in the recipes with morphemes, named entities, and dependency relations. This paper also reports benchmark results on our corpus for Japanese morphological analysis, named entity recognition, and dependency parsing. We show that there is still room for improvement in the analyses of recipes.
Non-ingredient Detection in User-generated Recipes using the Sequence Tagging Approach
Yasuhiro Yamaguchi | Shintaro Inuzuka | Makoto Hiramatsu | Jun Harashima
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Recently, the number of user-generated recipes on the Internet has increased. In such recipes, users are generally supposed to write a title, an ingredient list, and steps to create a dish. However, some items in an ingredient list in a user-generated recipe are not actually edible ingredients. For example, headings, comments, and kitchenware sometimes appear in an ingredient list because users can freely write the list in their recipes. Such noise makes it difficult for computers to use recipes for a variety of tasks, such as calorie estimation. To address this issue, we propose a non-ingredient detection method inspired by a neural sequence tagging model. In our experiment, we annotated 6,675 ingredients in 600 user-generated recipes and showed that our proposed method achieved a 93.3 F1 score.