Atsushi Hashimoto


pdf bib
Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning
Ukyo Honda | Yoshitaka Ushiku | Atsushi Hashimoto | Taro Watanabe | Yuji Matsumoto
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate the effect of removing mismatched words from image-sentence alignment to determine how they make this task difficult. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.


pdf bib
Visual Grounding Annotation of Recipe Flow Graph
Taichi Nishimura | Suzushi Tomori | Hayato Hashimoto | Atsushi Hashimoto | Yoko Yamakata | Jun Harashima | Yoshitaka Ushiku | Shinsuke Mori
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we provide a dataset that gives visual grounding annotations to recipe flow graphs. A recipe flow graph is a representation of the cooking workflow, which is designed with the aim of understanding the workflow from natural language processing. Such a workflow will increase its value when grounded to real-world activities, and visual grounding is a way to do so. Visual grounding is provided as bounding boxes to image sequences of recipes, and each bounding box is linked to an element of the workflow. Because the workflows are also linked to the text, this annotation gives visual grounding with workflow’s contextual information between procedural text and visual observation in an indirect manner. We subsidiarily annotated two types of event attributes with each bounding box: “doing-the-action,” or “done-the-action”. As a result of the annotation, we got 2,300 bounding boxes in 272 flow graph recipes. Various experiments showed that the proposed dataset enables us to estimate contextual information described in recipe flow graphs from an image sequence.


pdf bib
Procedural Text Generation from a Photo Sequence
Taichi Nishimura | Atsushi Hashimoto | Shinsuke Mori
Proceedings of the 12th International Conference on Natural Language Generation

Multimedia procedural texts, such as instructions and manuals with pictures, support people to share how-to knowledge. In this paper, we propose a method for generating a procedural text given a photo sequence allowing users to obtain a multimedia procedural text. We propose a single embedding space both for image and text enabling to interconnect them and to select appropriate words to describe a photo. We implemented our method and tested it on cooking instructions, i.e., recipes. Various experimental results showed that our method outperforms standard baselines.


pdf bib
Procedural Text Generation from an Execution Video
Atsushi Ushiku | Hayato Hashimoto | Atsushi Hashimoto | Shinsuke Mori
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In recent years, there has been a surge of interest in automatically describing images or videos in a natural language. These descriptions are useful for image/video search, etc. In this paper, we focus on procedure execution videos, in which a human makes or repairs something and propose a method for generating procedural texts from them. Since video/text pairs available are limited in size, the direct application of end-to-end deep learning is not feasible. Thus we propose to train Faster R-CNN network for object recognition and LSTM for text generation and combine them at run time. We took pairs of recipe and cooking video, generated a recipe from a video, and compared it with the original recipe. The experimental results showed that our method can produce a recipe as accurate as the state-of-the-art scene descriptions.


pdf bib
FlowGraph2Text: Automatic Sentence Skeleton Compilation for Procedural Text Generation
Shinsuke Mori | Hirokuni Maeta | Tetsuro Sasada | Koichiro Yoshino | Atsushi Hashimoto | Takuya Funatomi | Yoko Yamakata
Proceedings of the 8th International Natural Language Generation Conference (INLG)