<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="2000">
    <title>Proceedings of the Sixth Workshop on Vision and Language</title>
    <editor>Anya Belz</editor>
    <editor>Erkut Erdem</editor>
    <editor>Katerina Pastra</editor>
    <editor>Krystian Mikolajczyk</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-20</url>
    <bibtype>book</bibtype>
    <bibkey>VL17:2017</bibkey>
  </paper>

  <paper id="2001">
    <title>The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings</title>
    <author><first>Yanchao</first><last>Yu</last></author>
    <author><first>Arash</first><last>Eshghi</last></author>
    <author><first>Gregory</first><last>Mills</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;10</pages>
    <url>http://www.aclweb.org/anthology/W17-2001</url>
    <abstract>We motivate and describe a new freely available human-human dialogue data set
	for interactive learning of visually grounded word meanings through ostensive
	definition by a tutor to a learner. The data has been collected using a novel,
	character-by-character variant of the DiET
	chat tool (Healey et al., 2003; anon.) with a novel task, where a Learner needs
	to learn invented visual attribute words (such as "burchak" for square)
	from a tutor. As such, the text-based interactions closely resemble
	face-to-face conversation and thus contain many of the linguistic
	phenomena encountered in natural, spontaneous dialogue. These include self- and
	other-correction, mid-sentence continuations, interruptions, turn overlaps,
	fillers, hedges and many kinds of ellipsis. We also present a generic n-gram
	framework for building user (i.e. tutor) simulations from this type of
	incremental dialogue data, which is freely available to researchers. We show
	that the simulations produce outputs that are similar to the original data
	(e.g. 78% turn match similarity). Finally, we train and evaluate a
	Reinforcement Learning dialogue control agent for learning visually grounded
	word meanings, trained from the BURCHAK corpus. The learned policy shows
	comparable performance to a rule-based system built previously.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yu-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2002">
    <title>The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System</title>
    <author><first>Brandon</first><last>Birmingham</last></author>
    <author><first>Adrian</first><last>Muscat</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>11&#8211;20</pages>
    <url>http://www.aclweb.org/anthology/W17-2002</url>
    <abstract>In this paper, a retrieval-based caption generation system that searches the
	web for suitable image descriptions is studied. Google's reverse image search
	is used to find potentially relevant web multimedia content for query images.
	Sentences are extracted from web pages and the likelihood of the descriptions is
	computed to select one sentence from the retrieved text documents. The search
	mechanism is modified to replace the caption generated by Google with a caption
	composed of labels and spatial prepositions as part of the query's text
	alongside the image. The object labels are obtained using an off-the-shelf
	R-CNN and a machine learning model is developed to predict the prepositions.
	The effect on the caption generation system performance when using the
	generated text is investigated. Both human evaluations and automatic metrics
	are used to evaluate the retrieved descriptions. Results show that the
	web-retrieval-based approach performed better when describing single-object
	images with sentences extracted from stock photography websites. On the other
	hand, images containing two objects were better described with
	template-generated sentences composed of object labels and prepositions.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>birmingham-muscat:2017:VL17</bibkey>
  </paper>

  <paper id="2003">
    <title>Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision</title>
    <author><first>Aparna</first><last>Nurani Venkitasubramanian</last></author>
    <author><first>Tinne</first><last>Tuytelaars</last></author>
    <author><first>Marie-Francine</first><last>Moens</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>21&#8211;30</pages>
    <url>http://www.aclweb.org/anthology/W17-2003</url>
    <abstract>We investigate animal recognition models learned from wildlife video
	documentaries by using the weak supervision of the textual subtitles. This is a
	particularly challenging setting, since i) the animals occur in their natural
	habitat and are often largely occluded and ii) subtitles are to a large degree
	complementary to the visual content, providing a very weak supervisory signal.
	This is in contrast to most work on integrated vision and language in the
	literature, where textual descriptions are tightly linked to the image content,
	and often generated in a curated fashion for the task at hand. In particular,
	we investigate different image representations and models, including a support
	vector machine on top of activations of a pretrained convolutional neural
	network, as well as a Naive Bayes framework on a 'bag-of-activations' image
	representation, where each element of the bag is considered separately. This
	representation allows key components in the image to be isolated, in spite of
	largely varying backgrounds and image clutter, without an object detection or
	image segmentation step. The methods are evaluated based on how well they
	transfer to unseen camera-trap images captured across diverse topographical
	regions under different environmental conditions and illumination settings,
	involving a large domain shift.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nuranivenkitasubramanian-tuytelaars-moens:2017:VL17</bibkey>
  </paper>

  <paper id="2004">
    <title>Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles</title>
    <author><first>Iacer</first><last>Calixto</last></author>
    <author><first>Daniel</first><last>Stein</last></author>
    <author><first>Evgeny</first><last>Matusov</last></author>
    <author><first>Sheila</first><last>Castilho</last></author>
    <author><first>Andy</first><last>Way</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>31&#8211;37</pages>
    <url>http://www.aclweb.org/anthology/W17-2004</url>
    <abstract>In this paper, we study how humans perceive the use of images as an additional
	knowledge source to machine-translate user-generated product listings in an
	e-commerce company. We conduct a human evaluation where we assess how a
	multi-modal neural machine translation (NMT) model compares to two text-only
	approaches: a conventional state-of-the-art attention-based NMT and a
	phrase-based statistical machine translation (PBSMT) model. We evaluate
	translations obtained with different systems and also discuss the data set of
	user-generated product listings, which in our case comprises both product
	listings and associated images. We found that humans preferred translations
	obtained with a PBSMT system to both text-only and multi-modal NMT over 56% of
	the time. Nonetheless, human evaluators ranked translations from a multi-modal
	NMT model as better than those of a text-only NMT over 88% of the time, which
	suggests that images do help NMT in this use-case.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>calixto-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2005">
    <title>The BreakingNews Dataset</title>
    <author><first>Arnau</first><last>Ramisa</last></author>
    <author><first>Fei</first><last>Yan</last></author>
    <author><first>Francesc</first><last>Moreno-Noguer</last></author>
    <author><first>Krystian</first><last>Mikolajczyk</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>38&#8211;39</pages>
    <url>http://www.aclweb.org/anthology/W17-2005</url>
    <abstract>We present BreakingNews, a novel dataset with approximately 100K news articles
	including images, text and captions, and enriched with heterogeneous meta-data
	(e.g. GPS coordinates and popularity metrics). The tenuous connection between
	the images and text in news data makes this dataset well suited to taking work at the
	intersection of Computer Vision and Natural Language Processing to the next
	step, and we hence hope it will help spur progress in the field.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ramisa-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2006">
    <title>Automatic identification of head movements in video-recorded conversations: can words help?</title>
    <author><first>Patrizia</first><last>Paggio</last></author>
    <author><first>Costanza</first><last>Navarretta</last></author>
    <author><first>Bart</first><last>Jongejan</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>40&#8211;42</pages>
    <url>http://www.aclweb.org/anthology/W17-2006</url>
    <abstract>We present an approach where an SVM classifier learns to classify head
	movements based on measurements of velocity, acceleration, and jerk (the third
	derivative of position with respect to time). The classifier is then used to add
	head-movement annotations to new video data. The results of the
	automatic annotation are evaluated against manual annotations in the same data
	and show an accuracy of 68% with respect to these. The results also show that
	using jerk improves accuracy. We then conduct an investigation of the
	overlap between temporal sequences classified as either movement or non-movement
	and the speech stream of the person performing the gesture. The statistics derived from this
	analysis show that using word features may help increase the accuracy of the
	model.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>paggio-navarretta-jongejan:2017:VL17</bibkey>
  </paper>

  <paper id="2007">
    <title>Multi-Modal Fashion Product Retrieval</title>
    <author><first>Antonio</first><last>Rubio Romano</last></author>
    <author><first>LongLong</first><last>Yu</last></author>
    <author><first>Edgar</first><last>Simo-Serra</last></author>
    <author><first>Francesc</first><last>Moreno-Noguer</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>43&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-2007</url>
    <abstract>Finding a product in the fashion world can be a daunting task. Every day,
	e-commerce sites are updated with thousands of images and their associated
	metadata (textual information), deepening the problem. In this paper, we
	leverage both the images and textual metadata and propose a joint multi-modal
	embedding that maps both the text and images into a common latent space.
	Distances in the latent space correspond to similarity between products,
	allowing us to effectively perform retrieval in this latent space. We compare
	against existing approaches and show significant improvements in retrieval
	tasks on a large-scale e-commerce dataset.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rubioromano-EtAl:2017:VL17</bibkey>
  </paper>

</volume>

