<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="2000">
    <title>Proceedings of the Sixth Workshop on Vision and Language</title>
    <editor>Anya Belz</editor>
    <editor>Erkut Erdem</editor>
    <editor>Katerina Pastra</editor>
    <editor>Krystian Mikolajczyk</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-20</url>
    <bibtype>book</bibtype>
    <bibkey>VL17:2017</bibkey>
  </paper>

  <paper id="2001">
    <title>The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings</title>
    <author><first>Yanchao</first><last>Yu</last></author>
    <author><first>Arash</first><last>Eshghi</last></author>
    <author><first>Gregory</first><last>Mills</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;10</pages>
    <url>http://www.aclweb.org/anthology/W17-2001</url>
    <abstract>We motivate and describe a new freely available human-human dialogue data set
	for interactive learning of visually grounded word meanings through ostensive
	definition by a tutor to a learner. The data has been collected using a novel,
	character-by-character variant of the DiET
	chat tool (Healey et al., 2003; anon.) with a novel task, where a Learner needs
	to learn invented visual attribute words (such as "burchak" for square)
	from a tutor. As such, the text-based interactions closely resemble
	face-to-face conversation and thus contain many of the linguistic
	phenomena encountered in natural, spontaneous dialogue. These include self- and
	other-correction, mid-sentence continuations, interruptions, turn overlaps,
	fillers, hedges and many kinds of ellipsis. We also present a generic n-gram
	framework for building user (i.e. tutor) simulations from this type of
	incremental dialogue data, which is freely available to researchers. We show
	that the simulations produce outputs that are similar to the original data
	(e.g. 78% turn match similarity). Finally, we train and evaluate a
	Reinforcement Learning dialogue control agent for learning visually grounded
	word meanings, trained from the BURCHAK corpus. The learned policy shows
	comparable performance to a rule-based system built previously.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yu-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2002">
    <title>The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System</title>
    <author><first>Brandon</first><last>Birmingham</last></author>
    <author><first>Adrian</first><last>Muscat</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>11&#8211;20</pages>
    <url>http://www.aclweb.org/anthology/W17-2002</url>
    <abstract>In this paper, a retrieval-based caption generation system that searches the
	web for suitable image descriptions is studied. Google's reverse image search
	is used to find potentially relevant web multimedia content for query images.
	Sentences are extracted from web pages and the likelihood of the descriptions is
	computed to select one sentence from the retrieved text documents. The search
	mechanism is modified to replace the caption generated by Google with a caption
	composed of labels and spatial prepositions as part of the query's text
	alongside the image. The object labels are obtained using an off-the-shelf
	R-CNN and a machine learning model is developed to predict the prepositions.
	The effect on the caption generation system performance when using the
	generated text is investigated. Both human evaluations and automatic metrics
	are used to evaluate the retrieved descriptions. Results show that the
	web-retrieval-based approach performed better when describing single-object
	images with sentences extracted from stock photography websites. On the other
	hand, images containing two objects were better described with
	template-generated sentences composed of object labels and prepositions.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>birmingham-muscat:2017:VL17</bibkey>
  </paper>

  <paper id="2003">
    <title>Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision</title>
    <author><first>Aparna</first><last>Nurani Venkitasubramanian</last></author>
    <author><first>Tinne</first><last>Tuytelaars</last></author>
    <author><first>Marie-Francine</first><last>Moens</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>21&#8211;30</pages>
    <url>http://www.aclweb.org/anthology/W17-2003</url>
    <abstract>We investigate animal recognition models learned from wildlife video
	documentaries by using the weak supervision of the textual subtitles. This is a
	particularly challenging setting, since i) the animals occur in their natural
	habitat and are often largely occluded and ii) subtitles are to a large degree
	complementary to the visual content, providing a very weak supervisory signal.
	This is in contrast to most work on integrated vision and language in the
	literature, where textual descriptions are tightly linked to the image content,
	and often generated in a curated fashion for the task at hand. In particular,
	we investigate different image representations and models, including a support
	vector machine on top of activations of a pretrained convolutional neural
	network, as well as a Naive Bayes framework on a 'bag-of-activations' image
	representation, where each element of the bag is considered separately. This
	representation allows key components in the image to be isolated, in spite of
	largely varying backgrounds and image clutter, without an object detection or
	image segmentation step. The methods are evaluated based on how well they
	transfer to unseen camera-trap images captured across diverse topographical
	regions under different environmental conditions and illumination settings,
	involving a large domain shift.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nuranivenkitasubramanian-tuytelaars-moens:2017:VL17</bibkey>
  </paper>

  <paper id="2004">
    <title>Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles</title>
    <author><first>Iacer</first><last>Calixto</last></author>
    <author><first>Daniel</first><last>Stein</last></author>
    <author><first>Evgeny</first><last>Matusov</last></author>
    <author><first>Sheila</first><last>Castilho</last></author>
    <author><first>Andy</first><last>Way</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>31&#8211;37</pages>
    <url>http://www.aclweb.org/anthology/W17-2004</url>
    <abstract>In this paper, we study how humans perceive the use of images as an additional
	knowledge source to machine-translate user-generated product listings in an
	e-commerce company. We conduct a human evaluation where we assess how a
	multi-modal neural machine translation (NMT) model compares to two text-only
	approaches: a conventional state-of-the-art attention-based NMT and a
	phrase-based statistical machine translation (PBSMT) model. We evaluate
	translations obtained with different systems and also discuss the data set of
	user-generated product listings, which in our case comprises both product
	listings and associated images. We found that humans preferred translations
	obtained with a PBSMT system to both text-only and multi-modal NMT over 56% of
	the time. Nonetheless, human evaluators ranked translations from a multi-modal
	NMT model as better than those of a text-only NMT over 88% of the time, which
	suggests that images do help NMT in this use-case.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>calixto-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2005">
    <title>The BreakingNews Dataset</title>
    <author><first>Arnau</first><last>Ramisa</last></author>
    <author><first>Fei</first><last>Yan</last></author>
    <author><first>Francesc</first><last>Moreno-Noguer</last></author>
    <author><first>Krystian</first><last>Mikolajczyk</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>38&#8211;39</pages>
    <url>http://www.aclweb.org/anthology/W17-2005</url>
    <abstract>We present BreakingNews, a novel dataset with approximately 100K news articles
	including images, text and captions, and enriched with heterogeneous meta-data
	(e.g. GPS coordinates and popularity metrics). The tenuous connection between
	the images and text in news data makes this dataset well suited to taking work at the
	intersection of Computer Vision and Natural Language Processing to the next
	step, and we hence hope it will help spur progress in the field.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ramisa-EtAl:2017:VL17</bibkey>
  </paper>

  <paper id="2006">
    <title>Automatic identification of head movements in video-recorded conversations: can words help?</title>
    <author><first>Patrizia</first><last>Paggio</last></author>
    <author><first>Costanza</first><last>Navarretta</last></author>
    <author><first>Bart</first><last>Jongejan</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>40&#8211;42</pages>
    <url>http://www.aclweb.org/anthology/W17-2006</url>
    <abstract>We present an approach where an SVM classifier learns to classify head
	movements based on measurements of velocity, acceleration, and jerk (the third
	derivative of position with respect to time). The classifier is then used to add
	head-movement annotations to new video data. The results of the
	automatic annotation are evaluated against manual annotations in the same data
	and show an accuracy of 68% with respect to these. The results also show that
	using jerk improves accuracy. We then conduct an investigation of the
	overlap between temporal sequences classified as either movement or non-movement
	and the speech stream of the person performing the gesture. The statistics derived from this
	analysis show that using word features may help increase the accuracy of the
	model.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>paggio-navarretta-jongejan:2017:VL17</bibkey>
  </paper>

  <paper id="2007">
    <title>Multi-Modal Fashion Product Retrieval</title>
    <author><first>Antonio</first><last>Rubio Romano</last></author>
    <author><first>LongLong</first><last>Yu</last></author>
    <author><first>Edgar</first><last>Simo-Serra</last></author>
    <author><first>Francesc</first><last>Moreno-Noguer</last></author>
    <booktitle>Proceedings of the Sixth Workshop on Vision and Language</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>43&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-2007</url>
    <abstract>Finding a product in the fashion world can be a daunting task. Every day,
	e-commerce sites are updated with thousands of images and their associated
	metadata (textual information), deepening the problem. In this paper, we
	leverage both the images and textual metadata and propose a joint multi-modal
	embedding that maps both the text and images into a common latent space.
	Distances in the latent space correspond to similarity between products,
	allowing us to effectively perform retrieval in this latent space. We compare
	against existing approaches and show significant improvements in retrieval
	tasks on a large-scale e-commerce dataset.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rubioromano-EtAl:2017:VL17</bibkey>
  </paper>

</volume>

