<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="2800">
    <title>Proceedings of the First Workshop on Language Grounding for Robotics</title>
    <editor>Mohit Bansal</editor>
    <editor>Cynthia Matuszek</editor>
    <editor>Jacob Andreas</editor>
    <editor>Yoav Artzi</editor>
    <editor>Yonatan Bisk</editor>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-28</url>
    <bibtype>book</bibtype>
    <bibkey>RoboNLP:2017</bibkey>
  </paper>

  <paper id="2801">
    <title>Grounding Language for Interactive Task Learning</title>
    <author><first>Peter</first><last>Lindes</last></author>
    <author><first>Aaron</first><last>Mininger</last></author>
    <author><first>James R.</first><last>Kirk</last></author>
    <author><first>John E.</first><last>Laird</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;9</pages>
    <url>http://www.aclweb.org/anthology/W17-2801</url>
    <dataset>W17-2801.Datasets.zip</dataset>
    <attachment type="poster">W17-2801.Poster.pdf</attachment>
    <abstract>This paper describes how language is
	grounded by a comprehension system
	called Lucia within a robotic agent called
	Rosie that can manipulate objects and
	navigate indoors. The whole system is
	built within the Soar cognitive architecture
	and uses Embodied Construction Grammar
	(ECG) as a formalism for describing
	linguistic knowledge. Grounding is performed
	using knowledge from the grammar
	itself, from the linguistic context,
	from the agent's perception, and from an
	ontology of long-term knowledge about
	object categories and properties, and actions
	the agent can perform. The paper
	also describes a benchmark corpus of 200
	sentences in this domain along with test
	versions of the world model and ontology
	and gold-standard meanings for each
	of the sentences. The benchmark is contained
	in the supplemental materials.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lindes-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2802">
    <title>Learning how to Learn: An Adaptive Dialogue Agent for Incrementally Learning Visually Grounded Word Meanings</title>
    <author><first>Yanchao</first><last>Yu</last></author>
    <author><first>Arash</first><last>Eshghi</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>10&#8211;19</pages>
    <url>http://www.aclweb.org/anthology/W17-2802</url>
    <abstract>We present an optimised multi-modal dialogue agent for interactive learning of
	visually grounded word meanings from a human tutor, trained on real human-human
	tutoring data. Within a life-long interactive learning period, the agent,
	trained using Reinforcement Learning (RL), must be able to handle natural
	conversations with human users, and achieve good learning performance (i.e.
	accuracy) while minimising human effort in the learning process. We train and
	evaluate this system in interaction with a simulated human tutor, which is
	built on the BURCHAK corpus &#8211; a Human-Human Dialogue dataset for the visual
	learning task. The results show that: 1) the learned policy can coherently
	interact with the simulated user to achieve the goal of the task (i.e. learning
	visual attributes of objects, e.g. colour and shape); and 2) it finds a better
	trade-off between classifier accuracy and tutoring costs than hand-crafted
	rule-based policies, including ones with dynamic policies.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yu-eshghi-lemon:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2803">
    <title>Guiding Interaction Behaviors for Multi-modal Grounded Language Learning</title>
    <author><first>Jesse</first><last>Thomason</last></author>
    <author><first>Jivko</first><last>Sinapov</last></author>
    <author><first>Raymond</first><last>Mooney</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>20&#8211;24</pages>
    <url>http://www.aclweb.org/anthology/W17-2803</url>
    <abstract>Multi-modal grounded language learning connects language predicates to physical
	properties of objects in the world. Sensing with multiple modalities, such as
	audio, haptics, and visual colors and shapes, while performing interaction
	behaviors like lifting, dropping, and looking at objects, enables a robot to
	ground non-visual predicates like &#x201c;empty&#x201d; as well as visual predicates like
	&#x201c;red&#x201d;. Previous work has established that grounding in multi-modal space
	improves performance on object retrieval from human descriptions. In this work,
	we gather behavior annotations from humans and demonstrate that these improve
	language grounding performance by allowing a system to focus on relevant
	behaviors for words like &#x201c;white&#x201d; or &#x201c;half-full&#x201d; that can be understood by
	looking or lifting, respectively. We also explore adding modality annotations
	(whether to focus on audio or haptics when performing a behavior), which
	improves performance, and sharing information between linguistically related
	predicates (if &#x201c;green&#x201d; is a color, &#x201c;white&#x201d; is a color), which improves
	grounding recall but at the cost of precision.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>thomason-sinapov-mooney:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2804">
    <title>Structured Learning for Context-aware Spoken Language Understanding of Robotic Commands</title>
    <author><first>Andrea</first><last>Vanzo</last></author>
    <author><first>Danilo</first><last>Croce</last></author>
    <author><first>Roberto</first><last>Basili</last></author>
    <author><first>Daniele</first><last>Nardi</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>25&#8211;34</pages>
    <url>http://www.aclweb.org/anthology/W17-2804</url>
    <abstract>Service robots are expected to operate in specific environments, where the
	presence of humans plays a key role. A major feature of such robotics platforms
	is thus the ability to react to spoken commands. This requires understanding
	the user's utterance with sufficient accuracy to trigger the robot's
	reaction.
	Such correct interpretation of linguistic exchanges depends on physical,
	cognitive and language-dependent aspects related to the environment. In this
	work, we present the empirical evaluation of an adaptive Spoken Language
	Understanding chain for robotic commands that explicitly depends on the
	operational environment during both the learning and recognition stages. The
	effectiveness of such context-sensitive command interpretation is tested
	against an extension of an existing corpus of commands that introduces
	explicit perceptual knowledge; this enables deeper measurements showing that
	more accurate disambiguation can indeed be obtained.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vanzo-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2805">
    <title>Natural Language Grounding and Grammar Induction for Robotic Manipulation Commands</title>
    <author><first>Muhannad</first><last>Alomari</last></author>
    <author><first>Paul</first><last>Duckworth</last></author>
    <author><first>Majd</first><last>Hawasly</last></author>
    <author><first>David C.</first><last>Hogg</last></author>
    <author><first>Anthony G.</first><last>Cohn</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>35&#8211;43</pages>
    <url>http://www.aclweb.org/anthology/W17-2805</url>
    <abstract>We present a cognitively plausible system capable of acquiring knowledge in
	language and vision from pairs of short video clips and linguistic
	descriptions. The aim of this work is to teach a robot manipulator how to
	execute natural language commands by demonstration. This is achieved by first
	learning a set of visual &#x2018;concepts&#x2019; that abstract the visual feature spaces
	into concepts that have human-level meaning; second, learning the
	mapping/grounding between words and the extracted visual concepts; and third,
	inducing grammar rules via a semantic representation known as Robot Control
	Language (RCL).
	We evaluate our approach against state-of-the-art supervised and unsupervised
	grounding and grammar induction systems, and show that a robot can learn to
	execute never-before-seen commands from pairs of unlabelled linguistic and
	visual inputs.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alomari-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2806">
    <title>Communication with Robots using Multilayer Recurrent Networks</title>
    <author><first>Bed&#x159;ich</first><last>Pi&#x161;l</last></author>
    <author><first>David</first><last>Mare&#x10D;ek</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>44&#8211;48</pages>
    <url>http://www.aclweb.org/anthology/W17-2806</url>
    <abstract>In this paper, we describe an improvement on the task of giving instructions to
	robots in a simulated block world using unrestricted natural language commands.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>pivsl-marevcek:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2807">
    <title>Grounding Symbols in Multi-Modal Instructions</title>
    <author><first>Yordan</first><last>Hristov</last></author>
    <author><first>Svetlin</first><last>Penkov</last></author>
    <author><first>Alex</first><last>Lascarides</last></author>
    <author><first>Subramanian</first><last>Ramamoorthy</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>49&#8211;57</pages>
    <url>http://www.aclweb.org/anthology/W17-2807</url>
    <abstract>As robots begin to cohabit with humans in semi-structured environments, the
	need arises to understand instructions involving rich variability &#8211; for
	instance, learning to ground symbols in the physical world. Realistically, this
	task must cope with small datasets consisting of a particular user's contextual
	assignment of meaning to terms. We present a method for processing a raw stream
	of cross-modal input &#8211; i.e., linguistic instructions, visual perception of a
	scene, and a concurrent trace of 3D eye-tracking fixations &#8211; to produce a
	segmentation of objects with a corresponding association to high-level
	concepts. To test our framework we present experiments in a table-top object
	manipulation scenario. Our results show our model learns the user's notion of
	colour and shape from a small number of physical demonstrations, generalising
	to identifying physical referents for novel combinations of the words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hristov-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2808">
    <title>Exploring Variation of Natural Human Commands to a Robot in a Collaborative Navigation Task</title>
    <author><first>Matthew</first><last>Marge</last></author>
    <author><first>Claire</first><last>Bonial</last></author>
    <author><first>Ashley</first><last>Foots</last></author>
    <author><first>Cory</first><last>Hayes</last></author>
    <author><first>Cassidy</first><last>Henry</last></author>
    <author><first>Kimberly</first><last>Pollard</last></author>
    <author><first>Ron</first><last>Artstein</last></author>
    <author><first>Clare</first><last>Voss</last></author>
    <author><first>David</first><last>Traum</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>58&#8211;66</pages>
    <url>http://www.aclweb.org/anthology/W17-2808</url>
    <attachment type="poster">W17-2808.Poster.pdf</attachment>
    <abstract>Robot-directed communication is variable, and may change based on human
	perception of robot capabilities. To collect training data for a dialogue
	system and to investigate possible communication changes over time, we
	developed a Wizard-of-Oz study that (a) simulates a robot's limited
	understanding, and (b) collects dialogues where human participants build a
	progressively better mental model of the robot's understanding. With ten
	participants, we collected ten hours of human-robot dialogue. We analyzed the
	structure of instructions that participants gave to a remote robot before it
	responded. Our findings show a general initial preference for including metric
	information (e.g., move forward 3 feet) over landmarks (e.g., move to the desk)
	in motion commands, but this decreased over time, suggesting changes in
	perception.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>marge-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2809">
    <title>A Tale of Two DRAGGNs: A Hybrid Approach for Interpreting Action-Oriented and Goal-Oriented Instructions</title>
    <author><first>Siddharth</first><last>Karamcheti</last></author>
    <author><first>Edward Clem</first><last>Williams</last></author>
    <author><first>Dilip</first><last>Arumugam</last></author>
    <author><first>Mina</first><last>Rhee</last></author>
    <author><first>Nakul</first><last>Gopalan</last></author>
    <author><first>Lawson L.S.</first><last>Wong</last></author>
    <author><first>Stefanie</first><last>Tellex</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>67&#8211;75</pages>
    <url>http://www.aclweb.org/anthology/W17-2809</url>
    <abstract>Robots operating alongside humans in diverse, stochastic environments must be
	able to accurately interpret natural language commands. These instructions
	often fall into one of two categories: those that specify a goal condition or
	target state, and those that specify explicit actions, or how to perform a
	given task. Recent approaches have used reward functions as a semantic
	representation of goal-based commands, which allows for the use of a
	state-of-the-art planner to find a policy for the given task. However, these
	reward functions cannot be directly used to represent action-oriented commands.
	We introduce a new hybrid approach, the Deep Recurrent Action-Goal Grounding
	Network (DRAGGN), for task grounding and execution that handles natural
	language from either category as input, and generalizes to unseen environments.
	Our robot-simulation results demonstrate that a system successfully
	interpreting both goal-oriented and action-oriented task specifications brings
	us closer to robust natural language understanding for human-robot interaction.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>karamcheti-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2810">
    <title>Are Distributional Representations Ready for the Real World? Evaluating Word Vectors for Grounded Perceptual Meaning</title>
    <author><first>Li</first><last>Lucy</last></author>
    <author><first>Jon</first><last>Gauthier</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>76&#8211;85</pages>
    <url>http://www.aclweb.org/anthology/W17-2810</url>
    <abstract>Distributional word representation methods exploit word co-occurrences to build
	compact vector encodings of words. While these representations enjoy widespread
	use in modern natural language processing, it is unclear whether they
	accurately encode all necessary facets of conceptual meaning. In this paper, we
	evaluate how well these representations can predict perceptual and conceptual
	features of concrete concepts, drawing on two semantic norm datasets sourced
	from human participants. We find that several standard word representations
	fail to encode many salient perceptual features of concepts, and show that
	these deficits correlate with word-word similarity prediction errors. Our
	analyses provide motivation for grounded and embodied language learning
	approaches, which may help to remedy these deficits.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lucy-gauthier:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2811">
    <title>Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction</title>
    <author><first>Jekaterina</first><last>Novikova</last></author>
    <author><first>Christian</first><last>Dondrup</last></author>
    <author><first>Ioannis</first><last>Papaioannou</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>86&#8211;94</pages>
    <url>http://www.aclweb.org/anthology/W17-2811</url>
    <abstract>Recognition of social signals, coming from human facial expressions or prosody
	of human speech, is a popular research topic in human-robot interaction
	studies. There is also a long line of research in the spoken dialogue community
	that investigates user satisfaction in relation to dialogue characteristics.
	However, very little research relates a combination of multimodal social
	signals and language features detected during spoken face-to-face human-robot
	interaction to the resulting user perception of a robot. In this paper we show
	how different emotional facial expressions of human users, in combination with
	prosodic characteristics of human speech and features of human-robot dialogue,
	correlate with users’ impressions of the robot after a conversation. We find
	that happiness in the user’s recognised facial expression strongly correlates
	with likeability of a robot, while dialogue-related features (such as number of
	human turns or number of sentences per robot utterance) correlate with
	perceiving a robot as intelligent. In addition, we show that emotional features
	of facial expressions and prosody are better predictors of human ratings
	related to perceived robot likeability and anthropomorphism,
	while linguistic and non-linguistic features more often predict perceived robot
	intelligence and interpretability. As such, these characteristics may in future
	be used as an online reward signal for in-situ Reinforcement Learning-based
	adaptive human-robot dialogue systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>novikova-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2812">
    <title>Towards Problem Solving Agents that Communicate and Learn</title>
    <author><first>Anjali</first><last>Narayan-Chen</last></author>
    <author><first>Colin</first><last>Graber</last></author>
    <author><first>Mayukh</first><last>Das</last></author>
    <author><first>Md Rakibul</first><last>Islam</last></author>
    <author><first>Soham</first><last>Dan</last></author>
    <author><first>Sriraam</first><last>Natarajan</last></author>
    <author><first>Janardhan Rao</first><last>Doppa</last></author>
    <author><first>Julia</first><last>Hockenmaier</last></author>
    <author><first>Martha</first><last>Palmer</last></author>
    <author><first>Dan</first><last>Roth</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>95&#8211;103</pages>
    <url>http://www.aclweb.org/anthology/W17-2812</url>
    <abstract>Agents that communicate back and forth with humans to help them execute
	non-linguistic tasks are a long-sought goal of AI. These agents need to
	translate between utterances and actionable meaning representations that can be
	interpreted by task-specific problem solvers in a context-dependent manner.
	They should also be able to learn such actionable interpretations for new
	predicates on the fly. We define an agent architecture for this scenario and
	present a series of experiments in the Blocks World domain that illustrate how
	our architecture supports language learning and problem solving in this domain.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>narayanchen-EtAl:2017:RoboNLP</bibkey>
  </paper>

</volume>

