<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="2800">
    <title>Proceedings of the First Workshop on Language Grounding for Robotics</title>
    <editor>Mohit Bansal</editor>
    <editor>Cynthia Matuszek</editor>
    <editor>Jacob Andreas</editor>
    <editor>Yoav Artzi</editor>
    <editor>Yonatan Bisk</editor>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-28</url>
    <bibtype>book</bibtype>
    <bibkey>RoboNLP:2017</bibkey>
  </paper>

  <paper id="2801">
    <title>Grounding Language for Interactive Task Learning</title>
    <author><first>Peter</first><last>Lindes</last></author>
    <author><first>Aaron</first><last>Mininger</last></author>
    <author><first>James R.</first><last>Kirk</last></author>
    <author><first>John E.</first><last>Laird</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;9</pages>
    <url>http://www.aclweb.org/anthology/W17-2801</url>
    <dataset>W17-2801.Datasets.zip</dataset>
    <attachment type="poster">W17-2801.Poster.pdf</attachment>
    <abstract>This paper describes how language is
	grounded by a comprehension system
	called Lucia within a robotic agent called
	Rosie that can manipulate objects and
	navigate indoors. The whole system is
	built within the Soar cognitive architecture
	and uses Embodied Construction Grammar
	(ECG) as a formalism for describing
	linguistic knowledge. Grounding is performed
	using knowledge from the grammar
	itself, from the linguistic context,
	from the agent's perception, and from an
	ontology of long-term knowledge about
	object categories and properties, and actions
	the agent can perform. The paper
	also describes a benchmark corpus of 200
	sentences in this domain along with test
	versions of the world model and ontology
	and gold-standard meanings for each
	of the sentences. The benchmark is contained
	in the supplemental materials.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lindes-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2802">
    <title>Learning how to Learn: An Adaptive Dialogue Agent for Incrementally Learning Visually Grounded Word Meanings</title>
    <author><first>Yanchao</first><last>Yu</last></author>
    <author><first>Arash</first><last>Eshghi</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>10&#8211;19</pages>
    <url>http://www.aclweb.org/anthology/W17-2802</url>
    <abstract>We present an optimised multi-modal dialogue agent for interactive learning of
	visually grounded word meanings from a human tutor, trained on real human-human
	tutoring data. Within a life-long interactive learning period, the agent,
	trained using Reinforcement Learning (RL), must be able to handle natural
	conversations with human users, and achieve good learning performance (i.e.
	accuracy) while minimising human effort in the learning process. We train and
	evaluate this system in interaction with a simulated human tutor, which is
	built on the BURCHAK corpus &#8211; a Human-Human Dialogue dataset for the visual
	learning task. The results show that: 1) the learned policy can coherently
	interact with the simulated user to achieve the goal of the task (i.e. learning
	visual attributes of objects, e.g. colour and shape); and 2) it finds a better
	trade-off between classifier accuracy and tutoring costs than hand-crafted
	rule-based policies, including ones with dynamic policies.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yu-eshghi-lemon:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2803">
    <title>Guiding Interaction Behaviors for Multi-modal Grounded Language Learning</title>
    <author><first>Jesse</first><last>Thomason</last></author>
    <author><first>Jivko</first><last>Sinapov</last></author>
    <author><first>Raymond</first><last>Mooney</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>20&#8211;24</pages>
    <url>http://www.aclweb.org/anthology/W17-2803</url>
    <abstract>Multi-modal grounded language learning connects language predicates to physical
	properties of objects in the world. Sensing with multiple modalities, such as
	audio, haptics, and visual colors and shapes, while performing interaction
	behaviors like lifting, dropping, and looking at objects, enables a robot to
	ground non-visual predicates like &#x201c;empty&#x201d; as well as visual predicates like
	&#x201c;red&#x201d;. Previous work has established that grounding in multi-modal space
	improves performance on object retrieval from human descriptions. In this work,
	we gather behavior annotations from humans and demonstrate that these improve
	language grounding performance by allowing a system to focus on relevant
	behaviors for words like &#x201c;white&#x201d; or &#x201c;half-full&#x201d; that can be understood by
	looking or lifting, respectively. We also explore adding modality annotations
	(whether to focus on audio or haptics when performing a behavior), which
	improves performance, and sharing information between linguistically related
	predicates (if &#x201c;green&#x201d; is a color, &#x201c;white&#x201d; is a color), which improves
	grounding recall but at the cost of precision.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>thomason-sinapov-mooney:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2804">
    <title>Structured Learning for Context-aware Spoken Language Understanding of Robotic Commands</title>
    <author><first>Andrea</first><last>Vanzo</last></author>
    <author><first>Danilo</first><last>Croce</last></author>
    <author><first>Roberto</first><last>Basili</last></author>
    <author><first>Daniele</first><last>Nardi</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>25&#8211;34</pages>
    <url>http://www.aclweb.org/anthology/W17-2804</url>
    <abstract>Service robots are expected to operate in specific environments, where the
	presence of humans plays a key role. A major feature of such robotics platforms
	is thus the ability to react to spoken commands. This requires understanding
	the user's utterance with sufficient accuracy to trigger the robot's
	reaction.
	Such correct interpretation of linguistic exchanges depends on physical,
	cognitive and language-dependent aspects related to the environment. In this
	work, we present the empirical evaluation of an adaptive Spoken Language
	Understanding chain for robotic commands that explicitly depends on the
	operational environment during both the learning and recognition stages. The
	effectiveness of such context-sensitive command interpretation is tested
	against an extension of an existing corpus of commands that introduces
	explicit perceptual knowledge; this enables deeper measurements showing that
	more accurate disambiguation can indeed be obtained.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vanzo-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2805">
    <title>Natural Language Grounding and Grammar Induction for Robotic Manipulation Commands</title>
    <author><first>Muhannad</first><last>Alomari</last></author>
    <author><first>Paul</first><last>Duckworth</last></author>
    <author><first>Majd</first><last>Hawasly</last></author>
    <author><first>David C.</first><last>Hogg</last></author>
    <author><first>Anthony G.</first><last>Cohn</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>35&#8211;43</pages>
    <url>http://www.aclweb.org/anthology/W17-2805</url>
    <abstract>We present a cognitively plausible system capable of acquiring knowledge in
	language and vision from pairs of short video clips and linguistic
	descriptions. The aim of this work is to teach a robot manipulator how to
	execute natural language commands by demonstration. This is achieved by first
	learning a set of visual &#x2018;concepts&#x2019; that abstract the visual feature spaces
	into concepts that have human-level meaning; second, learning the
	mapping/grounding between words and the extracted visual concepts; and third,
	inducing grammar rules via a semantic representation known as Robot Control
	Language (RCL).
	We evaluate our approach against state-of-the-art supervised and unsupervised
	grounding and grammar induction systems, and show that a robot can learn to
	execute never-before-seen commands from pairs of unlabelled linguistic and
	visual inputs.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alomari-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2806">
    <title>Communication with Robots using Multilayer Recurrent Networks</title>
    <author><first>Bed&#x159;ich</first><last>Pi&#x161;l</last></author>
    <author><first>David</first><last>Mare&#x10D;ek</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>44&#8211;48</pages>
    <url>http://www.aclweb.org/anthology/W17-2806</url>
    <abstract>In this paper, we describe an improvement on the task of giving instructions to
	robots in a simulated block world using unrestricted natural language commands.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>pivsl-marevcek:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2807">
    <title>Grounding Symbols in Multi-Modal Instructions</title>
    <author><first>Yordan</first><last>Hristov</last></author>
    <author><first>Svetlin</first><last>Penkov</last></author>
    <author><first>Alex</first><last>Lascarides</last></author>
    <author><first>Subramanian</first><last>Ramamoorthy</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>49&#8211;57</pages>
    <url>http://www.aclweb.org/anthology/W17-2807</url>
    <abstract>As robots begin to cohabit with humans in semi-structured environments, the
	need arises to understand instructions involving rich variability &#8211; for
	instance, learning to ground symbols in the physical world. Realistically, this
	task must cope with small datasets consisting of a particular user's contextual
	assignment of meaning to terms. We present a method for processing a raw stream
	of cross-modal input &#8211; i.e., linguistic instructions, visual perception of a
	scene, and a concurrent trace of 3D eye-tracking fixations &#8211; to produce a
	segmentation of objects with a corresponding association to high-level
	concepts. To test our framework we present experiments in a table-top object
	manipulation scenario. Our results show our model learns the user's notion of
	colour and shape from a small number of physical demonstrations, generalising
	to identifying physical referents for novel combinations of the words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hristov-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2808">
    <title>Exploring Variation of Natural Human Commands to a Robot in a Collaborative Navigation Task</title>
    <author><first>Matthew</first><last>Marge</last></author>
    <author><first>Claire</first><last>Bonial</last></author>
    <author><first>Ashley</first><last>Foots</last></author>
    <author><first>Cory</first><last>Hayes</last></author>
    <author><first>Cassidy</first><last>Henry</last></author>
    <author><first>Kimberly</first><last>Pollard</last></author>
    <author><first>Ron</first><last>Artstein</last></author>
    <author><first>Clare</first><last>Voss</last></author>
    <author><first>David</first><last>Traum</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>58&#8211;66</pages>
    <url>http://www.aclweb.org/anthology/W17-2808</url>
    <attachment type="poster">W17-2808.Poster.pdf</attachment>
    <abstract>Robot-directed communication is variable, and may change based on human
	perception of robot capabilities. To collect training data for a dialogue
	system and to investigate possible communication changes over time, we
	developed a Wizard-of-Oz study that (a) simulates a robot's limited
	understanding, and (b) collects dialogues where human participants build a
	progressively better mental model of the robot's understanding. With ten
	participants, we collected ten hours of human-robot dialogue. We analyzed the
	structure of instructions that participants gave to a remote robot before it
	responded. Our findings show a general initial preference for including metric
	information (e.g., move forward 3 feet) over landmarks (e.g., move to the desk)
	in motion commands, but this decreased over time, suggesting changes in
	perception.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>marge-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2809">
    <title>A Tale of Two DRAGGNs: A Hybrid Approach for Interpreting Action-Oriented and Goal-Oriented Instructions</title>
    <author><first>Siddharth</first><last>Karamcheti</last></author>
    <author><first>Edward Clem</first><last>Williams</last></author>
    <author><first>Dilip</first><last>Arumugam</last></author>
    <author><first>Mina</first><last>Rhee</last></author>
    <author><first>Nakul</first><last>Gopalan</last></author>
    <author><first>Lawson L.S.</first><last>Wong</last></author>
    <author><first>Stefanie</first><last>Tellex</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>67&#8211;75</pages>
    <url>http://www.aclweb.org/anthology/W17-2809</url>
    <abstract>Robots operating alongside humans in diverse, stochastic environments must be
	able to accurately interpret natural language commands. These instructions
	often fall into one of two categories: those that specify a goal condition or
	target state, and those that specify explicit actions, or how to perform a
	given task. Recent approaches have used reward functions as a semantic
	representation of goal-based commands, which allows for the use of a
	state-of-the-art planner to find a policy for the given task. However, these
	reward functions cannot be directly used to represent action-oriented commands.
	We introduce a new hybrid approach, the Deep Recurrent Action-Goal Grounding
	Network (DRAGGN), for task grounding and execution that handles natural
	language from either category as input, and generalizes to unseen environments.
	Our robot-simulation results demonstrate that a system successfully
	interpreting both goal-oriented and action-oriented task specifications brings
	us closer to robust natural language understanding for human-robot interaction.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>karamcheti-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2810">
    <title>Are Distributional Representations Ready for the Real World? Evaluating Word Vectors for Grounded Perceptual Meaning</title>
    <author><first>Li</first><last>Lucy</last></author>
    <author><first>Jon</first><last>Gauthier</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>76&#8211;85</pages>
    <url>http://www.aclweb.org/anthology/W17-2810</url>
    <abstract>Distributional word representation methods exploit word co-occurrences to build
	compact vector encodings of words. While these representations enjoy widespread
	use in modern natural language processing, it is unclear whether they
	accurately encode all necessary facets of conceptual meaning. In this paper, we
	evaluate how well these representations can predict perceptual and conceptual
	features of concrete concepts, drawing on two semantic norm datasets sourced
	from human participants. We find that several standard word representations
	fail to encode many salient perceptual features of concepts, and show that
	these deficits correlate with word-word similarity prediction errors. Our
	analyses provide motivation for grounded and embodied language learning
	approaches, which may help to remedy these deficits.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lucy-gauthier:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2811">
    <title>Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction</title>
    <author><first>Jekaterina</first><last>Novikova</last></author>
    <author><first>Christian</first><last>Dondrup</last></author>
    <author><first>Ioannis</first><last>Papaioannou</last></author>
    <author><first>Oliver</first><last>Lemon</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>86&#8211;94</pages>
    <url>http://www.aclweb.org/anthology/W17-2811</url>
    <abstract>Recognition of social signals, coming from human facial expressions or prosody
	of human speech, is a popular research topic in human-robot interaction
	studies. There is also a long line of research in the spoken dialogue community
	that investigates user satisfaction in relation to dialogue characteristics.
	However, very little research relates a combination of multimodal social
	signals and language features detected during spoken face-to-face human-robot
	interaction to the resulting user perception of a robot. In this paper we show
	how different emotional facial expressions of human users, in combination with
	prosodic characteristics of human speech and features of human-robot dialogue,
	correlate with users’ impressions of the robot after a conversation. We find
	that happiness in the user’s recognised facial expression strongly correlates
	with likeability of a robot, while dialogue-related features (such as number of
	human turns or number of sentences per robot utterance) correlate with
	perceiving a robot as intelligent. In addition, we show that emotional features
	of facial expressions and prosody are better predictors of human ratings
	related to perceived robot likeability and anthropomorphism,
	while linguistic and non-linguistic features more often predict perceived robot
	intelligence and interpretability. As such, these characteristics may in future
	be used as an online reward signal for in-situ Reinforcement Learning-based
	adaptive human-robot dialogue systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>novikova-EtAl:2017:RoboNLP</bibkey>
  </paper>

  <paper id="2812">
    <title>Towards Problem Solving Agents that Communicate and Learn</title>
    <author><first>Anjali</first><last>Narayan-Chen</last></author>
    <author><first>Colin</first><last>Graber</last></author>
    <author><first>Mayukh</first><last>Das</last></author>
    <author><first>Md Rakibul</first><last>Islam</last></author>
    <author><first>Soham</first><last>Dan</last></author>
    <author><first>Sriraam</first><last>Natarajan</last></author>
    <author><first>Janardhan Rao</first><last>Doppa</last></author>
    <author><first>Julia</first><last>Hockenmaier</last></author>
    <author><first>Martha</first><last>Palmer</last></author>
    <author><first>Dan</first><last>Roth</last></author>
    <booktitle>Proceedings of the First Workshop on Language Grounding for Robotics</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>95&#8211;103</pages>
    <url>http://www.aclweb.org/anthology/W17-2812</url>
    <abstract>Agents that communicate back and forth with humans to help them execute
	non-linguistic tasks are a long-sought goal of AI. These agents need to
	translate between utterances and actionable meaning representations that can be
	interpreted by task-specific problem solvers in a context-dependent manner.
	They should also be able to learn such actionable interpretations for new
	predicates on the fly. We define an agent architecture for this scenario and
	present a series of experiments in the Blocks World domain that illustrate how
	our architecture supports language learning and problem solving in this domain.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>narayanchen-EtAl:2017:RoboNLP</bibkey>
  </paper>

</volume>

