We consider a new perspective on dialog state tracking (DST), the task of estimating a user’s goal through the course of a dialog. By formulating DST as a semantic parsing task over hierarchical representations, we can incorporate semantic compositionality, cross-domain knowledge sharing and co-reference. We present TreeDST, a dataset of 27k conversations annotated with tree-structured dialog states and system acts. We describe an encoder-decoder framework for DST with hierarchical representations, which leads to a ~20% improvement over state-of-the-art DST approaches that operate on a flat meaning space of slot-value pairs.
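To make the contrast between a hierarchical dialog state and a flat meaning space of slot-value pairs concrete, the following is a minimal Python sketch. The `Node` class, the `flatten` helper, and the flight-booking labels are illustrative assumptions only; they are not the TreeDST annotation schema or the encoder-decoder model described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One node of a hypothetical tree-structured dialog state."""
    label: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node, prefix: Tuple[str, ...] = ()) -> Dict[str, str]:
    """Collapse a tree into flat slot-value pairs.

    Each leaf is treated as a value, and the path of labels above it
    becomes a dotted slot name.
    """
    if not node.children:
        return {".".join(prefix): node.label}
    flat: Dict[str, str] = {}
    for child in node.children:
        flat.update(flatten(child, prefix + (node.label,)))
    return flat


# Hypothetical state: the labels are invented for illustration.
state = Node("flight", [
    Node("book", [
        Node("destination", [Node("Boston")]),
        Node("departure", [
            Node("date", [Node("tomorrow")]),
            Node("time", [Node("morning")]),
        ]),
    ]),
])

flat_state = flatten(state)
for slot, value in flat_state.items():
    print(f"{slot} = {value}")
```

In this toy example the tree keeps `date` and `time` grouped under a shared `departure` node, so a sub-tree could in principle be reused across domains, whereas the flattened form encodes the same grouping only implicitly in the slot names.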
Active learning (AL) is often used in corpus construction (CC) to select “informative” documents for annotation. This is ideal for focusing annotation effort when not all documents can be annotated, but it has the limitation that it is carried out in a closed loop, selecting points that will improve an existing model. For phenomena-driven and exploratory CC, the lack of existing models and of a specific task to target makes traditional AL inapplicable. In this paper we propose a novel method for model-free AL that uses characteristics of the phenomena to select documents for annotation. The method can also supplement traditional closed-loop AL-based CC to extend the utility of the resulting corpus beyond a single task. We introduce our tool, MOVE, and show its potential with a real-world case study.
Proper annotation process management is crucial to the construction of corpora, which are in turn indispensable to the data-driven techniques that have come to the forefront in NLP during the last two decades. It is still common to see ad hoc tools created for a specific annotation project, but it is time this changed: creating such tools is expensive in labor and time, and is secondary to corpus creation. In addition, such tools likely lack proper annotation process management, which becomes increasingly important as corpora grow in size and complexity. This paper first raises a list of ten needs that any general-purpose annotation system should address moving forward, such as user and role management, delegation and monitoring of work, diffing and merging of annotators' work, versioning of corpora, multilingual support, and import/export format flexibility. A framework to address these needs is then proposed, along with an explanation of how proper annotation process management benefits the creation and maintenance of corpora. The paper then introduces SLATE (Segment and Link-based Annotation Tool Enhanced), the second iteration of a web-based annotation tool, which is being rewritten to implement the proposed framework.
Corpus-based and statistical approaches have been the mainstream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but such resources are insufficient for many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources for new languages. This paper presents the latest development efforts of our project, which aims to create a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability; and iii) the evaluation platform used to test the entire architectural framework.