16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation PROCEEDINGS
Harry Bunt (Editor)
This paper discusses the current state of developing an ISO standard annotation scheme for quantification phenomena in natural language, as part of the ISO Semantic Annotation Framework (ISO 24617). A proposed approach that combines ideas from the theory of generalised quantifiers and from neo-Davidsonian event semantics was adopted by the ISO organisation in 2019 as a starting point for developing such an annotation scheme. This scheme consists of (1) a conceptual ‘metamodel’ that visualises the types of entities, functions and relations that go into annotations of quantification; (2) an abstract syntax which defines ‘annotation structures’ as triples and other set-theoretic constructs; (3) an XML-based representation of annotation structures (‘concrete syntax’); and (4) a compositional semantics of annotation structures. The latter three components together define the interpreted markup language QuantML. The focus in this paper is on the structuring of the semantic information needed to characterise quantification in natural language and the representation of these structures in QuantML.
ISO-TimeML is an international standard for multilingual event annotation, detection, categorization and linking. In this paper, we present the Hindi TimeBank, an ISO-TimeML annotated reference corpus for the detection and classification of events, states and time expressions, and the links between them. Based on contemporary developments in Hindi event recognition, we propose language-independent and language-specific deviations from the ISO-TimeML guidelines, but preserve the schema. These deviations include the inclusion of annotator confidence, and an independent mechanism of identifying and annotating states (such as copulars and existentials). With this paper, we present an open-source corpus, the Hindi TimeBank. The Hindi TimeBank is a 1,000-article dataset, with over 25,000 events, 3,500 states and 2,000 time expressions. We analyze the dataset in detail and provide a class-wise distribution of events, states and time expressions. Our guidelines and dataset are backed by high average inter-annotator agreement scores.
The paper presents an annotation schema with the following characteristics: it is formally compact; it systematically and compositionally expands into full-fledged analytic representations, exploiting simple algorithms of typed feature structures; its representation of various dimensions of semantic content is systematically integrated with morpho-syntactic and lexical representation; it is integrated with a ‘deep’ parsing grammar. Its compactness allows for efficient handling of large amounts of structures and data, and it is interoperable in covering multiple aspects of grammar and meaning. The code and its analytic expansions represent a cross-linguistically wide range of phenomena of languages and language structures. This paper presents its syntactic-semantic interoperability first from a theoretical point of view and then as applied in linguistic description.
People’s visual perception is very pronounced and therefore it is usually no problem for them to describe the space around them in words. Conversely, people also have no problems imagining a concept of a described space. In recent years many efforts have been made to develop a linguistic concept for spatial and spatial-temporal relations. However, the systems have not really caught on so far, which in our opinion is due to the complex models on which they are based and the lack of available training data and automated taggers. In this paper we describe a project to support spatial annotation, which could facilitate annotation through its many functions and also enrich it with much more information. This is to be achieved through an extension in the form of a VR environment, with which spatial relations can be better visualized and connected with real objects. We also want to use the available data to develop a new state-of-the-art tagger and thus lay the foundation for future systems such as improved text understanding for Text2Scene.
This paper proposes a semantics ABS for the model-theoretic interpretation of annotation structures. It provides a language, ABSr, that represents semantic forms in a (possibly 𝜆-free) type-theoretic first-order logic. For semantic compositionality, the representation language introduces two operators ⊕ and ⊘ with subtypes for the conjunctive or distributive composition of semantic forms. ABS also introduces a small set of logical predicates to represent semantic forms in a simplified format. The use of ABSr is illustrated with some annotation structures that conform to ISO 24617 standards on semantic annotation such as ISO-TimeML and ISO-Space.
This short research paper presents the results of a corpus-based metonymy annotation exercise on a sample of 101 Croatian verb entries – corresponding to 457 patterns and over 20,000 corpus lines – taken from CROATPAS (Marini & Ježek, 2019), a digital repository of verb argument structures manually annotated with Semantic Type labels on their argument slots following a methodology inspired by Corpus Pattern Analysis (Hanks, 2004 & 2013; Hanks & Pustejovsky, 2005). CROATPAS will be made available online in 2020. Semantic Type labelling is well-suited to annotating not only verbal polysemy, but also metonymic shifts in verb argument combinations, which in Generative Lexicon (Pustejovsky, 1995 & 1998; Pustejovsky & Ježek, 2008) are called Semantic Type coercions. From a sublexical point of view, Semantic Type coercions can be considered as exploitations of one of the qualia roles of those Semantic Types which do not satisfy a verb’s selectional requirements, but do not trigger a different verb sense. Overall, we were able to identify 62 different Semantic Type coercions linked to 1,052 metonymic corpus lines. In the future, we plan to compare our results with those from an equivalent study on Italian verbs (Romani, 2020) for a cross-linguistic analysis of metonymic shifts.
In this paper, we present the ForwardQuestions data set, made of human-generated questions related to knowledge triples. This data set results from the conversion and merger of the existing SimpleDBPediaQA and SimpleQuestionsWikidata data sets, including the mapping of predicates from DBPedia to Wikidata, and the selection of ‘forward’ questions as opposed to ‘backward’ ones. The new data set can be used to generate novel questions given an unseen Wikidata triple, by replacing the subjects of existing questions with the new one and then selecting the best candidate questions using semantic and syntactic criteria. Evaluation results indicate that the question generation method using ForwardQuestions improves the quality of questions by about 20% with respect to a baseline not using ranking criteria.
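The subject-replacement step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `TemplateQuestion`, `generate_candidates` and `rank_candidates`, the predicate labels and the sample questions are all hypothetical, and the ranking uses a simple frequency heuristic as a stand-in for the semantic and syntactic criteria, which the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateQuestion:
    """An existing 'forward' question and the triple parts it was written for."""
    text: str
    subject: str
    predicate: str


def generate_candidates(templates, new_subject, predicate):
    """Replace the old subject with the new one in questions sharing the predicate."""
    return [
        t.text.replace(t.subject, new_subject)
        for t in templates
        if t.predicate == predicate and t.subject in t.text
    ]


def rank_candidates(templates, new_subject, predicate):
    """Order candidates by how often their wording recurs (assumed stand-in criterion)."""
    counts = Counter(generate_candidates(templates, new_subject, predicate))
    # Most frequent wording first; ties are broken alphabetically for determinism.
    return sorted(counts, key=lambda q: (-counts[q], q))


# Hypothetical sample data, not taken from the ForwardQuestions data set.
TEMPLATES = [
    TemplateQuestion("where was barack obama born?", "barack obama", "place_of_birth"),
    TemplateQuestion("where was marie curie born?", "marie curie", "place_of_birth"),
    TemplateQuestion("what is the birthplace of alan turing?", "alan turing", "place_of_birth"),
    TemplateQuestion("who directed alien?", "alien", "director"),
]

RANKED = rank_candidates(TEMPLATES, "ada lovelace", "place_of_birth")
```

Here `RANKED` lists the "where was ... born?" wording first because two template questions produce it. A real system would replace the frequency count with scoring functions for semantic and syntactic fit.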
We present some issues in the development of the semantic annotation of IMAGACT, a multimodal and multilingual ontology of actions. The resource is structured on action concepts that are meant to be cognitive entities and to which a linguistic caption is attached. For each of these concepts, we annotate the minimal thematic structure of the caption and the possible argument alternations allowed. We present some insights on this process with regard to the notion of thematic structure and the relationship between action concepts and linguistic expressions. From the empirical evidence provided by the annotation, we discuss the very nature of thematic structure, arguing that it is neither a property of the verb itself nor a property of action concepts. We further show the relation between thematic structure and (1) the semantic variation of action verbs and (2) the lexical variation of action concepts.
Effective, professional and socially competent dialogue of health care providers with their patients is essential to best practice in medicine. To identify, categorize and quantify salient features of patient-provider communication, to model interactive processes in medical encounters and to design digital interactive medical services, two important instruments have been developed: (1) medical interaction analysis systems with the Roter Interaction Analysis System (RIAS) as the most widely used by medical practitioners and (2) dialogue act annotation schemes with ISO 24617-2 as a multidimensional taxonomy of interoperable semantic concepts widely used for corpus annotation and dialogue systems design. Neither instrument fits all purposes. In this paper, we perform a systematic comparative analysis of the categories defined in the RIAS and ISO taxonomies. Overcoming the deficiencies and gaps that were found, we propose a number of extensions to the ISO annotation scheme, making it a powerful analytical and modelling instrument for the analysis, modelling and assessment of medical communication.
In this paper, we provide the basic guidelines towards the detection and linguistic analysis of events in Kannada. Kannada is a morphologically rich, resource-poor Dravidian language spoken in southern India. As most information retrieval and extraction tasks are resource intensive, very little work has been done on Kannada NLP, with almost no efforts in discourse analysis and dataset creation for representing events or other semantic annotations in the text. In this paper, we linguistically analyze what constitutes an event in this language, as well as the challenges faced with discourse-level annotation and representation due to the rich derivational morphology of the language, which allows free word order, numerous multi-word expressions, adverbial participle constructions and constraints on subject-verb relations. This paper is thus one of the first attempts at a large-scale discourse-level annotation for Kannada, which can be used for semantic annotation and corpus development for other tasks in the language.
The purpose of this paper is to present a prospective and interdisciplinary research project seeking to ontologize knowledge of the domain of Outsider Art, that is, the art created outside the boundaries of official culture. The goal is to combine ontology engineering methodologies to develop a knowledge base which i) examines the relation between social exclusion and cultural productions, ii) standardizes the terminology of Outsider Art and iii) enables semantic interoperability between cultural metadata relevant to Outsider Art. The Outsider Art ontology will integrate some existing ontologies and terminologies, such as the CIDOC - Conceptual Reference Model (CRM), the Art & Architecture Thesaurus and the Getty Union List of Artist Names, among other resources. Natural Language Processing and Machine Learning techniques will be fundamental instruments for knowledge acquisition and elicitation. NLP techniques will be used to annotate bibliographies of relevant outsider artists and descriptions of outsider artworks with linguistic information. Machine Learning techniques will be leveraged to acquire knowledge from linguistic features embedded in both types of texts.
In this paper we focus on the creation of interoperable annotation resources that make up a significant proportion of an on-going project on the development of conceptually annotated multilingual corpora for the domain of terrorist attacks in three languages (English, French and Russian) that can be used for comparative linguistic research, intelligent content and trend analysis, summarization, machine translation, etc. Conceptual annotation is understood as a type of task-oriented domain-specific semantic annotation. The annotation process in our project relies on ontological analysis. The paper details the issues in developing both static and dynamic resources, such as a universal conceptual annotation scheme, a multilingual domain ontology and a multipurpose annotation platform with flexible settings, which can be used for the automation of the conceptual resource acquisition and of the annotation process, as well as for the documentation of the annotated corpora specificities. The resources constructed in the course of the research are also to be used for developing concept disambiguation metrics by means of qualitative and quantitative analysis of the golden portion of the conceptually annotated multilingual corpora and of the annotation platform linguistic knowledge.