Linguistic Annotation Workshop (2021)



pdf bib
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop
Claire Bonial | Nianwen Xue

pdf bib
Zero-shot cross-lingual Meaning Representation Transfer: Annotation of Hungarian using the Prague Functional Generative Description
Attila Novák | Borbála Novák | Csilla Novák

In this paper, we present the results of our experiments concerning the zero-shot cross-lingual performance of the PERIN sentence-to-graph semantic parser. We applied the PTG model trained using the PERIN parser on a 740k-token Czech newspaper corpus to Hungarian. We evaluated the performance of the parser using the official evaluation tool of the MRP 2020 shared task. The gold standard Hungarian annotation was created by manual correction of the output of the parser following the annotation manual of the tectogrammatical level of the Prague Dependency Treebank. An English model trained on a larger one-million-token English newspaper corpus is also available; however, we found that the Czech model performed significantly better on Hungarian input because Hungarian is typologically more similar to Czech than to English. We found that zero-shot transfer of the PTG meaning representation across typologically not-too-distant languages, using a neural parser model based on a multilingual contextual language model followed by manual correction by linguist experts, seems to be a viable scenario.

pdf bib
Theoretical and Practical Issues in the Semantic Annotation of Four Indigenous Languages
Jens E. L. Van Gysel | Meagan Vigus | Lukas Denk | Andrew Cowell | Rosa Vallejos | Tim O’Gorman | William Croft

Computational resources such as semantically annotated corpora can play an important role in enabling speakers of indigenous minority languages to participate in government, education, and other domains of public life in their own language. However, many languages – mainly those with small native speaker populations and without written traditions – have little to no digital support. One hurdle in creating such resources is that for many languages, few speakers would be capable of annotating texts – a task which requires literacy and some linguistic training – and that these experts’ time is typically in high demand for language planning work. This paper assesses whether typologically trained non-speakers of an indigenous language can feasibly perform semantic annotation using Uniform Meaning Representations, thus allowing for the creation of computational materials without putting further strain on community resources.

pdf bib
Representing Implicit Positive Meaning of Negated Statements in AMR
Katharina Stein | Lucia Donatelli

Abstract Meaning Representation (AMR) has become popular for representing the meaning of natural language in graph structures. However, AMR does not represent scope information, posing a problem for its overall expressivity and specifically for drawing inferences from negated statements. This is the case with so-called “positive interpretations” of negated statements, in which implicit positive meaning is identified by inferring the opposite of the negation’s focus. In this work, we investigate how potential positive interpretations (PPIs) can be represented in AMR. We propose a logically motivated AMR structure for PPIs that makes the focus of negation explicit and sketch an initial proposal for a systematic methodology to generate this more expressive structure.

pdf bib
AutoAspect: Automatic Annotation of Tense and Aspect for Uniform Meaning Representations
Daniel Chen | Martha Palmer | Meagan Vigus

We present AutoAspect, a novel, rule-based annotation tool for labeling tense and aspect. The pilot version annotates English data. The aspect labels are designed specifically for Uniform Meaning Representations (UMR), an annotation schema that aims to encode crosslingual semantic information. The annotation tool combines syntactic and semantic cues to assign aspects on a sentence-by-sentence basis, following a sequence of rules that each output a UMR aspect. Identified events proceed through the sequence until they are assigned an aspect. We achieve a recall of 76.17% for identifying UMR events and an accuracy of 62.57% on all identified events, with high precision values for 2 of the aspect labels.
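
The abstract describes a cascade in which each identified event passes through an ordered sequence of rules until one fires and assigns a UMR aspect label. A minimal sketch of that control flow, with hypothetical rules, cue features, and labels standing in for AutoAspect's actual rule set:

```python
# Illustrative sketch of a rule cascade for aspect labeling; the rules, cues,
# and labels below are hypothetical stand-ins, not AutoAspect's actual rule set.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Event:
    lemma: str
    features: dict  # syntactic/semantic cues gathered for the sentence

# Each rule inspects an event and either returns a UMR aspect label or None.
Rule = Callable[[Event], Optional[str]]

def stative_rule(ev: Event) -> Optional[str]:
    return "state" if ev.features.get("is_stative_verb") else None

def progressive_rule(ev: Event) -> Optional[str]:
    return "activity" if ev.features.get("progressive") else None

def completed_rule(ev: Event) -> Optional[str]:
    return "performance" if ev.features.get("past_tense") and ev.features.get("telic") else None

RULES = [stative_rule, progressive_rule, completed_rule]

def assign_aspect(event: Event, default: str = "process") -> str:
    """Run the event through the ordered rule sequence; the first match wins."""
    for rule in RULES:
        label = rule(event)
        if label is not None:
            return label
    return default

print(assign_aspect(Event("run", {"progressive": True})))  # -> activity
```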

pdf bib
Can predicate-argument relationships be extracted from UD trees?
Adam Ek | Jean-Philippe Bernardy | Stergios Chatzikyriakidis

In this paper we investigate the possibility of extracting predicate-argument relations from UD trees (and enhanced UD graphs). Concretely, we apply UD parsers on an English question answering/semantic-role labeling data set (FitzGerald et al., 2018) and check if the annotations reflect the relations in the resulting parse trees, using a small number of rules to extract this information. We find that 79.1% of the argument-predicate pairs can be found in this way, on the basis of Udify (Kondratyuk and Straka, 2019). Error analysis reveals that half of the error cases are attributable to shortcomings in the dataset. The remaining errors are mostly due to predicate-argument relations not being extractible algorithmically from the UD trees (requiring semantic reasoning to be resolved). The parser itself is only responsible for a small portion of errors. Our analysis suggests a number of improvements to the UD annotation schema: we propose to enhance the schema in four ways, in order to capture argument-predicate relations. Additionally, we propose improvements regarding data collection for question answering/semantic-role labeling data.
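
The extraction step described here (a small number of rules that read argument-predicate pairs off a UD tree) can be pictured with a short sketch over CoNLL-U input; the relation set and the use of the conllu library are our assumptions, not the paper's exact rules:

```python
# Hypothetical sketch: read argument-predicate pairs off a UD tree with a few
# dependency-relation rules. Not the paper's rule set; illustration only.
from conllu import parse  # pip install conllu

ARG_RELS = {"nsubj", "obj", "iobj", "obl", "ccomp", "xcomp"}  # assumed rule set

CONLLU = """\
# text = The cat chased a mouse.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tchased\tchase\tVERB\t_\t_\t0\troot\t_\t_
4\ta\ta\tDET\t_\t_\t5\tdet\t_\t_
5\tmouse\tmouse\tNOUN\t_\t_\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

def predicate_argument_pairs(conllu_text):
    pairs = []
    for sentence in parse(conllu_text):
        by_id = {tok["id"]: tok for tok in sentence}
        for tok in sentence:
            if tok["deprel"] in ARG_RELS and tok["head"] in by_id:
                head = by_id[tok["head"]]
                if head["upos"] == "VERB":
                    pairs.append((head["form"], tok["deprel"], tok["form"]))
    return pairs

print(predicate_argument_pairs(CONLLU))
# -> [('chased', 'nsubj', 'cat'), ('chased', 'obj', 'mouse')]
```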

pdf bib
Classifying Divergences in Cross-lingual AMR Pairs
Shira Wein | Nathan Schneider

Translation divergences are varied and widespread, challenging approaches that rely on parallel text. To annotate translation divergences, we propose a schema grounded in the Abstract Meaning Representation (AMR), a sentence-level semantic framework instantiated for a number of languages. By comparing parallel AMR graphs, we can identify specific points of divergence. Each divergence is labeled with both a type and a cause. We release a small corpus of annotated English-Spanish data, and analyze the annotations in our corpus.
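
One rough way to picture the comparison of parallel AMR graphs is to diff their concept-level role triples; the example sentence pair and the comparison logic below are ours, not the paper's annotation schema:

```python
# Rough sketch: surface candidate divergence points between parallel AMRs by
# diffing concept-level role triples. Illustration only, not the paper's schema.
import penman  # pip install penman

# "John is hungry" vs. a Spanish-style "tener hambre" ("to have hunger") rendering,
# both written with English-style concept labels for readability.
en = penman.decode("(h / hungry-01 :ARG1 (p / person))")
es = penman.decode("(t / have-03 :ARG0 (p / person) :ARG1 (hu / hunger))")

def role_triples(graph):
    # Map variables to concepts, then keep (source-concept, role, target-concept).
    concepts = {src: tgt for src, role, tgt in graph.triples if role == ":instance"}
    return {(concepts[s], role, concepts.get(t, t))
            for s, role, t in graph.triples if role != ":instance"}

print("English-only:", role_triples(en) - role_triples(es))
print("Spanish-only:", role_triples(es) - role_triples(en))
# The asymmetric triples flag a lexicalization divergence (adjective vs. have-NP).
```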

pdf bib
A Linguistic Annotation Framework to Study Interactions in Multilingual Healthcare Conversational Forums
Ishani Mondal | Kalika Bali | Mohit Jain | Monojit Choudhury | Ashish Sharma | Evans Gitau | Jacki O’Neill | Kagonya Awori | Sarah Gitau

In recent years, remote digital healthcare using online chats has gained momentum, especially in the Global South. Though prior work has studied interaction patterns in online (health) forums, such as TalkLife, Reddit and Facebook, there has been limited work on understanding interactions in small, close-knit communities on instant messengers. In this paper, we propose a linguistic annotation framework to facilitate analysis of health-focused WhatsApp groups. The primary aim of the framework is to understand interpersonal relationships among peer supporters in order to help develop NLP solutions for remote patient care and reduce the burden on overworked healthcare providers. Our framework consists of fine-grained peer support categorization and message-level sentiment tagging. Additionally, due to the prevalence of code-mixing in such groups, we incorporate word-level language annotations. We use the proposed framework to study two WhatsApp groups in Kenya for youth living with HIV, facilitated by a healthcare provider.

pdf bib
Sister Help: Data Augmentation for Frame-Semantic Role Labeling
Ayush Pancholy | Miriam R L Petruck | Swabha Swayamdipta

While FrameNet is widely regarded as a rich resource of semantics in natural language processing, a major criticism concerns its lack of coverage and the relative paucity of its labeled data compared to other commonly used lexical resources such as PropBank and VerbNet. This paper reports on a pilot study to address these gaps. We propose a data augmentation approach, which uses existing frame-specific annotation to automatically annotate other lexical units of the same frame which are unannotated. Our rule-based approach defines the notion of a sister lexical unit and generates frame-specific augmented data for training. We present experiments on frame-semantic role labeling which demonstrate the importance of this data augmentation: we obtain a large improvement to prior results on frame identification and argument identification for FrameNet, utilizing both full-text and lexicographic annotations under FrameNet. Our findings on data augmentation highlight the value of automatic resource creation for improved models in frame-semantic parsing.
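
The augmentation idea, reusing a lexical unit's annotated sentences as training data for an unannotated sister lexical unit of the same frame, can be sketched as a simple substitution over annotated examples; the data structures and the toy frame lexicon below are invented for illustration:

```python
# Hypothetical sketch of sister-LU data augmentation: copy an annotated sentence
# of one lexical unit, swapping in another LU of the same frame.
from dataclasses import dataclass

@dataclass
class Example:
    frame: str
    target_lu: str      # the frame-evoking lexical unit (lemma)
    tokens: list        # sentence tokens
    target_index: int   # position of the frame-evoking word
    role_spans: dict    # frame element name -> (start, end) token span

FRAME_LUS = {  # toy frame lexicon; real sister LUs would come from FrameNet
    "Self_motion": ["walk", "stroll", "amble"],
}

def augment_with_sisters(example: Example) -> list:
    sisters = [lu for lu in FRAME_LUS.get(example.frame, []) if lu != example.target_lu]
    augmented = []
    for lu in sisters:
        tokens = list(example.tokens)
        tokens[example.target_index] = lu  # naive surface substitution
        augmented.append(Example(example.frame, lu, tokens,
                                 example.target_index, dict(example.role_spans)))
    return augmented

ex = Example("Self_motion", "walk", ["She", "walk", "to", "school"], 1,
             {"Self_mover": (0, 0), "Goal": (2, 3)})
for aug in augment_with_sisters(ex):
    print(aug.target_lu, aug.tokens)
```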

pdf bib
A Corpus Study of Creating Rule-Based Enhanced Universal Dependencies for German
Teresa Bürkle | Stefan Grünewald | Annemarie Friedrich

In this paper, we present a first attempt at enriching German Universal Dependencies (UD) treebanks with enhanced dependencies. Similarly to the converter for English (Schuster and Manning, 2016), we develop a rule-based system for deriving enhanced dependencies from the basic layer, covering three linguistic phenomena: relative clauses, coordination, and raising/control. For quality control, we manually correct or validate a set of 196 sentences, finding that around 90% of added relations are correct. Our data analysis reveals that difficulties arise mainly due to inconsistencies in the basic layer annotations. We show that the English system is in general applicable to German data, but that adapting to the particularities of the German treebanks and language increases precision and recall by up to 10%. Comparing the application of our converter on gold standard dependencies vs. automatic parses, we find that F1 drops by around 10% in the latter setting due to error propagation. Finally, an enhanced UD parser trained on a converted treebank performs poorly when evaluated against our annotations, indicating that more work remains to be done to create gold standard enhanced German treebanks.
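
One of the covered phenomena, relative clauses, gives a feel for what such a rule does: the antecedent of the relative pronoun is added as an argument of the relative-clause verb, and the pronoun receives a ref edge. A toy sketch of that idea, not the authors' converter, ignoring case and morphology:

```python
# Toy sketch of one enhanced-UD rule (relative clauses). Simplified illustration,
# not the authors' rule-based converter.

def enhance_relative_clauses(tokens):
    """tokens: list of dicts with id, head, deprel, feats. Returns extra edges
    as (head, relation, dependent) triples to add to the enhanced graph."""
    extra = []
    for tok in tokens:
        if tok["deprel"] == "acl:relcl":
            antecedent, verb = tok["head"], tok["id"]
            for dep in tokens:
                if dep["head"] == verb and "Rel" in dep.get("feats", {}).get("PronType", ""):
                    # The antecedent inherits the pronoun's grammatical function.
                    extra.append((verb, dep["deprel"], antecedent))
                    # The pronoun points back to its antecedent.
                    extra.append((antecedent, "ref", dep["id"]))
    return extra

# "der Mann, der lacht" ~ "the man who laughs"
toks = [
    {"id": 1, "head": 2, "deprel": "det", "feats": {}},
    {"id": 2, "head": 0, "deprel": "root", "feats": {}},
    {"id": 3, "head": 4, "deprel": "nsubj", "feats": {"PronType": "Rel"}},
    {"id": 4, "head": 2, "deprel": "acl:relcl", "feats": {}},
]
print(enhance_relative_clauses(toks))
# -> [(4, 'nsubj', 2), (2, 'ref', 3)]
```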

pdf bib
Subcategorizing Adverbials in Universal Conceptual Cognitive Annotation
Zhuxin Wang | Jakob Prange | Nathan Schneider

Universal Conceptual Cognitive Annotation (UCCA) is a semantic annotation scheme that organizes texts into coarse predicate-argument structure, offering broad coverage of semantic phenomena. At the same time, there is still need for a finer-grained treatment of many of the categories. The Adverbial category is of special interest, as it covers a wide range of fundamentally different meanings such as negation, causation, aspect, and event quantification. In this paper we introduce a refinement annotation scheme for UCCA’s Adverbial category, showing that UCCA Adverbials can indeed be subcategorized into at least 7 semantic types, and doing so can help clarify and disambiguate the otherwise coarse-grained labels. We provide a preliminary set of annotation guidelines, as well as pilot annotation experiments with high inter-annotator agreement, confirming the validity of the scheme.

pdf bib
Simplifying annotation of intersections in time normalization annotation: exploring syntactic and semantic validation
Peiwen Su | Steven Bethard

While annotating normalized times in food security documents, we found that the semantically compositional annotation for time normalization (SCATE) scheme required several near-duplicate annotations to get the correct semantics for expressions like Nov. 7th to 11th 2021. To reduce this problem, we explored replacing SCATE’s Sub-Interval property with a Super-Interval property, that is, making the smallest units (e.g., 7th and 11th) rather than the largest units (e.g., 2021) the heads of the intersection chains. To ensure that the semantics of annotated time intervals remained unaltered despite our changes to the syntax of the annotation scheme, we applied several different techniques to validate our changes. These validation techniques detected and allowed us to resolve several important bugs in our automated translation from Sub-Interval to Super-Interval syntax.
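
The flip is easiest to see on the authors' own example, Nov. 7th to 11th 2021: with Sub-Interval links headed by the largest unit, the month and year must be annotated once per day in the range, whereas Super-Interval links headed by the smallest units let both days share a single month and year. A schematic sketch with simplified attribute names, not the SCATE XML format itself:

```python
# Schematic contrast of the two linking directions for "Nov. 7th to 11th 2021".
# Attribute names are simplified; this is not the SCATE annotation format itself.

# Sub-Interval style: the largest unit heads the chain, so the month and year
# annotations are duplicated, once for each day in the range.
sub_interval = [
    {"id": "y1", "type": "Year", "value": 2021, "Sub-Interval": "m1"},
    {"id": "m1", "type": "Month-Of-Year", "value": "November", "Sub-Interval": "d7"},
    {"id": "d7", "type": "Day-Of-Month", "value": 7},
    {"id": "y2", "type": "Year", "value": 2021, "Sub-Interval": "m2"},   # near-duplicate
    {"id": "m2", "type": "Month-Of-Year", "value": "November", "Sub-Interval": "d11"},
    {"id": "d11", "type": "Day-Of-Month", "value": 11},
]

# Super-Interval style: the smallest units head the chains, so both days point
# at the same month, which points at the same year -- no duplication.
super_interval = [
    {"id": "d7", "type": "Day-Of-Month", "value": 7, "Super-Interval": "m1"},
    {"id": "d11", "type": "Day-Of-Month", "value": 11, "Super-Interval": "m1"},
    {"id": "m1", "type": "Month-Of-Year", "value": "November", "Super-Interval": "y1"},
    {"id": "y1", "type": "Year", "value": 2021},
]

print(len(sub_interval), "annotations vs.", len(super_interval))
```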

pdf bib
Overcoming the challenges in morphological annotation of Turkish in universal dependencies framework
Talha Bedir | Karahan Şahin | Onur Gungor | Suzan Uskudarli | Arzucan Özgür | Tunga Güngör | Balkiz Ozturk Basaran

This paper presents several challenges faced when annotating Turkish treebanks in accordance with the Universal Dependencies (UD) guidelines and proposes solutions to address them. Most of these challenges stem from the lack of adequate support in the UD framework to accurately represent null morphemes and complex derivations, which results in a significant loss of information for Turkish. This loss negatively impacts the tools that are developed based on these treebanks. We raised and discussed these issues within the community on the official UD portal. This paper presents these issues and our proposals to more accurately represent morphosyntactic information for Turkish while adhering to the UD guidelines. This work aims to contribute to the representation of Turkish and other agglutinative languages in UD-based treebanks, which in turn aids the development of more accurately annotated datasets for such languages.

pdf bib
Automatic Entity State Annotation using the VerbNet Semantic Parser
Ghazaleh Kazeminejad | Martha Palmer | Tao Li | Vivek Srikumar

Tracking entity states is a natural language processing task assumed to require human annotation. In order to reduce the time and expenses associated with annotation, we introduce a new method to automatically extract entity states, including location and existence state of entities, following Dalvi et al. (2018) and Tandon et al. (2020). For this purpose, we rely primarily on the semantic representations generated by the state-of-the-art VerbNet parser (Gung, 2020), and extract the entities (event participants) and their states from the semantic predicates of the generated VerbNet semantic representation, which is in propositional-logic format. For evaluation, we used ProPara (Dalvi et al., 2018), a reading comprehension dataset which is annotated with entity states in each sentence and tracks those states across paragraphs of natural human-authored procedural texts. Given the presented limitations of the method, the peculiarities of the ProPara dataset annotations, and the fact that our system, Lexis, makes no use of task-specific training data and relies solely on VerbNet, the results are promising, showcasing the value of lexical resources.
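
The extraction can be imagined as pattern matching over the parser's propositional-logic predicates: location and existence predicates name the participant whose state changes. A hypothetical sketch in which the predicate names mirror VerbNet-style semantics but the input format is invented:

```python
# Hypothetical sketch: read entity location/existence states off VerbNet-style
# semantic predicates. The input format below is invented for illustration.

def extract_entity_states(predicates):
    """predicates: list of (name, args) tuples from a parsed semantic form."""
    states = {}
    for name, args in predicates:
        if name == "has_location":
            entity, location = args[-2], args[-1]
            states.setdefault(entity, {})["location"] = location
        elif name == "exist":
            states.setdefault(args[-1], {})["exists"] = True
        elif name in ("destroyed", "disappear"):
            states.setdefault(args[-1], {})["exists"] = False
    return states

# "The water moved to the leaf."
parsed = [
    ("motion", ("e1", "water")),
    ("has_location", ("e2", "water", "leaf")),
]
print(extract_entity_states(parsed))
# -> {'water': {'location': 'leaf'}}
```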

pdf bib
On Releasing Annotator-Level Labels and Information in Datasets
Vinodkumar Prabhakaran | Aida Mostafazadeh Davani | Mark Diaz

A common practice in building NLP datasets, especially using crowd-sourced annotations, involves obtaining multiple annotator judgements on the same data instances, which are then flattened to produce a single “ground truth” label or score, through majority voting, averaging, or adjudication. While these approaches may be appropriate in certain annotation tasks, such aggregations overlook the socially constructed nature of human perceptions that annotations for relatively more subjective tasks are meant to capture. In particular, systematic disagreements between annotators owing to their socio-cultural backgrounds and/or lived experiences are often obfuscated through such aggregations. In this paper, we empirically demonstrate that label aggregation may introduce representational biases of individual and group perspectives. Based on this finding, we propose a set of recommendations for increased utility and transparency of datasets for downstream use cases.
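
The aggregation step being criticized is easy to state concretely: majority voting collapses the annotator dimension entirely. A tiny sketch, on invented data, of what is lost:

```python
# Tiny illustration (invented data): majority voting hides a systematic
# disagreement that remains visible when annotator-level labels are released.
from collections import Counter

# item -> {annotator: label} for a subjective task (e.g., "is this offensive?")
raw = {
    "post_1": {"a1": 1, "a2": 1, "a3": 0, "a4": 0, "a5": 0},
    "post_2": {"a1": 1, "a2": 1, "a3": 0, "a4": 0, "a5": 0},
}

aggregated = {item: Counter(v.values()).most_common(1)[0][0] for item, v in raw.items()}
print(aggregated)  # {'post_1': 0, 'post_2': 0} -- a1/a2's perspective vanishes

# With annotator-level labels we can still ask per-annotator questions:
per_annotator = {a: [raw[item][a] for item in raw] for a in raw["post_1"]}
print(per_annotator)  # a1/a2 consistently label 1; a3-a5 consistently label 0
```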

pdf bib
Increasing Sentence-Level Comprehension Through Text Classification of Epistemic Functions
Maria Berger | Elizabeth Goldstein

Word embeddings capture the semantic meaning of individual words. How to bridge word-level linguistic knowledge with sentence-level language representation remains an open problem. This paper examines whether sentence-level representations can be achieved by building a custom sentence database focusing on one aspect of a sentence’s meaning. Our three separate semantic aspects are whether the sentence: (1) communicates a causal relationship, (2) indicates that two things are correlated with each other, and (3) expresses information or knowledge. The three classifiers provide epistemic information about a sentence’s content.
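
A minimal sketch of the kind of sentence-level classifier described, one binary classifier per epistemic function trained on a custom sentence database, using scikit-learn with invented training sentences (the paper's actual features, model, and corpus differ):

```python
# Minimal sketch, assuming a scikit-learn pipeline and invented training data;
# this stands in for one of the three epistemic-function classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples for the function "communicates a causal relationship".
sentences = [
    "Smoking causes lung cancer.",
    "The drought led to widespread crop failure.",
    "Ice cream sales and drownings rise together in summer.",
    "The museum opens at nine on weekdays.",
]
is_causal = [1, 1, 0, 0]

causal_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
causal_clf.fit(sentences, is_causal)

print(causal_clf.predict(["Heavy rain caused the bridge to collapse."]))
```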

pdf bib
Towards a Methodology Supporting Semiautomatic Annotation of Head Movements in Video-recorded Conversations
Patrizia Paggio | Costanza Navarretta | Bart Jongejan | Manex Agirrezabal

We present a method to support the annotation of head movements in video-recorded conversations. Head movement segments from annotated multimodal data are used to train a model to detect head movements in unseen data. The resulting predicted movement sequences are uploaded to the ANVIL tool for post-annotation editing. The automatically identified head movements and the original annotations are compared to assess the overlap between the two. This analysis showed that movement onsets were more easily detected than offsets, and pointed at a number of patterns in the mismatches between original annotations and model predictions that could be dealt with in general terms in post-annotation guidelines.

pdf bib
Intensionalizing Abstract Meaning Representations: Non-Veridicality and Scope
Gregor Williamson | Patrick Elliott | Yuxin Ji

Abstract Meaning Representation (AMR) is a graphical meaning representation language designed to represent propositional information about argument structure. However, at present it is unable to satisfyingly represent non-veridical intensional contexts, often licensing inappropriate inferences. In this paper, we show how to resolve the problem of non-veridicality without appealing to layered graphs through a mapping from AMRs into Simply-Typed Lambda Calculus (STLC). At least for some cases, this requires the introduction of a new role :content which functions as an intensional operator. The translation proposed is inspired by the formal linguistics literature on the event semantics of attitude reports. Next, we address the interaction of quantifier scope and intensional operators in so-called de re/de dicto ambiguities. We adopt a scope node from the literature and provide an explicit multidimensional semantics utilizing Cooper storage which allows us to derive the de re and de dicto scope readings as well as intermediate scope readings which prove difficult for accounts without a scope node.
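
The de re/de dicto contrast can be glossed with a standard event-semantic paraphrase of an attitude report such as "Kim believes a spy left"; the formulas below are our simplification, not the paper's exact STLC translation:

```latex
% Hedged gloss (not the paper's exact STLC output) of the two readings of
% "Kim believes a spy left", with content treated as an intensional operator.

% De dicto: the existential quantifier stays inside the content of the belief.
\exists e\,[\mathit{believe}(e) \wedge \mathrm{Exp}(e, \mathit{kim}) \wedge
  \mathrm{content}(e) = \lambda w.\,\exists x\,[\mathit{spy}_w(x) \wedge \mathit{left}_w(x)]]

% De re: the quantifier scopes out of the content; the spy is evaluated at the
% actual world w_0 and only "left" is part of what Kim believes.
\exists x\,[\mathit{spy}_{w_0}(x) \wedge \exists e\,[\mathit{believe}(e) \wedge
  \mathrm{Exp}(e, \mathit{kim}) \wedge \mathrm{content}(e) = \lambda w.\,\mathit{left}_w(x)]]
```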

pdf bib
WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres
Jessica Lin | Amir Zeldes

Previous work on Entity Linking has focused on resources targeting non-nested proper named entity mentions, often in data from Wikipedia, i.e. Wikification. In this paper, we present and evaluate WikiGUM, a fully wikified dataset, covering all mentions of named entities, including their non-named and pronominal mentions, as well as mentions nested within other mentions. The dataset covers a broad range of 12 written and spoken genres, most of which have not been included in Entity Linking efforts to date, leading to poor performance by a pretrained SOTA system in our evaluation. The availability of a variety of other annotations for the same data also enables further research on entities in context.