Proceedings of the 14th Linguistic Annotation Workshop
Human annotations are an important source of information in the development of natural language understanding approaches. As under the pressure of productivity annotators can assign different labels to a given text, the quality of produced annotations frequently varies. This is especially the case if decisions are difficult, with high cognitive load, requires awareness of broader context, or careful consideration of background knowledge. To alleviate the problem, we propose two semi-supervised methods to guide the annotation process: a Bayesian deep learning model and a Bayesian ensemble method. Using a Bayesian deep learning method, we can discover annotations that cannot be trusted and might require reannotation. A recently proposed Bayesian ensemble method helps us to combine the annotators’ labels with predictions of trained models. According to the results obtained from three hate speech detection experiments, the proposed Bayesian methods can improve the annotations and prediction performance of BERT models.
Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annota-tions might have evolved into different versions, making it challenging for researchers to know the data’s provenance. This paper addresses this issue with a case study on event annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be reliably linked from the start, and thereby be accessed and queried as if they were a single dataset. We describe how such nanopublications can be created and demonstrate how SPARQL queries can be performed to extract interesting content from the new representations. The queries show that information of multiple corpora can be retrieved more easily and effectively because the information of different corpora is represented in a uniform data format.
This paper investigates the semantic prosody of three causal connectives: due to, owing to and because of in seven varieties of the English language. While research in the domain of English causality exists, we are not aware of studies that would cover the domain of causal connectives in English. Our claim is that connectives such as because of link two arguments, (at least) one of which will include a phrase that contributes to the interpretation of the relation as positive or negative, and hence define the prosody of the connective used. As our results demonstrate, the majority of the prosodies identified are negative for all three connectives; the proportions are stable across the varieties of English studied, and contrary to our expectations, we find no significant differences between the functions of the connectives and discourse preferences. Further, we investigate whether automatizing the sentiment annotation procedure via a simple language-model based classifier is possible. The initial results highlights the complexity of the task and the need for complicated systems, probably aided with other related datasets to achieve reasonable performance.
Humor research is a multifaceted field that has led to a better understanding of humor’s psychological effects and the development of different theories of humor. This paper’s main objective is to develop a hierarchical schema for a fine-grained annotation of Conversational Humor. Based on the Benign Violation Theory, the benignity or non-benignity of the interlocutor’s intentions is included within the framework. Under the categories mentioned above, in addition to different types of humor, the techniques utilized by these types are identified. Furthermore, a prominent play from Telugu, Kanyasulkam, is annotated to substantiate the work across cultures at multiple levels. The inter-annotator agreement is calculated to assess the accuracy and validity of the dataset. An in-depth analysis of the disagreement is performed to understand the subjectivity of humor better.
Most annotation efforts assume that annotators will agree on labels, if the annotation categories are well-defined and documented in annotation guidelines. However, this is not always true. For instance, content-related questions such as ‘Is this sentence about topic X?’ are unlikely to elicit the same answer from all annotators. Additional specifications in the guidelines are helpful to some extent, but can soon get overspecified by rules that cannot be justified by a research question. In this study, we model the semantic category ‘illness’ and its use in a gradual way. For this purpose, we (i) ask many annotators (30 votes per item, 960 items) for their opinion in a crowdsourcing experiment, (ii) ask annotators to indicate their certainty with respect to their annotation, and (iii) compare this across two different text types. We show that results of multiple annotations and average annotator certainty correlate, but many ambiguities can only be captured if several people contribute. The annotated data allow us to filter for sentences with high or low agreement and analyze causes of disagreement, thus getting a better understanding of people’s perception of illness—as an example of a semantic category—as well as of the content of our annotated texts.
The development of linguistic corpora is fraught with various problems of annotation and representation. These constitute a very real challenge for the development and use of annotated corpora, but as yet not much literature exists on how to address the underlying problems. In this paper, we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias. We outline and characterize these sources, discussing how their improper treatment can have stark consequences for research outcomes. Finally, we discuss how an adequate treatment can inform corpus-related linguistic research, both computational and theoretical, improving the reliability of research results and NLP models, as well as informing the more general reproducibility issue.
Generating expert ground truth annotations of documents can be a very expensive process. However, such annotations are essential for training domain-specific keyphrase extraction models, especially when utilizing data-intensive deep learning models in unique domains such as real-estate. Therefore, it is critical to optimize the manual annotation process to maximize the quality of the annotations while minimizing the cost of manual labor. To address this need, we explore multiple annotation strategies including self-review and peer-review as well as various methods of resolving annotator disagreements. We evaluate these annotation strategies with respect to their cost and on the task of learning keyphrase extraction models applied with an experimental dataset in the real-estate domain. The results demonstrate that different annotation strategies should be considered depending on specific metrics such as precision and recall.
It has become increasingly common for people to share cooking recipes on the Internet. Along with the increase in the number of shared recipes, there have been corresponding increases in recipe-related studies and datasets. However, there are still few datasets that provide linguistic annotations for the recipe-related studies even though such annotations should form the basis of the studies. This paper introduces a novel recipe-related dataset, named Cookpad Parsed Corpus, which contains linguistic annotations for Japanese recipes. We randomly extracted 500 recipes from the largest recipe-related dataset, the Cookpad Recipe Dataset, and annotated 4; 738 sentences in the recipes with morphemes, named entities, and dependency relations. This paper also reports benchmark results on our corpus for Japanese morphological analysis, named entity recognition, and dependency parsing. We show that there is still room for improvement in the analyses of recipes.
This paper reports on the harvesting, analysis, and enrichment of 20k documents from 4 different endangered language archives in 300 different low-resource languages. The documents are heterogeneous as to their provenance (holding archive, language, geographical area, creator) and internal structure (annotation types, metalanguages), but they have the ELAN-XML format in common. Typical annotations include sentence-level translations, morpheme-segmentation, morpheme-level translations, and parts-of-speech. The ELAN-format gives a lot of freedom to document creators, and hence the data set is very heterogeneous. We use regularities in the ELAN format to arrive at a common internal representation of sentences, words, and morphemes, with translations into one or more additional languages. Building upon the paradigm of Linguistic Linked Open Data (LLOD, Chiarcos, Nordhoff, et al. 2012), the document elements receive unique identifiers and are linked to other resources such as Glottolog for languages, Wikidata for semantic concepts, and the Leipzig Glossing Rules list for category abbreviations. We provide an RDF export in the LIGT format (Chiarcos & Ionov 2019), enabling uniform and interoperable access with some semantic enrichments to a formerly disparate resource type difficult to access. Two use cases (semantic search and colexification) are presented to show the viability of the approach.
We present the Prepositions Annotated with Supsersense Tags in Reddit International English (“PASTRIE”) corpus, a new dataset containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish. The annotations are comprehensive, covering all preposition types and tokens in the sample. Along with the corpus, we provide analysis of distributional patterns across the included L1s and a discussion of the influence of L1s on L2 preposition choice.
Prepositional supersense annotation is time-consuming and requires expert training. Here, we present two sensible methods for obtaining prepositional supersense annotations indirectly by eliciting surface substitution and similarity judgments. Four pilot studies suggest that both methods have potential for producing prepositional supersense annotations that are comparable in quality to expert annotations.
We reevaluate an existing adpositional annotation scheme with respect to two thorny semantic domains: accompaniment and purpose. ‘Accompaniment’ broadly speaking includes two entities situated together or participating in the same event, while ‘purpose’ broadly speaking covers the desired outcome of an action, the intended use or evaluated use of an entity, and more. We argue the policy in the SNACS scheme for English should be recalibrated with respect to these clusters of interrelated meanings without adding complexity to the overall scheme. Our analysis highlights tradeoffs in lumping vs. splitting decisions as well as the flexibility afforded by the construal analysis.
Multi-sentence questions (MSQs) are sequences of questions connected by relations which, unlike sequences of standalone questions, need to be answered as a unit. Following Rhetorical Structure Theory (RST), we recognise that different “question discourse relations” between the subparts of MSQs reflect different speaker intents, and consequently elicit different answering strategies. Correctly identifying these relations is therefore a crucial step in automatically answering MSQs. We identify five different types of MSQs in English, and define five novel relations to describe them. We extract over 162,000 MSQs from Stack Exchange to enable future research. Finally, we implement a high-precision baseline classifier based on surface features.
This paper describes a novel annotation scheme specifically designed for a customer-service context where written interactions take place between a given user and the chatbot of an Italian telecommunication company. More specifically, the scheme aims to detect and highlight two aspects: the presence of errors in the conversation on both sides (i.e. customer and chatbot) and the “emotional load” of the conversation. This can be inferred from the presence of emotions of some kind (especially negative ones) in the customer messages, and from the possible empathic responses provided by the agent. The dataset annotated according to this scheme is currently used to develop the prototype of a rule-based Natural Language Generation system aimed at improving the chatbot responses and the customer experience overall.
We propose a new method for annotating verbal fluency data, which allows the reliable detection of the age-related decline of lexical access capacity. The main innovation is that annotators should inferentially assess the intention of the speaker when producing a word form during a verbal fluency test. Our method correlates probable speaker inten-tions such as “intended as a valid answer” or “intended as a meta-comment” with lin-guistic features such as word intensity (e.g. reduced intensity suggests private speech) and syntactic integration. The annotation scheme can be implemented with high reliabil-ity, and minimal linguistic training. When fluency data are annotated using this scheme, a relation between fluency and age emerges; this is in contrast to a strict implementation of the traditional method of annotating verbal fluency data, which has no way of deal-ing with score-confounding phenomena because it force-groups all verbal fluency pro-ductions –regardless of speaker intention— into one of three taxonomic groups (i.e. val-id answers, perseverations, and intrusions). The traditional lack of fine-grained annota-tion units is especially problematic when analyzing the qualitatively distinct fluency da-ta of older participants and may cause studies to miss the relation between lexical access capacity and age.
pyMMAX2 is an API for processing MMAX2 stand-off annotation data in Python. It provides a lightweight basis for the development of code which opens up the Java- and XML-based ecosystem of MMAX2 for more recent, Python-based NLP and data science methods. While pyMMAX2 is pure Python, and most functionality is implemented from scratch, the API re-uses the complex implementation of the essential business logic for MMAX2 annotation schemes by interfacing with the original MMAX2 Java libraries. pyMMAX2 is available for download at http://github.com/nlpAThits/pyMMAX2.
This study develops the strand of research on topic transitions in social talk which aims to gain a better understanding of interlocutors’ conversational goals. Lưu and Malamud (2020) proposed that one way to identify such transitions is to annotate coherence relations, and then to identify utterances potentially expressing new topics as those that fail to participate in these relations. This work validates and refines their suggested annotation methodology, focusing on annotating most prominent coherence relations in face-to-face social dialogue. The result is a publicly accessible gold standard corpus with efficient and reliable annotation, whose broad coverage provides a foundation for future steps of identifying and classifying new topic utterances.