We present Component ALignment for Abstract Meaning Representation (Calamr), a novel method for graph alignment that can support summarization and its evaluation. First, our method produces graphs that explain what is summarized through their alignments, which can be used to train graph-based summarization learners. Second, although numerous scoring methods have been proposed for abstract meaning representation (AMR) that evaluate semantic similarity, no AMR-based summarization metrics exist despite years of work using AMR for this task. Calamr provides alignments on which new scores can be based. The contributions of this work include a) a novel approach to aligning AMR graphs, b) new summarization-based scoring methods for the similarity of AMR subgraphs composed of one or more sentences, and c) the entire reusable source code to reproduce our results.
The education domain has been a popular area of collaboration with NLP researchers for decades. However, many recent breakthroughs, such as large transformer-based language models, have provided new opportunities for solving interesting but difficult problems. One such problem is assigning sentiment to reviews of educators’ performance. We present EduSenti: a corpus of 1,163 Albanian and 624 English reviews of educational instructors’ performance annotated for sentiment, emotion and educational topic. In this work, we experiment with fine-tuning several language models on the EduSenti corpus and then compare them with a model trained on Albanian with a masked language modeling objective, starting from the last XLM-RoBERTa checkpoint. We show promising baseline results, which include an F1 of 71.9 in Albanian and 73.8 in English. Our contributions are: (i) a sentiment analysis corpus in Albanian and English, (ii) a large Albanian corpus of crawled data useful for unsupervised training of language models, and (iii) the source code for our experiments.
Summarization of medical notes has been studied for decades, with hospital discharge summaries garnering recent interest in the research community. While methods for summarizing these notes have been the focus, there has been little work in understanding the feasibility of this task. We believe this effort is warranted given the notes’ length and complexity, and that they are often riddled with poorly formatted structured data and redundancy in copied and pasted text. In this work, we investigate the feasibility of the summarization task by finding the origin, or data provenance, of the discharge summary’s source text. To motivate an understanding of the data challenges of the summarization task, we present DSProv, a new dataset of 51 hospital admissions annotated by clinical informatics physicians. The dataset is analyzed for semantics and the extent of text copied from human-authored electronic health record (EHR) notes. We also present a novel unsupervised method of matching notes used in discharge summaries, and release our annotation dataset and source code to the community.
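As a rough illustration of what measuring "the extent of copied text" could look like, the sketch below computes the fraction of a summary's word trigrams that also appear in a source note. This is not the DSProv analysis or the matching method from the paper; the function names `ngrams` and `copied_fraction`, the whitespace tokenization, and the choice of trigrams are all assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_fraction(summary, note, n=3):
    """Fraction of the summary's distinct word n-grams that also occur in the note.

    Hypothetical helper: lowercases and splits on whitespace, which is far
    simpler than anything a clinical text pipeline would use.
    """
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        # Summary is shorter than n tokens, so there is nothing to compare.
        return 0.0
    note_grams = ngrams(note.lower().split(), n)
    return len(summary_grams & note_grams) / len(summary_grams)


# Toy strings invented for this example; they are not from any dataset.
note = "patient was admitted with chest pain and shortness of breath"
summary = "patient was admitted with chest pain and discharged home"
score = copied_fraction(summary, note)
```

Here five of the summary's seven trigrams occur in the note, so `score` is 5/7. A higher value suggests that more of the summary was lifted verbatim from the note.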
Given the criticality and difficulty of reproducing machine learning experiments, there have been significant efforts to reduce the variance of these results. The ability to consistently reproduce results effectively strengthens the underlying hypothesis of the work and should be regarded as being as important as the novel aspect of the research itself. The contribution of this work is an open source framework that a) facilitates reproducing consistent results, b) allows hot-swapping features and embeddings without further processing and re-vectorizing the dataset, c) provides a means of easily creating, training and evaluating natural language processing deep learning models with little to no code changes, and d) is freely available to the community.
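One common ingredient of consistent results is controlling every source of randomness. The sketch below shows the idea with the Python standard library only: a toy "experiment" that takes an explicit seed and returns the same value on every run with that seed. It does not use or represent the framework's API; `run_experiment` and its toy metric are hypothetical stand-ins for a real training run.

```python
import random


def run_experiment(seed):
    """Toy stand-in for a training run: shuffle data and draw an 80/20 split."""
    # A local generator keeps the run independent of global random state.
    rng = random.Random(seed)
    data = list(range(20))
    rng.shuffle(data)
    held_out = data[16:]
    # A pretend "metric" that depends on which items landed in the held-out split.
    return sum(held_out) / len(held_out)


# Two runs with the same seed produce the same split and the same metric.
first = run_experiment(seed=0)
second = run_experiment(seed=0)
```

Passing the seed in explicitly, rather than relying on global state, makes it part of the experiment's configuration, so it can be recorded alongside the results.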
The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are helpful to the reader when searching for information and contextualizing specific topics. The goal of this work is to segment the sections of clinical documentation in the medical domain. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from MIMIC-III. We include several baselines, source code, a pretrained model, and an analysis of the data showing a relationship between medical concepts across sections using principal component analysis.