Goran Nenadic

Also published as: Goran Nenadić


Enriching the Metadata of Community-Generated Digital Content through Entity Linking: An Evaluative Comparison of State-of-the-Art Models
Youcef Benkhedda | Adrians Skapars | Viktor Schlegel | Goran Nenadic | Riza Batista-Navarro
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

Digital archive collections that have been contributed by communities, known as community-generated digital content (CGDC), are important sources of historical and cultural knowledge. However, CGDC items are not easily searchable due to semantic information being obscured within their textual metadata. In this paper, we investigate the extent to which state-of-the-art, general-domain entity linking (EL) models (i.e., BLINK, EPGEL and mGENRE) can map named entities mentioned in CGDC textual metadata, to Wikidata entities. We evaluate and compare their performance on an annotated dataset of CGDC textual metadata and provide some error analysis, in the way of informing future studies aimed at enriching CGDC metadata using entity linking methods.


Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce
Serge Gladkoff | Lifeng Han | Goran Nenadic
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

In natural language processing (NLP) we always rely on human judgement as the golden quality evaluation method. However, there has been an ongoing debate on how to better evaluate inter-rater reliability (IRR) levels for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In this work, we first introduce the study on how to estimate the confidence interval for the measurement value when only one data (evaluation) point is available. Then, this leads to our example with two human-generated observational scores, for which, we introduce “Student’s t-Distribution” method and explain how to use it to measure the IRR score using only these two data points, as well as the confidence intervals (CIs) of the quality evaluation. We give a quantitative analysis of how the evaluation confidence can be greatly improved by introducing more observations, even if only one extra observation. We encourage researchers to report their IRR scores in all possible means, e.g. using Student’s t-Distribution method whenever possible; thus making the NLP evaluation more meaningful, transparent, and trustworthy. This t-Distribution method can be also used outside of NLP fields to measure IRR level for trustworthy evaluation of experimental investigations, whenever the observational data is scarce.
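As a rough illustration of the point above about scarce observations, the following minimal sketch computes a textbook two-sided 95% confidence interval from Student's t-distribution for two scores and then for three. It is a generic t-interval, not necessarily the exact procedure used in the paper; the function name `t_confidence_interval` and the example scores are hypothetical, and the critical values are hard-coded from a standard t-table.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for small degrees of freedom
# (standard t-table values; hard-coded here to keep the sketch stdlib-only).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def t_confidence_interval(scores):
    """Return the (lower, upper) 95% CI for the mean of a small sample."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two observations for a sample std")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("critical value not tabulated for this sample size")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample (n - 1) standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    margin = T_CRIT_95[df] * sem
    return mean - margin, mean + margin


# Hypothetical quality scores from two raters, then with a third rater added.
two = t_confidence_interval([80.0, 90.0])
three = t_confidence_interval([80.0, 90.0, 85.0])
print(two, three)
```

With only two scores the critical value is 12.706, so the interval is very wide; a third score drops the critical value to 4.303 and narrows the interval considerably.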

AlphaMWE-Arabic: Arabic Edition of Multilingual Parallel Corpora with Multiword Expression Annotations
Najet Hadj Mohamed | Malak Rassem | Lifeng Han | Goran Nenadic
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Multiword Expressions (MWEs) have been a bottleneck for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks due to their idiomaticity, ambiguity, and non-compositionality. Bilingual parallel corpora introducing MWE annotations are very scarce which set another challenge for current Natural Language Processing (NLP) systems, especially in a multilingual setting. This work presents AlphaMWE-Arabic, an Arabic edition of the AlphaMWE parallel corpus with MWE annotations. We introduce how we created this corpus including machine translation (MT), post-editing, and annotations for both standard and dialectal varieties, i.e. Tunisian and Egyptian Arabic. We analyse the MT errors when they meet MWEs-related content, both quantitatively using the human-in-the-loop metric HOPE and qualitatively. We report the current state-of-the-art MT systems are far from reaching human parity performances. We expect our bilingual English-Arabic corpus will be an asset for multilingual research on MWEs such as translation and localisation, as well as for monolingual settings including the study of Arabic-specific lexicography and phrasal verbs on MWEs. Our corpus and experimental data are available at https://github.com/aaronlifenghan/AlphaMWE.

EDU-level Extractive Summarization with Varying Summary Lengths
Yuping Wu | Ching-Hsun Tseng | Jiayu Shang | Shengzhong Mao | Goran Nenadic | Xiao-Jun Zeng
Findings of the Association for Computational Linguistics: EACL 2023

Extractive models usually formulate text summarization as extracting fixed top-k salient sentences from the document as a summary. Few works exploited extracting finer-grained Elementary Discourse Unit (EDU) with little analysis and justification for the extractive unit selection. Further, the selection strategy of the fixed top-k salient sentences fits the summarization need poorly, as the number of salient sentences in different documents varies and therefore a common or best k does not exist in reality. To fill these gaps, this paper first conducts the comparison analysis of oracle summaries based on EDUs and sentences, which provides evidence from both theoretical and experimental perspectives to justify and quantify that EDUs make summaries with higher automatic evaluation scores than sentences. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in the document, generate multiple candidate summaries with varying lengths based on various k values, and encode and score candidate summaries, in an end-to-end training manner. Finally, EDU-VL is experimented on single and multi-document benchmark datasets and shows improved performances on ROUGE scores in comparison with state-of-the-art extractive models, and further human evaluation suggests that EDU-constituent summaries maintain good grammaticality and readability.

Do You Hear The People Sing? Key Point Analysis via Iterative Clustering and Abstractive Summarisation
Hao Li | Viktor Schlegel | Riza Batista-Navarro | Goran Nenadic
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Argument summarisation is a promising but currently under-explored field. Recent work has aimed to provide textual summaries in the form of concise and salient short texts, i.e., key points (KPs), in a task known as Key Point Analysis (KPA). One of the main challenges in KPA is finding high-quality key point candidates from dozens of arguments even in a small corpus. Furthermore, evaluating key points is crucial in ensuring that the automatically generated summaries are useful. Although automatic methods for evaluating summarisation have considerably advanced over the years, they mainly focus on sentence-level comparison, making it difficult to measure the quality of a summary (a set of KPs) as a whole. Aggravating this problem is the fact that human evaluation is costly and unreproducible. To address the above issues, we propose a two-step abstractive summarisation framework based on neural topic modelling with an iterative clustering procedure, to generate key points which are aligned with how humans identify key points. Our experiments show that our framework advances the state of the art in KPA, with performance improvement of up to 14 (absolute) percentage points, in terms of both ROUGE and our own proposed evaluation metrics. Furthermore, we evaluate the generated summaries using a novel set-based evaluation toolkit. Our quantitative analysis demonstrates the effectiveness of our proposed evaluation metrics in assessing the quality of generated KPs. Human evaluation further demonstrates the advantages of our approach and validates that our proposed evaluation metric is more consistent with human judgment than ROUGE scores.

MedTem2.0: Prompt-based Temporal Classification of Treatment Events from Discharge Summaries
Yang Cui | Lifeng Han | Goran Nenadic
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Discharge summaries are comprehensive medical records that encompass vital information about a patient’s hospital stay. A crucial aspect of discharge summaries is the temporal information of treatments administered throughout the patient’s illness. With an extensive volume of clinical documents, manually extracting and compiling a patient’s medication list can be laborious, time-consuming, and susceptible to errors. The objective of this paper is to build upon the recent development on clinical NLP by temporally classifying treatments in clinical texts, specifically determining whether a treatment was administered between the time of admission and discharge from the hospital. State-of-the-art NLP methods including prompt-based learning on Generative Pre-trained Transformers (GPTs) models and fine-tuning on pre-trained language models (PLMs) such as BERT were employed to classify temporal relations between treatments and hospitalisation periods in discharge summaries. Fine-tuning with the BERT model achieved an F1 score of 92.45% and a balanced accuracy of 77.56%, while prompt learning using the T5 model and mixed templates resulted in an F1 score of 90.89% and a balanced accuracy of 72.07%. Our codes and data are available at https://github.com/HECTA-UoM/MedTem.

Investigating Massive Multilingual Pre-Trained Machine Translation Models for Clinical Domain via Transfer Learning
Lifeng Han | Gleb Erofeev | Irina Sorokina | Serge Gladkoff | Goran Nenadic
Proceedings of the 5th Clinical Natural Language Processing Workshop

Massively multilingual pre-trained language models (MMPLMs) are developed in recent years demonstrating superpowers and the pre-knowledge they acquire for downstream tasks. This work investigates whether MMPLMs can be applied to clinical domain machine translation (MT) towards entirely unseen languages via transfer learning. We carry out an experimental investigation using Meta-AI’s MMPLMs “wmt21-dense-24-wide-en-X and X-en (WMT21fb)” which were pre-trained on 7 language pairs and 14 translation directions including English to Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese, and the opposite direction. We fine-tune these MMPLMs towards English-Spanish language pair which did not exist at all in their original pre-trained corpora both implicitly and explicitly. We prepare carefully aligned clinical domain data for this fine-tuning, which is different from their original mixed domain knowledge. Our experimental result shows that the fine-tuning is very successful using just 250k well-aligned in-domain EN-ES segments for three sub-task translation testings: clinical cases, clinical terms, and ontology concepts. It achieves very close evaluation scores to another MMPLM NLLB from Meta-AI, which included Spanish as a high-resource setting in the pre-training. To the best of our knowledge, this is the first work on using MMPLMs towards clinical domain transfer-learning NMT successfully for totally unseen languages during pre-training.

Team:PULSAR at ProbSum 2023:PULSAR: Pre-training with Extracted Healthcare Terms for Summarising Patients’ Problems and Data Augmentation with Black-box Large Language Models
Hao Li | Yuping Wu | Viktor Schlegel | Riza Batista-Navarro | Thanh-Tung Nguyen | Abhinav Ramesh Kashyap | Xiao-Jun Zeng | Daniel Beck | Stefan Winkler | Goran Nenadic
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Medical progress notes play a crucial role in documenting a patient’s hospital journey, including his or her condition, treatment plan, and any updates for healthcare providers. Automatic summarisation of a patient’s problems in the form of a “problem list” can aid stakeholders in understanding a patient’s condition, reducing workload and cognitive bias. BioNLP 2023 Shared Task 1A focusses on generating a list of diagnoses and problems from the provider’s progress notes during hospitalisation. In this paper, we introduce our proposed approach to this task, which integrates two complementary components. One component employs large language models (LLMs) for data augmentation; the other is an abstractive summarisation LLM with a novel pre-training objective for generating the patients’ problems summarised as a list. Our approach was ranked second among all submissions to the shared task. The performance of our model on the development and test datasets shows that our approach is more robust on unknown data, with an improvement of up to 3.1 points over the same size of the larger model.


Examining Large Pre-Trained Language Models for Machine Translation: What You Don’t Know about It
Lifeng Han | Gleb Erofeev | Irina Sorokina | Serge Gladkoff | Goran Nenadic
Proceedings of the Seventh Conference on Machine Translation (WMT)

Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual dataset that is freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) are proposed very recently to claim supreme performances over smaller-sized PLMs such as in machine translation (MT) tasks. These xLPLMs include Meta-AI’s wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine if xLPLMs are absolutely superior to smaller-sized PLMs in fine-tuning toward domain-specific MTs. We use two different in-domain data of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn2022 challenge at WMT2022. We choose the popular Marian Helsinki as smaller sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs. Our experimental investigation shows that 1) on smaller-sized in-domain commercial automotive data, xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using SacreBLEU and hLEPOR metrics than smaller-sized Marian, even though its score increase rate is lower than Marian after fine-tuning; 2) on relatively larger-size well prepared clinical data fine-tuning, the xLPLM NLLB tends to lose its advantage over smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using ClinSpEn offered metrics METEOR, COMET, and ROUGE-L, and totally lost to Marian on Task-1 (clinical cases) on all official metrics including SacreBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs; 4) clinic-Marian ranked No.2 on Task-1 (via SacreBLEU/BLEU) and Task-3 (via METEOR and ROUGE) among all submissions.


An efficient representation of chronological events in medical texts
Andrey Kormilitzin | Nemanja Vaci | Qiang Liu | Hao Ni | Goran Nenadic | Alejo Nevado-Holgado
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

In this work we addressed the problem of capturing sequential information contained in longitudinal electronic health records (EHRs). Clinical notes, which is a particular type of EHR data, are a rich source of information and practitioners often develop clever solutions how to maximise the sequential information contained in free-texts. We proposed a systematic methodology for learning from chronological events available in clinical notes. The proposed methodological path signature framework creates a non-parametric hierarchical representation of sequential events of any type and can be used as features for downstream statistical learning tasks. The methodology was developed and externally validated using the largest in the UK secondary care mental health EHR data on a specific task of predicting survival risk of patients diagnosed with Alzheimer’s disease. The signature-based model was compared to a common survival random forest model. Our results showed a 15.4% increase of risk prediction AUC at the time point of 20 months after the first admission to a specialist memory clinic and the signature method outperformed the baseline mixed-effects model by 13.2%.

A Framework for Evaluation of Machine Reading Comprehension Gold Standards
Viktor Schlegel | Marco Valentino | Andre Freitas | Goran Nenadic | Riza Batista-Navarro
Proceedings of the Twelfth Language Resources and Evaluation Conference

Machine Reading Comprehension (MRC) is the task of answering a question over a paragraph of text. While neural MRC systems gain popularity and achieve noticeable performance, issues are being raised with the methodology used to establish their performance, particularly concerning the data design of gold standards that are used to evaluate them. There is but a limited understanding of the challenges present in this data, which makes it hard to draw comparisons and formulate reliable hypotheses. As a first step towards alleviating the problem, this paper proposes a unifying framework to systematically investigate the present linguistic features, required reasoning and background knowledge and factual correctness on one hand, and the presence of lexical cues as a lower bound for the requirement of understanding on the other hand. We propose a qualitative annotation schema for the first and a set of approximative metrics for the latter. In a first application of the framework, we analyse modern MRC gold standards and present our findings: the absence of features that contribute towards lexical ambiguity, the varying factual correctness of the expected answers and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and quality of the evaluation data.


MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation
Maksim Belousov | William G. Dixon | Goran Nenadic
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

The medical concept normalisation task aims to map textual descriptions to standard terminologies such as SNOMED-CT or MedDRA. Existing publicly available datasets annotated using different terminologies cannot be simply merged and utilised, and therefore become less valuable when developing machine learning-based concept normalisation systems. To address that, we designed a data harmonisation pipeline and engineered a corpus of 27,979 textual descriptions simultaneously mapped to both MedDRA and SNOMED-CT, sourced from five publicly available datasets across biomedical and social media domains. The pipeline can be used in the future to integrate new datasets into the corpus and also could be applied in relevant data curation tasks. We also described a method to merge different terminologies into a single concept graph preserving their relations and demonstrated that representation learning approach based on random walks on a graph can efficiently encode both hierarchical and equivalent relations and capture semantic similarities not only between concepts inside a given terminology but also between concepts from different terminologies. We believe that making a corpus and embeddings for cross-terminology medical concept normalisation available to the research community would contribute to a better understanding of the task.


Inferring Methodological Meta-knowledge from Large Biomedical Corpora
Goran Nenadic
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Keynote Speeches and Invited Talks


Mining temporal footprints from Wikipedia
Michele Filannino | Goran Nenadic
Proceedings of the First AHA!-Workshop on Information Discovery in Text


ManTIME: Temporal expression identification and normalization in the TempEval-3 challenge
Michele Filannino | Gavin Brown | Goran Nenadic
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)


An Exploration of Mining Gene Expression Mentions and Their Anatomical Locations from Biomedical Text
Martin Gerner | Goran Nenadic | Casey M. Bergman
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

Using SVMs with the Command Relation features to identify negated events in biomedical literature
Farzaneh Sarafraz | Goran Nenadic
Proceedings of the Workshop on Negation and Speculation in Natural Language Processing


Biomedical Event Detection using Rules, Conditional Random Fields and Parse Tree Distances
Farzaneh Sarafraz | James Eales | Reza Mohammadi | Jonathan Dickerson | David Robertson | Goran Nenadic
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task


Towards a terminological resource for biomedical text mining
Goran Nenadic | Naoki Okazaki | Sophia Ananiadou
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

One of the main challenges in biomedical text mining is the identification of terminology, which is a key factor for accessing and integrating the information stored in literature. Manual creation of biomedical terminologies cannot keep pace with the data that becomes available. Still, many of them have been used in attempts to recognise terms in literature, but their suitability for text mining has been questioned as substantial re-engineering is needed to tailor the resources for automatic processing. Several approaches have been suggested to automatically integrate and map between resources, but the problems of extensive variability of lexical representations and ambiguity have been revealed. In this paper we present a methodology to automatically maintain a biomedical terminological database, which contains automatically extracted terms, their mutual relationships, features and possible annotations that can be useful in text processing. In addition to TermDB, a database used for terminology management and storage, we present the following modules that are used to populate the database: TerMine (recognition, extraction and normalisation of terms from literature), AcroTerMine (extraction and clustering of acronyms and their long forms), AnnoTerm (annotation and classification of terms), and ClusTerm (extraction of term associations and clustering of terms).

Annotation and Disambiguation of Semantic Types in Biomedical Text: A Cascaded Approach to Named Entity Recognition
Dietrich Rebholz-Schuhmann | Harald Kirsch | Sylvain Gaudan | Miguel Arregui | Goran Nenadic
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing


Exploring Balkanet Shared Ontology for Multilingual Conceptual Indexing
Sofia Stamou | Goran Nenadic | Dimitris Christodoulakis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Enhancing automatic term recognition through recognition of variation
Goran Nenadic | Sophia Ananiadou | John McNaught
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics


An Integrated Term-Based Corpus Query System
Irena Spasic | Goran Nenadic | Kostas Manios | Sophia Ananiadou
10th Conference of the European Chapter of the Association for Computational Linguistics

Using Domain-Specific Verbs for Term Classification
Irena Spasic | Goran Nenadic | Sophia Ananiadou
Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine

Selecting Text Features for Gene Name Classification: from Documents to Terms
Goran Nenadic | Simon Rice | Irena Spasic | Sophia Ananiadou | Benjamin Stapley
Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine

Morpho-syntactic Clues for Terminological Processing in Serbian
Goran Nenadić | Irena Spasić | Sophia Ananiadou
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages


Tuning Context Features with Genetic Algorithms
Irena Spasić | Goran Nenadić | Sophia Ananiadou
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Automatic Acronym Acquisition and Term Variation Management within Domain-Specific Texts
Goran Nenadić | Irena Spasić | Sophia Ananiadou
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

A Methodology for Terminology-based Knowledge Acquisition and Integration
Hideki Mima | Sophia Ananiadou | Goran Nenadic | Jun-Ichi Tsujii
COLING 2002: The 19th International Conference on Computational Linguistics

Automatic Discovery of Term Similarities Using Pattern Mining
Goran Nenadić | Irena Spasić | Sophia Ananiadou
COLING-02: COMPUTERM 2002: Second International Workshop on Computational Terminology