Jennifer D’Souza

Also published as: Jennifer D’souza

2025

YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D’Souza | Hamed Babaei Giglou | Quentin Münch
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.

Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models
Jennifer D’Souza | Zachary Laubach | Tarek Al Mustafa | Sina Zarrieß | Robert Frühstückl | Phyllis Illari
Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)

This study explores the use of large language models (LLMs), specifically GPT-4o, to extract key ecological entities—species, locations, habitats, and ecosystems—from invasion biology literature. This information is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Without domain-specific fine-tuning, we assess the potential and limitations of GPT-4o, out-of-the-box, for this task, highlighting the role of LLMs in advancing automated knowledge extraction for ecological research and management.

Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)
Wei Zhao | Jennifer D’Souza | Steffen Eger | Anne Lauscher | Yufang Hou | Nafise Sadat Moosavi | Tristan Miller | Chenghua Lin
Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)

SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library’s Open-Access Catalog
Jennifer D’souza | Sameer Sadruddin | Holger Israel | Mathias Begoin | Diana Slawig
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present SemEval-2025 Task 5: LLMs4Subjects, a shared task on automated subject tagging for scientific and technical records in English and German using the GND taxonomy. Participants developed LLM-based systems to recommend top-k subjects, evaluated through quantitative metrics (precision, recall, F1-score) and qualitative assessments by subject specialists. Results highlight the effectiveness of LLM ensembles, synthetic data generation, and multilingual processing, offering insights into LLMs for digital library classification. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available at https://github.com/emotion-analysis-project/SemEval2025-task11.

2024

Large Language Models for Scientific Information Extraction: An Empirical Study for Virology
Mahsa Shamsabadi | Jennifer D’Souza | Sören Auer
Findings of the Association for Computational Linguistics: EACL 2024

In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs’ emergent abilities. For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.

Large Language Models as Evaluators for Scientific Synthesis
Julia Evans | Jennifer D’Souza | Sören Auer
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

2022

NLPSharedTasks: A Corpus of Shared Task Overview Papers in Natural Language Processing Domains
Anna Martin | Ted Pedersen | Jennifer D’Souza
Proceedings of the First Workshop on Information Extraction from Scientific Publications

As the rate of scientific output continues to grow, it is increasingly important to develop systems to improve interfaces between researchers and scholarly papers. Training models to extract scientific information from the full texts of scholarly documents is important for improving how we structure and access scientific information. However, there are few annotated corpora that provide full paper texts. This paper presents the NLPSharedTasks corpus, a new resource of 254 full text Shared Task Overview papers in NLP domains with annotated task descriptions. We calculated strict and relaxed inter-annotator agreement scores, achieving Cohen’s kappa coefficients of 0.44 and 0.95, respectively. Lastly, we performed a sentence classification task over the dataset, in order to generate a neural baseline for future research and to provide an example of how to preprocess unbalanced datasets of full scientific texts. We achieved an F1 score of 0.75 using SciBERT, fine-tuned and tested on a rebalanced version of the dataset.

2021

SemEval-2021 Task 11: NLPContributionGraph - Structuring Scholarly NLP Contributions for a Research Knowledge Graph
Jennifer D’Souza | Sören Auer | Ted Pedersen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. ‘the NCG task’) tasked participants with developing automated systems that structure contributions from NLP scholarly articles in the English language. As the first of its kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article’s contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples. Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While absolute performance on triple generation remains low, the article concludes by highlighting the difficulty of producing such data and, as a consequence, of modeling it.

2020

The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
Jennifer D’Souza | Anett Hoppe | Arthur Brack | Mohmad Yaser Jaradeh | Sören Auer | Ralph Ewerth
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are reasonable in a setting as wide-ranging as STEM.

Fine-tuning BERT with Focus Words for Explanation Regeneration
Isaiah Onando Mulang’ | Jennifer D’Souza | Sören Auer
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Explanation generation, introduced with the WorldTree corpus (Jansen et al., 2018), is an emerging NLP task involving multi-hop inference for explaining the correct answer in multiple-choice QA. It is a challenging task, as evidenced by the low state-of-the-art performance (below 60% in F-score) demonstrated on it. Of the state-of-the-art approaches, fine-tuned transformer-based (Vaswani et al., 2017) BERT models have shown great promise toward continued system performance improvements compared with approaches relying on surface-level cues alone that demonstrate performance saturation. In this work, we take a novel direction by addressing a particular linguistic characteristic of the data: we introduce a novel and lightweight focus feature in the transformer-based model and examine task improvements. Our evaluations reveal a significantly positive impact of this lightweight focus feature achieving the highest scores, second only to a significantly computationally intensive system.

2019

Team SVMrank: Leveraging Feature-rich Support Vector Machines for Ranking Explanations to Elementary Science Questions
Jennifer D’Souza | Isaiah Onando Mulang’ | Sören Auer
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

The TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration (MIER-19) tackles explanation generation for answers to elementary science questions. It builds on the AI2 Reasoning Challenge 2018 (ARC-18) which was organized as an advanced question answering task on a dataset of elementary science questions. The ARC-18 questions were shown to be hard to answer with systems focusing on surface-level cues alone, instead requiring far more powerful knowledge and reasoning. To address MIER-19, we adopt a hybrid pipelined architecture comprising a feature-rich learning-to-rank (LTR) machine learning model, followed by a rule-based system for reranking the LTR model predictions. Our system was ranked fourth in the official evaluation, scoring close to the second and third ranked teams, achieving 39.4% MAP.

2015

Sieve-Based Spatial Relation Extraction with Expanding Parse Trees
Jennifer D’Souza | Vincent Ng
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Sieve-Based Entity Linking for the Biomedical Domain
Jennifer D’Souza | Vincent Ng
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

UTD: Ensemble-Based Spatial Relation Extraction
Jennifer D’Souza | Vincent Ng
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Ensemble-Based Medical Relation Classification
Jennifer D’Souza | Vincent Ng
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Annotating Inter-Sentence Temporal Relations in Clinical Notes
Jennifer D’Souza | Vincent Ng
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Owing in part to the surge of interest in temporal relation extraction, a number of datasets manually annotated with temporal relations between event-event pairs and event-time pairs have been produced recently. However, it is not uncommon to find missing annotations in these manually annotated datasets. Many researchers attributed this problem to “annotator fatigue”. While some of these missing relations can be recovered automatically, many of them cannot. Our goals in this paper are to (1) manually annotate certain types of missing links that cannot be automatically recovered in the i2b2 Clinical Temporal Relations Challenge Corpus, one of the recently released evaluation corpora for temporal relation extraction; and (2) empirically determine the usefulness of these additional annotations. We will make our annotations publicly available, in hopes of enabling a more accurate evaluation of temporal relation extraction systems.