Longin Jan Latecki

Also published as: Longin Latecki

2025

ClimateIE: A Dataset for Climate Science Information Extraction
Huitong Pan | Mustapha Adamu | Qi Zhang | Eduard Dragut | Longin Jan Latecki
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)

The rapid growth of climate science literature necessitates advanced information extraction (IE) systems to structure knowledge for researchers and policymakers. We introduce ClimateIE, a novel framework combining taxonomy-guided large language model (LLM) annotation with expert validation to address three core tasks: climate-specific named entity recognition, relationship extraction, and entity linking. Our contributions include: (1) the ClimateIE-Corpus—500 climate publications annotated via a hybrid human-AI pipeline with mappings to the extended GCMD+ taxonomy; (2) systematic evaluation showing Llama-3.3-70B achieves state-of-the-art performance (strict F1: 0.378 NER, 0.367 EL), outperforming larger commercial models (GPT-4o) and domain-adapted baselines (ClimateGPT) by 11-58%; and (3) analysis revealing critical challenges in technical relationship extraction (MountedOn: 0.000 F1) and emerging concept linking (26.4% unlinkable entities). Upon acceptance, we will release the corpus, toolkit, and guidelines to advance climate informatics, establishing benchmarks for NLP in Earth system science and underscoring the need for dynamic taxonomy governance and implicit relationship modeling.

pdf bib abs

DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
Qi Zhang | Huitong Pan | Zhijia Chen | Longin Jan Latecki | Cornelia Caragea | Eduard Dragut
Findings of the Association for Computational Linguistics: NAACL 2025

Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most of the existing work attempt to solve this problem by developing intricate models to learn from the noisy labels. An alternative approach is to attempt to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.

pdf bib abs

Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications
Huitong Pan | Qi Zhang | Mustapha Adamu | Eduard Dragut | Longin Jan Latecki
Findings of the Association for Computational Linguistics: ACL 2025

We present a taxonomy-driven framework for constructing domain-specific knowledge graphs (KGs) that integrates structured taxonomies, Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Although we focus on climate science to illustrate its effectiveness, our approach can potentially be adapted for other specialized domains. Existing methods often neglect curated taxonomies—hierarchies of verified entities and relationships—and LLMs frequently struggle to extract KGs in specialized domains. Our approach addresses these gaps by anchoring extraction to expert-curated taxonomies, aligning entities and relations with domain semantics, and validating LLM outputs using RAG against the domain taxonomy. Through a climate science case study using our annotated dataset of 25 publications (1,705 entity-publication links, 3,618 expert-validated relationships), we demonstrate that taxonomy-guided LLM prompting combined with RAG-based validation reduces hallucinations by 23.3% while improving F1 scores by 13.9% compared to baselines without the proposed techniques. Our contributions include: 1) a generalizable methodology for taxonomy-aligned KG construction; 2) a reproducible annotation pipeline, 3) the first benchmark dataset for climate science information retrieval; and 4) empirical insights into combining structured taxonomies with LLMs for specialized domains. The dataset, including expert annotations and taxonomy-aligned outputs, is publicly available at https://github.com/Jo-Pan/ClimateIE, and the accompanying framework can be accessed at https://github.com/Jo-Pan/TaxoDrivenKG.

2024

pdf bib abs

SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions
Huitong Pan | Qi Zhang | Cornelia Caragea | Eduard Dragut | Longin Jan Latecki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.

pdf bib abs

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents
Qi Zhang | Zhijia Chen | Huitong Pan | Cornelia Caragea | Longin Jan Latecki | Eduard Dragut
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Scientific information extraction (SciIE) is critical for converting unstructured knowledge from scholarly articles into structured data (entities and relations). Several datasets have been proposed for training and validating SciIE models. However, due to the high complexity and cost of annotating scientific texts, those datasets restrict their annotations to specific parts of paper, such as abstracts, resulting in the loss of diverse entity mentions and relations in context. In this paper, we release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations. To capture the intricate use and interactions among entities in full texts, our dataset contains a fine-grained tag set for relations. Additionally, we provide an out-of-distribution test set to offer a more realistic evaluation. We conduct comprehensive experiments, including state-of-the-art supervised models and our proposed LLM-based baselines, and highlight the challenges presented by our dataset, encouraging the development of innovative models to further the field of SciIE.

2023

pdf bib abs

DMDD: A Large-Scale Dataset for Dataset Mentions Detection
Huitong Pan | Qi Zhang | Eduard Dragut | Cornelia Caragea | Longin Jan Latecki
Transactions of the Association for Computational Linguistics, Volume 11

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.