GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets

Wolfgang Otto, Matthäus Zloch, Lu Gan, Saurav Karmakar, Stefan Dietze


Abstract
Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like “our BERT-based model” or “an image CNN”. You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
Anthology ID:
2023.findings-emnlp.548
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8166–8176
Language:
URL:
https://aclanthology.org/2023.findings-emnlp.548
DOI:
10.18653/v1/2023.findings-emnlp.548
Bibkey:
Cite (ACL):
Wolfgang Otto, Matthäus Zloch, Lu Gan, Saurav Karmakar, and Stefan Dietze. 2023. GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8166–8176, Singapore. Association for Computational Linguistics.
Cite (Informal):
GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets (Otto et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-emnlp.548.pdf