Extracting position titles from unstructured historical job advertisements

Klara Venglarova; Raven Adam; Georg Vogeler

Abstract

This paper explores the automated extraction of job titles from unstructured historical job advertisements, using a corpus of digitized German-language newspapers from 1850 to 1950. The study addresses the challenges of working with unstructured, OCR-processed historical data, in contrast to contemporary approaches to this text type, which often rely on structured, born-digital datasets. We compare four extraction methods: a dictionary-based approach, a rule-based approach, a named entity recognition (NER) model, and a text-generation method. The NER approach, trained on manually annotated data, achieves the highest F1 score (0.944 with a transformer-based model trained on GPU, 0.884 with a model trained on CPU), demonstrating its flexibility and ability to correctly identify job titles. The text-generation approach performs similarly (0.920). The rule-based (0.69) and dictionary-based (0.632) methods also reach relatively high F1 scores, while offering the advantage of not requiring extensive labeling of training data. The results highlight the complexities of extracting meaningful job titles from historical texts, with implications for further research into labor market trends and occupational history.
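To make the dictionary-based baseline and the F1 metric concrete, the following is a minimal sketch, not the authors' implementation. The lexicon `JOB_TITLES`, the sample advertisement, and the helper names `extract_titles` and `f1_score` are all hypothetical, and the exact-match lookup assumes clean text; OCR noise in real historical data would require fuzzier matching.

```python
import re

# Hypothetical mini-lexicon; the dictionary used in the paper is not shown here.
JOB_TITLES = {"Köchin", "Buchhalter", "Kellner", "Verkäuferin"}


def extract_titles(ad_text, lexicon=JOB_TITLES):
    """Return lexicon entries that occur as whole tokens in the ad, in order."""
    tokens = re.findall(r"\w+", ad_text)  # \w is Unicode-aware for str in Python 3
    return [token for token in tokens if token in lexicon]


def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold titles for a single ad."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example ad: "A capable cook and a waiter are sought immediately."
ad = "Tüchtige Köchin und ein Kellner werden sofort gesucht."
predicted = extract_titles(ad)
```

A lookup of this kind needs no annotated training data, which is the advantage the abstract attributes to the dictionary-based and rule-based methods, but it cannot find titles missing from the lexicon, which lowers recall and therefore F1.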

Anthology ID: 2024.nlp4dh-1.8
Volume: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month: November
Year: 2024
Address: Miami, USA
Editors: Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue: NLP4DH
Publisher: Association for Computational Linguistics
Pages: 75–84
URL: https://aclanthology.org/2024.nlp4dh-1.8
Cite (ACL): Klara Venglarova, Raven Adam, and Georg Vogeler. 2024. Extracting position titles from unstructured historical job advertisements. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 75–84, Miami, USA. Association for Computational Linguistics.
Cite (Informal): Extracting position titles from unstructured historical job advertisements (Venglarova et al., NLP4DH 2024)
PDF: https://aclanthology.org/2024.nlp4dh-1.8.pdf
