Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions

Thomas Green, Diana Maynard, Chenghua Lin


Abstract
We present the development of a benchmark suite consisting of an annotation schema, training corpus and baseline model for Entity Recognition (ER) in job descriptions, published under a Creative Commons license. This was created to address the distinct lack of resources available to the community for the extraction of salient entities, such as skills, from job descriptions. The dataset contains 18.6k entities comprising five types (Skill, Qualification, Experience, Occupation, and Domain). We include a benchmark CRF-based ER model which achieves an F1 score of 0.59. Through the establishment of a standard definition of entities and training/testing corpus, the suite is designed as a foundation for future work on tasks such as the development of job recommender systems.
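As a rough illustration of the kind of CRF-based sequence-labelling baseline the abstract describes, the sketch below trains a CRF over BIO-tagged tokens for the five entity types (Skill, Qualification, Experience, Occupation, Domain). It uses sklearn-crfsuite with generic word-shape features; the paper's actual toolkit, feature set, and training data are not specified here, so treat every name and example sentence as an assumption rather than the authors' implementation.

```python
# Minimal sketch of a CRF entity-recognition baseline over BIO tags,
# assuming sklearn-crfsuite; the paper's own setup may differ.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(sent, i):
    """Simple word-shape and context features for the token at position i."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    return feats

# Illustrative toy sentence; the real corpus provides ~18.6k annotated entities.
train_sents = [["Proven", "experience", "with", "Python", "is", "essential"]]
train_tags = [["B-Experience", "I-Experience", "O", "B-Skill", "O", "O"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_tags

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1,
    max_iterations=100, all_possible_transitions=True,
)
crf.fit(X_train, y_train)

# Evaluate on the training toy data just to show the call; a real run would
# hold out a test split and report entity-level F1 (the paper reports 0.59).
y_pred = crf.predict(X_train)
entity_labels = [l for l in crf.classes_ if l != "O"]
print(metrics.flat_f1_score(y_train, y_pred, average="weighted", labels=entity_labels))
```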
Anthology ID:
2022.lrec-1.128
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1201–1208
URL:
https://aclanthology.org/2022.lrec-1.128
Cite (ACL):
Thomas Green, Diana Maynard, and Chenghua Lin. 2022. Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1201–1208, Marseille, France. European Language Resources Association.
Cite (Informal):
Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions (Green et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.128.pdf