ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain

Mike Zhang, Rob van der Goot, Barbara Plank


Abstract
The increasing number of benchmarks for Natural Language Processing (NLP) tasks in the computational job market domain highlights the demand for methods that can handle job-related tasks such as skill extraction, skill classification, job title classification, and de-identification. While some approaches have been developed that are specific to the job market domain, there is a lack of generalized, multilingual models and benchmarks for these tasks. In this study, we introduce a language model called ESCOXLM-R, based on XLM-R-large, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy, covering 27 languages. The pre-training objectives for ESCOXLM-R include dynamic masked language modeling and a novel additional objective for inducing multilingual taxonomical ESCO relations. We comprehensively evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets. Our analysis reveals that ESCOXLM-R performs better on short spans and outperforms XLM-R-large on entity-level and surface-level span-F1, likely due to ESCO containing short skill and occupation titles, and encoding information on the entity-level.
Anthology ID:
2023.acl-long.662
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11871–11890
URL:
https://aclanthology.org/2023.acl-long.662
DOI:
10.18653/v1/2023.acl-long.662
Cite (ACL):
Mike Zhang, Rob van der Goot, and Barbara Plank. 2023. ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11871–11890, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain (Zhang et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.662.pdf
Video:
https://aclanthology.org/2023.acl-long.662.mp4