An Approach to the Frugal Use of Human Annotators to Scale up Auto-coding for Text Classification Tasks
Li’An Chen | Hanna Suominen
Proceedings of the The 19th Annual Workshop of the Australasian Language Technology Association
Human annotation for establishing the training data is often a very costly process in natural language processing (NLP) tasks, which has led to frugal NLP approaches becoming an important research topic. Many research teams struggle to complete projects with limited funding, labor, and computational resources. Driven by the Move-Step analytic framework theorized in the applied linguistics field, our study offers a rigorous approach to the frugal use of two human annotators to scale up auto-coding for text classification tasks. We applied the Linear Support Vector Machine algorithm to text classification of a job ad corpus. Our Cohenâs Kappa for inter-rater agreement and Area Under the Curve (AUC) values reached averages of 0.76 and 0.80, respectively. The calculated time consumption for our human training process was 36 days. The results indicated that even the strategic and frugal use of only two human annotators could enable the efficient training of classifiers with reasonably good performance. This study does not aim to provide generalizability of the results. Rather, we propose that the annotation strategies arising from this study be considered by our readers only if such strategies are fit for one’s specific research purposes.
A machine-learning based model to identify PhD-level skills in job ads
Li’An Chen | Inger Mewburn | Hanna Suonimen
Proceedings of the The 18th Annual Workshop of the Australasian Language Technology Association
Around 60% of doctoral graduates worldwide ended up working in industry rather than academia. There have been calls to more closely align the PhD curriculum with the needs of industry, but an evidence base is lacking to inform these changes. We need to find better ways to understand what industry employers really want from doctoral graduates. One good source of data is job advertisements where employers provide a ‘wish list’ of skills and expertise. In this paper, a machine learning-natural language processing (ML-NLP) based approach was used to explore and extract skill requirements from research intensive job advertisements, suitable for PhD graduates. The model developed for detecting skill requirements in job ads was driven by SVM. The experiment results showed that ML-NLP approach had the potential to replicate manual efforts in understanding job requirements of PhD graduates. Our model offers a new perspective to look at PhD-level job skill requirements.