Hierarchical Multi-Label Classification of Scientific Documents

Mobashir Sadat; Cornelia Caragea

doi:10.18653/v1/2022.emnlp-main.610

Abstract

Automatic topic classification has been studied extensively to assist in managing and indexing scientific documents in digital collections. As the number of available topics has grown in recent years, it has become necessary to arrange them in a hierarchy, so automatic classification systems need to be able to classify documents hierarchically. In addition, each paper is often assigned to more than one relevant topic; for example, a paper can be assigned to several topics in a hierarchy tree. In this paper, we introduce SciHTC, a new dataset for hierarchical multi-label text classification (HMLTC) of scientific papers, which contains 186,160 papers and 1,234 categories from the ACM CCS tree. We establish strong baselines for HMLTC and propose a multi-task learning approach for topic classification with keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities on hierarchical scientific topic classification. We make our dataset and code for all experiments publicly available.
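
As a rough illustration of the two ideas the abstract relies on (a paper holding several labels in a topic tree, and scoring with Macro-F1), the following minimal Python sketch may help. It is not the authors' code or evaluation protocol: the toy `PARENT` hierarchy, the helper names `with_ancestors` and `macro_f1`, and the choice to add every ancestor of an assigned topic are all assumptions made for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (child -> parent); NOT the real ACM CCS tree.
PARENT: Dict[str, str] = {
    "Machine learning": "Computing methodologies",
    "Natural language processing": "Computing methodologies",
    "Digital libraries": "Information systems",
}


def with_ancestors(labels: Set[str]) -> Set[str]:
    """Expand a label set so that every label's ancestors are included."""
    expanded: Set[str] = set()
    for label in labels:
        current = label
        while current is not None:
            expanded.add(current)
            current = PARENT.get(current)  # None once we reach a root
    return expanded


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-category F1 over all categories observed."""
    categories = sorted(set().union(*gold, *pred))
    scores = []
    for c in categories:
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two toy papers: gold topics versus predicted topics.
gold = [with_ancestors({"Machine learning"}), with_ancestors({"Digital libraries"})]
pred = [with_ancestors({"Natural language processing"}), with_ancestors({"Digital libraries"})]
score = macro_f1(gold, pred)
```

Because Macro-F1 averages per-category scores without weighting by frequency, rare categories count as much as common ones, which is one reason scores on a label set with over a thousand categories can be low.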

Anthology ID: 2022.emnlp-main.610
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 8923–8937
URL: https://aclanthology.org/2022.emnlp-main.610
DOI: 10.18653/v1/2022.emnlp-main.610
Cite (ACL): Mobashir Sadat and Cornelia Caragea. 2022. Hierarchical Multi-Label Classification of Scientific Documents. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8923–8937, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Hierarchical Multi-Label Classification of Scientific Documents (Sadat & Caragea, EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.610.pdf
