Long-tailed Extreme Multi-label Text Classification by the Retrieval of Generated Pseudo Label Descriptions

Ruohong Zhang; Yau-Shian Wang; Yiming Yang (杨亦鸣); Donghan Yu; Tom Vu; Likun Lei

doi:10.18653/v1/2023.findings-eacl.81

Abstract

Extreme Multi-label Text Classification (XMTC) has been a tough challenge in machine learning research and applications due to the sheer size of the label spaces and the severe data scarcity problem associated with the long tail of rare labels in highly skewed distributions. This paper addresses the challenge of tail label prediction by leveraging the power of a dense neural retrieval model to map input documents (as queries) to relevant label descriptions. To further enhance the quality of label descriptions, we propose to generate pseudo label descriptions from a trained bag-of-words (BoW) classifier, which demonstrates better classification performance under severely data-scarce conditions. The proposed approach achieves state-of-the-art (SOTA) performance in overall label prediction on XMTC benchmark datasets and, in particular, outperforms the SOTA models in tail label prediction. We also provide a theoretical analysis relating the BoW and neural models with respect to a performance lower bound.
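
The two ideas in the abstract (deriving a pseudo description for each label from a bag-of-words model, then treating a document as a query that retrieves label descriptions) can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the trained BoW classifier is replaced by a simple document-frequency contrast, the dense neural retriever is replaced by cosine similarity over term counts, and the function names, the cutoff `k`, and the tiny corpus are all hypothetical.

```python
from collections import Counter
import math


def tokenize(text):
    # Whitespace tokenizer; a real system would use a proper tokenizer.
    return text.lower().split()


def label_term_weights(docs, labels):
    # Stand-in for a trained BoW classifier (assumption): the weight of a
    # term for a label is the fraction of that label's documents containing
    # the term minus the fraction of all documents containing it.
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    global_df = Counter(t for terms in doc_terms for t in terms)
    weights = {}
    for label in sorted({l for ls in labels for l in ls}):
        members = [terms for terms, ls in zip(doc_terms, labels) if label in ls]
        label_df = Counter(t for terms in members for t in terms)
        weights[label] = {
            t: label_df[t] / len(members) - global_df[t] / n for t in label_df
        }
    return weights


def pseudo_descriptions(weights, k=3):
    # Keep the k highest-weighted terms per label (ties broken alphabetically).
    return {
        label: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, w in weights.items()
    }


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_labels(document, descriptions):
    # Stand-in for dense retrieval (assumption): the document is the query,
    # each pseudo description is a candidate, scored by BoW cosine similarity.
    query = Counter(tokenize(document))
    scores = {
        label: cosine(query, Counter(terms))
        for label, terms in descriptions.items()
    }
    return sorted(scores, key=lambda l: (-scores[l], l))


# Hypothetical toy corpus; "vision" plays the role of a tail label
# because it has a single training document.
docs = [
    "neural network training with gradient descent",
    "deep neural network for image recognition",
    "protein folding and gene expression",
    "gene sequencing of rare protein variants",
]
labels = [{"ml"}, {"ml", "vision"}, {"biology"}, {"biology"}]

descriptions = pseudo_descriptions(label_term_weights(docs, labels))
ranking = rank_labels("rare gene linked to protein misfolding", descriptions)
print(descriptions)
print(ranking)
```

In this sketch, a label with very few training documents still receives a usable textual description, which is the intuition the abstract gives for why generated descriptions may help on the long tail. A real system would substitute a learned encoder for the cosine scorer.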

Anthology ID: 2023.findings-eacl.81
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1092–1106
URL: https://aclanthology.org/2023.findings-eacl.81/
DOI: 10.18653/v1/2023.findings-eacl.81
Cite (ACL): Ruohong Zhang, Yau-Shian Wang, Yiming Yang, Donghan Yu, Tom Vu, and Likun Lei. 2023. Long-tailed Extreme Multi-label Text Classification by the Retrieval of Generated Pseudo Label Descriptions. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1092–1106, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Long-tailed Extreme Multi-label Text Classification by the Retrieval of Generated Pseudo Label Descriptions (Zhang et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.81.pdf
Video: https://aclanthology.org/2023.findings-eacl.81.mp4
