Tao Deng
2024
From Text Segmentation to Enhanced Representation Learning: A Novel Approach to Multi-Label Classification for Long Texts
Wang Zhang | Xin Wang | Qian Wang | Tao Deng | Xiaoru Wu
Findings of the Association for Computational Linguistics: EMNLP 2024
Multi-label text classification (MLTC) is an important task in natural language processing. Most existing models rely on high-quality text representations provided by pre-trained language models (PLMs) and therefore inherit the input length limit of PLMs when dealing with long texts. In light of this, we introduce a comprehensive approach to multi-label long text classification. To address the input length limit, we propose a text segmentation algorithm that is guaranteed to produce the optimal segmentation. We incorporate external knowledge, label co-occurrence relations, and attention mechanisms into representation learning to enhance both text and label representations. Extensive experiments on various MLTC datasets validate the effectiveness of our method and unravel the intricate correlations between texts and labels.
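
The abstract does not state which objective the segmentation algorithm optimizes, so the following is only an illustrative sketch and not the authors' method. It assumes a long text has already been split into sentences with known token counts, that every segment must fit a hypothetical PLM budget `max_len`, and that "optimal" means minimizing the sum of squared unused capacity across segments. The function name `segment` and this cost function are assumptions made for the example. Under those assumptions, dynamic programming over sentence boundaries finds a provably minimal-cost split, whereas greedy packing can leave a nearly empty final segment.

```python
from typing import List


def segment(lengths: List[int], max_len: int) -> List[List[int]]:
    """Split sentences (given by token counts) into contiguous segments.

    Each segment holds at most max_len tokens. Among all valid splits, the
    one minimizing the sum of squared unused capacity is returned.
    The cost function is an assumption chosen for illustration.
    """
    n = len(lengths)
    if any(length > max_len for length in lengths):
        raise ValueError("a single sentence exceeds max_len")

    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost to segment the first i sentences
    back = [0] * (n + 1)    # back[i]: start index of the last segment in that solution
    best[0] = 0

    for i in range(1, n + 1):
        total = 0
        # Grow the last segment backwards from sentence i-1 while it still fits.
        for j in range(i - 1, -1, -1):
            total += lengths[j]
            if total > max_len:
                break
            cost = best[j] + (max_len - total) ** 2
            if cost < best[i]:
                best[i] = cost
                back[i] = j

    # Recover the segments by following the back pointers from the end.
    segments: List[List[int]] = []
    i = n
    while i > 0:
        j = back[i]
        segments.append(list(range(j, i)))
        i = j
    segments.reverse()
    return segments


if __name__ == "__main__":
    # Greedy packing would yield [[0, 1, 2], [3, 4]]; the DP balances the segments.
    print(segment([3, 3, 2, 2, 2], max_len=8))
```

Each inner list holds the sentence indices assigned to one segment, so each segment can then be encoded separately by the PLM. Swapping in a different cost (for example, one that penalizes splitting related sentences) only changes the `cost` line, and the result remains optimal for that cost.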