Long Text Classification using Transformers with Paragraph Selection Strategies

Mohit Tuteja, Daniel González Juclà


Abstract
In the legal domain, we often perform classification tasks on very long documents, such as court judgements. These documents often contain thousands of words, and their length poses a challenge for this modelling task. In this research paper, we present a comprehensive evaluation of strategies for long text classification that use Transformers in conjunction with traditional NLP models for selecting document chunks. We conduct our experiments on 6 benchmark datasets comprising lengthy documents, 4 of which are publicly available. Each dataset has a median word count exceeding 1,000. Our evaluation encompasses state-of-the-art Transformer models, such as RoBERTa, Longformer, HAT, MEGA and LegalBERT, and compares them with a traditional baseline: a TF-IDF + Neural Network (NN) model. We investigate the effectiveness of pre-training on large corpora, fine-tuning strategies, and transfer learning techniques in the context of long text classification.
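To make the idea of combining paragraph selection with a Transformer classifier concrete, here is a minimal sketch, not the authors' exact pipeline: paragraphs are scored with a simple TF-IDF heuristic, the top-k are kept in document order, and their concatenation is passed to a RoBERTa sequence classifier. The function name `select_paragraphs`, the choice of k, and the scoring heuristic are illustrative assumptions.

```python
# Sketch of one possible paragraph selection + Transformer classification setup.
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def select_paragraphs(paragraphs, k=5):
    """Rank paragraphs by the sum of their TF-IDF weights and keep the top k,
    preserving their original order in the document. (Illustrative heuristic.)"""
    tfidf = TfidfVectorizer(stop_words="english")
    scores = np.asarray(tfidf.fit_transform(paragraphs).sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])
    return [paragraphs[i] for i in top]

# Hypothetical document split into paragraphs (e.g. a court judgement).
document = [
    "The appellant filed a claim concerning the contested decision.",
    "The court considered the submissions of both parties.",
    "Costs were awarded to the respondent.",
]
selected_text = " ".join(select_paragraphs(document, k=2))

# Feed the selected paragraphs to a Transformer classifier (RoBERTa here).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
inputs = tokenizer(selected_text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
```

In practice the selection step lets a standard 512-token model see the most informative parts of a document that would otherwise be truncated, while long-context models such as Longformer or HAT can instead consume larger spans directly.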
Anthology ID:
2023.nllp-1.3
Volume:
Proceedings of the Natural Legal Language Processing Workshop 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Daniel Preoțiuc-Pietro, Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos (Jerry) Spanakis, Nikolaos Aletras
Venues:
NLLP | WS
Publisher:
Association for Computational Linguistics
Pages:
17–24
URL:
https://aclanthology.org/2023.nllp-1.3
DOI:
10.18653/v1/2023.nllp-1.3
Cite (ACL):
Mohit Tuteja and Daniel González Juclà. 2023. Long Text Classification using Transformers with Paragraph Selection Strategies. In Proceedings of the Natural Legal Language Processing Workshop 2023, pages 17–24, Singapore. Association for Computational Linguistics.
Cite (Informal):
Long Text Classification using Transformers with Paragraph Selection Strategies (Tuteja & González Juclà, NLLP-WS 2023)
PDF:
https://aclanthology.org/2023.nllp-1.3.pdf