Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining

Inderjeet Nair, Natwar Modani


Abstract
Pretraining large language models has resulted in tremendous performance improvements for many natural language processing (NLP) tasks. While such models can be used directly for non-domain-specific tasks, a common strategy to achieve better performance in specific domains involves pretraining these language models on domain-specific data using objectives such as Masked Language Modelling (MLM) and autoregressive language modelling. While such pretraining addresses the domain's change in vocabulary and language style, it is otherwise a domain-agnostic approach. In this work, we investigate the effect of incorporating pretraining objectives that explicitly exploit the domain-specific language characteristics in addition to such MLM-based pretraining. In particular, we examine two distinct characteristics associated with the legal domain and propose pretraining objectives modelling these characteristics. The proposed objectives target improved token-level feature representations and also aim to incorporate sentence-level semantics. We demonstrate that models pretrained using our objectives outperform those trained using domain-agnostic objectives on several legal downstream tasks.
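The MLM pretraining that the abstract builds on corrupts an input sequence and trains the model to recover the original tokens. A minimal, library-free sketch of the standard BERT-style corruption scheme is below (the function name, the toy vocabulary, and the 80/10/10 split applied at the usual 15% masking rate are illustrative assumptions, not details taken from the paper):

```python
import random

MASK_TOKEN = "[MASK]"

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style Masked Language Modelling corruption (illustrative sketch).

    Each token is independently selected with probability `mask_prob`.
    A selected token is replaced by [MASK] 80% of the time, by a random
    vocabulary token 10% of the time, and kept unchanged 10% of the time.
    Returns the corrupted sequence and per-position labels: the original
    token at selected positions, None elsewhere (ignored by the MLM loss).
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)  # kept unchanged but still predicted
        else:
            labels.append(None)  # position excluded from the loss
            corrupted.append(tok)
    return corrupted, labels

# Hypothetical legal-domain example sentence:
tokens = "the court held that the contract was void".split()
vocab = ["plaintiff", "statute", "herein", "appellant"]
corrupted, labels = mlm_mask(tokens, vocab, rng=random.Random(42))
```

The paper's contribution is to add further, legal-domain-aware objectives alongside this domain-agnostic corruption, rather than to replace it.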
Anthology ID:
2023.findings-eacl.190
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2516–2526
URL:
https://aclanthology.org/2023.findings-eacl.190
DOI:
10.18653/v1/2023.findings-eacl.190
Bibkey:
Cite (ACL):
Inderjeet Nair and Natwar Modani. 2023. Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2516–2526, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining (Nair & Modani, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.190.pdf
Video:
https://aclanthology.org/2023.findings-eacl.190.mp4