Prajna Devi Upadhyay
2024
BERT-based Idiom Identification using Language Translation and Word Cohesion
Arnav Yayavaram | Siddharth Yayavaram | Prajna Devi Upadhyay | Apurba Das
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
An idiom is a special type of multi-word expression whose meaning is figurative and cannot be deduced from the literal interpretation of its components. Idioms are prevalent in almost all languages and text genres, necessitating explicit handling by comprehensive NLP systems. Such phrases are referred to as Potentially Idiomatic Expressions (PIEs), and automatically identifying them in text is a challenging task. In this paper, we propose using a BERT-based model fine-tuned with custom objectives to improve the accuracy of detecting PIEs in text. Our custom loss functions capture two important properties (word cohesion and language translation) to distinguish PIEs from non-PIEs. We conducted several experiments on 7 datasets and showed that incorporating custom objectives while training the model leads to substantial gains. Our models trained using this approach also achieve better sequence accuracy than DISC, a state-of-the-art PIE detection technique, along with good transfer capabilities.
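The abstract describes augmenting a standard token-level training loss with a word-cohesion term. As a rough illustration (this is not the authors' code; the pairwise-cosine formulation of cohesion and the weighting factor `lam` are assumptions for the sketch), one could penalize low embedding similarity among tokens labeled as part of a PIE:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cohesion_penalty(embeddings, labels):
    """Average (1 - cosine similarity) over token pairs inside a PIE span.
    labels: 1 marks a token labeled as part of a potentially idiomatic expression."""
    idiom = [e for e, lab in zip(embeddings, labels) if lab == 1]
    pairs = [(i, j) for i in range(len(idiom)) for j in range(i + 1, len(idiom))]
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(idiom[i], idiom[j]) for i, j in pairs) / len(pairs)

def combined_loss(ce_loss, embeddings, labels, lam=0.1):
    """Base cross-entropy plus a weighted cohesion term (lam is a hypothetical weight)."""
    return ce_loss + lam * cohesion_penalty(embeddings, labels)
```

In an actual fine-tuning setup this term would be computed from the model's contextual embeddings each batch and added to the classification loss before backpropagation.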
LeGen: Complex Information Extraction from Legal sentences using Generative Models
Chaitra C R | Sankalp Kulkarni | Sai Rama Akash Varma Sagi | Shashank Pandey | Rohit Yalavarthy | Dipanjan Chakraborty | Prajna Devi Upadhyay
Proceedings of the Natural Legal Language Processing Workshop 2024
Constructing legal knowledge graphs from unstructured legal texts is a complex challenge due to the intricate nature of legal language. While open information extraction (OIE) techniques can convert text into triples of the form (subject, relation, object), they often fall short of capturing the nuanced relationships within lengthy legal sentences, necessitating more sophisticated approaches known as complex information extraction. This paper proposes LeGen, an end-to-end approach leveraging pre-trained large language models (GPT-4o, T5, BART) to perform complex information extraction from legal sentences. LeGen learns and represents the discourse structure of legal sentences, capturing both their complexity and semantics. It minimizes the error propagation typical of multi-step pipelines and achieves up to a 32.2% gain on the Indian Legal benchmark. Additionally, it demonstrates competitive performance on open information extraction benchmarks. A promising application of the resulting legal knowledge graphs is in developing question-answering systems for government schemes, tailored to the Next Billion Users who struggle with the complexity of legal language. Our code and data are available at https://github.com/prajnaupadhyay/LegalIE
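An end-to-end generative extractor ultimately has to turn free-form model output into structured triples for the knowledge graph. As a minimal sketch (the pipe-separated output format and the function name are assumptions for illustration, not LeGen's actual interface), the post-processing step might look like:

```python
def parse_triples(generated: str):
    """Parse generative-model output where each line is expected to read
    'subject | relation | object', returning (subject, relation, object) tuples.
    Lines that do not match the expected three-field shape are skipped."""
    triples = []
    for line in generated.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples
```

Parsed triples can then be loaded directly as edges of a legal knowledge graph, with the subject and object as nodes.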