Santosh T.Y.S.S

Also published as: Santosh T.y.s.s., Santosh T.y.s.s

2025

pdf bib abs
LexTempus: Enhancing Temporal Generalizability of Legal Language Models Through Dynamic Mixture of Experts
Santosh T.y.s.s | Tuan-Quang Vuong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid evolution of legal concepts over time necessitates that legal language models adapt swiftly accounting for the temporal dynamics. However, prior works have largely neglected this crucial dimension, treating legal adaptation as a static problem rather than a continuous process. To address this gap, we pioneer LexTempus, a dynamic mixture of experts model that explicitly models the temporal evolution of legal language in a parameter-efficient online learning framework. LexTempus starts with a single lightweight adapter expert and dynamically expands by adding new experts as significant deviations in the data distribution are detected. This self-expansion strategy allows LexTempus to adapt to new information without forgetting past knowledge, thereby improving temporal generalization. We use a a non-parametric similarity-based router to merge relevant experts into a unified expert for each test instance, ensuring efficient inference without additional overhead. We validate the effectiveness of LexTempus on ECHR and EU case law datasets, demonstrating its superiority in both perplexity and open-ended text generation quality metrics.

pdf bib abs
LexCLiPR: Cross-Lingual Paragraph Retrieval from Legal Judgments
Rohit Upadhya | Santosh T.y.s.s
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Efficient retrieval of pinpointed information from case law is crucial for legal professionals but challenging due to the length and complexity of legal judgments. Existing works mostly often focus on retrieving entire cases rather than precise, paragraph-level information. Moreover, multilingual legal practice necessitates cross-lingual retrieval, most works have been limited to monolingual settings. To address these gaps, we introduce LexCLiPR, a cross-lingual dataset for paragraph-level retrieval from European Court of Human Rights (ECtHR) judgments, leveraging multilingual case law guides and distant supervision to curate our dataset. We evaluate retrieval models in a zero-shot setting, revealing the limitations of pre-trained multilingual models for cross-lingual tasks in low-resource languages and the importance of retrieval based post-training strategies. In fine-tuning settings, we observe that two-tower models excel in cross-lingual retrieval, while siamese architectures are better suited for monolingual tasks. Fine-tuning multilingual models on native language queries improves performance but struggles to generalize to unseen legal concepts, highlighting the need for robust strategies to address topical distribution shifts in the legal queries.

Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex)—a decoding strategy that dynamically interpolates the model produced vocabulary distribution with a distribution derived based on copying from the context. CoCoLex encourages direct copying based on models’ confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.

pdf bib abs
ProMALex: Progressive Modular Adapters for Multi-Jurisdictional Legal Language Modeling
Santosh T.y.s.s | Mohamed Hesham Elganayni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper addresses the challenge of adapting language models to the jurisdiction-specific nature of legal corpora. Existing approaches—training separate models for each jurisdiction or using a single shared model—either fail to leverage common legal principles beneficial for low-resource settings or risk negative interference from conflicting jurisdictional interpretations. To overcome these limitations, we propose a parameter-efficient framework ProMALex, that first derives hierarchical relationships across jurisdictions and progressively inserts adapter modules across model layers based on jurisdictional similarity. This design allows modules in lower layers to be shared across jurisdictions, capturing common legal principles, while higher layers specialize through jurisdiction-specific adapters. Experimental results on two legal language modeling benchmarks demonstrate that ProMALex outperforms both fully shared and jurisdiction-specific models.

pdf bib abs
Fairness Beyond Performance: Revealing Reliability Disparities Across Groups in Legal NLP
Santosh T.y.s.s | Irtiza Chowdhury
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fairness in NLP must extend beyond performance parity to encompass equitable reliability across groups. This study exposes a criticalblind spot: models often make less reliable or overconfident predictions for marginalized groups, even when overall performance appearsfair. Using the FairLex benchmark as a case study in legal NLP, we systematically evaluate both performance and reliability dispari-ties across demographic, regional, and legal attributes spanning four jurisdictions. We show that domain-specific pre-training consistentlyimproves both performance and reliability, especially for underrepresented groups. However, common bias mitigation methods frequentlyworsen reliability disparities, revealing a trade-off not captured by performance metrics alone. Our results call for a rethinking of fairnessin high-stakes NLP: To ensure equitable treatment, models must not only be accurate, but also reliably self-aware across all groups.

pdf bib abs
LexKeyPlan: Planning with Keyphrases and Retrieval Augmentation for Legal Text Generation: A Case Study on European Court of Human Rights Cases
Santosh T.y.s.s | Elvin Quero Hernandez
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large language models excel at legal text generation but often produce hallucinations due to their sole reliance on parametric knowledge. Retrieval-augmented models mitigate this by providing relevant external documents to the model but struggle when retrieval is based only on past context, which may not align with the model’s intended future content. We introduce LexKeyPlan, a novel framework that integrates anticipatory planning into generation. Instead of relying solely on context for retrieval, LexKeyPlan generates keyphrases outlining future content serving as forward-looking plan, guiding retrieval for more accurate text generation. This work incorporates planning into legal text generation, demonstrating how keyphrases—representing legal concepts—enhance factual accuracy. By structuring retrieval around legal concepts, LexKeyPlan better aligns with legal reasoning, making it particularly suited for legal applications. Using the ECHR corpus as case study, we show that LexKeyPlan improves factual accuracy and coherence by retrieving information aligned with the intended content.

pdf bib abs
LexGenie: Automated Generation of Structured Reports for European Court of Human Rights Case Law
Santosh T.y.s.s | Mahmoud Aly | Oana Ichim | Matthias Grabmair
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Analyzing large volumes of case law to uncover evolving legal principles, across multiple cases, on a given topic is a demanding task for legal professionals. Structured topical reports provide an effective solution by summarizing key issues, principles, and judgments, enabling comprehensive legal analysis on a particular topic. While prior works have advanced query-based individual case summarization, none have extended to automatically generating multi-case structured reports. To address this, we introduce LexGenie, an automated LLM-based pipeline designed to create structured reports using the entire body of case law on user-specified topics within the European Court of Human Rights jurisdiction. LexGenie retrieves, clusters, and organizes relevant passages by topic to generate a structured outline and cohesive content for each section. Expert evaluation confirms LexGenie’s utility in producing structured reports that enhance efficient, scalable legal analysis.

pdf bib abs
QABISAR: Query-Article Bipartite Interactions for Statutory Article Retrieval
Santosh T.y.s.s | Hassan Sarwat | Matthias Grabmair
Proceedings of the 31st International Conference on Computational Linguistics

In this paper, we introduce QABISAR, a novel framework for statutory article retrieval, to overcome the semantic mismatch problem when modeling each query-article pair in isolation, making it hard to learn representation that can effectively capture multi-faceted information. QABISAR leverages bipartite interactions between queries and articles to capture diverse aspects inherent in them. Further, we employ knowledge distillation to transfer enriched query representations from the graph network into the query bi-encoder, to capture the rich semantics present in the graph representations, despite absence of graph-based supervision for unseen queries during inference. Our experiments on a real-world expert-annotated dataset demonstrate its effectiveness.

pdf bib abs
RELexED: Retrieval-Enhanced Legal Summarization with Exemplar Diversity
Santosh T.y.s.s | Chen Jia | Patrick Goroncy | Matthias Grabmair
Findings of the Association for Computational Linguistics: NAACL 2025

This paper addresses the task of legal summarization, which involves distilling complex legal documents into concise, coherent summaries. Current approaches often struggle with content theme deviation and inconsistent writing styles due to their reliance solely on source documents. We propose RELexED, a retrieval-augmented framework that utilizes exemplar summaries along with the source document to guide the model. RELexED employs a two-stage exemplar selection strategy, leveraging a determinantal point process to balance the trade-off between similarity of exemplars to the query and diversity among exemplars, with scores computed via influence functions. Experimental results on two legal summarization datasets demonstrate that RELexED significantly outperforms models that do not utilize exemplars and those that rely solely on similarity-based exemplar selection.

pdf bib abs
CoPERLex: Content Planning with Event-based Representations for Legal Case Summarization
Santosh T.y.s.s | Youssef Farag | Matthias Grabmair
Findings of the Association for Computational Linguistics: NAACL 2025

Legal professionals often struggle with lengthy judgments and require efficient summarization for quick comprehension. To address this challenge, we investigate the need for structured planning in legal case summarization, particularly through event-centric representations that reflect the narrative nature of legal case documents. We propose our framework, CoPERLex, which operates in three stages: first, it performs content selection to identify crucial information from the judgment; second, the selected content is utilized to generate intermediate plans through event-centric representations modeled as Subject-Verb-Object tuples; and finally, it generates coherent summaries based on both the content and the structured plan. Our experiments on four legal summarization datasets demonstrate the effectiveness of integrating content selection and planning components, highlighting the advantages of event-centric plans over traditional entity-centric approaches in the context of legal judgements.

pdf bib abs
LeCoPCR: Legal Concept-guided Prior Case Retrieval for European Court of Human Rights cases
Santosh T.y.s.s | Isaac Misael Olguín Nolasco | Matthias Grabmair
Findings of the Association for Computational Linguistics: NAACL 2025

Prior case retrieval (PCR) is crucial for legal practitioners to find relevant precedent cases given the facts of a query case. Existing approaches often overlook the underlying semantic intent in determining relevance with respect to the query case. In this work, we propose LeCoPCR, a novel approach that explicitly generate intents in the form of legal concepts from a given query case facts and then augments the query with these concepts to enhance models understanding of semantic intent that dictates relavance. To overcome the unavailability of annotated legal concepts, We employ a weak supervision approach to extract key legal concepts from the reasoning section using Determinantal Point Process (DPP) to balance quality and diversity. Experimental results on the ECtHR-PCR dataset demonstrate the effectiveness of leveraging legal concepts and DPP-based key concept extraction.

pdf bib abs
AQuAECHR: Attributed Question Answering for European Court of Human Rights
Korbinian Q. Weidinger | Santosh T.y.s.s | Oana Ichim | Matthias Grabmair
Findings of the Association for Computational Linguistics: ACL 2025

LLMs have become prevalent tools for information seeking across various fields, including law. However, their generated responses often suffer from hallucinations, hindering their widespread adoption in high stakes domains such as law, which can potentially mislead experts and propagate societal harms. To enhance trustworthiness in these systems, one promising approach is to attribute the answer to an actual source, thereby improving the factuality and verifiability of the response. In pursuit of advancing attributed legal question answering, we introduce AQuAECHR, a benchmark comprising information-seeking questions from ECHR jurisprudence along with attributions to relevant judgments. We present strategies to automatically curate this dataset from ECHR case law guides and utilize an LLM-based filtering pipeline to improve dataset quality, as validated by legal experts. Additionally, we assess several LLMs, including those trained on legal corpora, on this dataset to underscore significant challenges with the current models and strategies dealing with attributed QA, both quantitatively and qualitatively.

pdf bib abs
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Shanshan Xu | Santosh T.y.s.s | Yanai Elazar | Quirin Vogel | Barbara Plank | Matthias Grabmair
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)

Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such concern is the emergence of political bias in LLM outputs. In this paper, we investigate the extent to which LLMs’ political leanings reflect memorized patterns from their pretraining corpora. We propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently we investigate to whom are the LLMs’ political leanings more aligned with, their pretrainig corpora or the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 U.S. Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data, and the methodology for auditing the memorization in LLMs to ensure human-AI alignment.

2024

pdf bib abs
Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification
Shanshan Xu | Santosh T.y.s.s | Oana Ichim | Barbara Plank | Matthias Grabmair
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, posing a difficulty for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, %as human-AI interaction systems become increasingly important, understanding the alignment of perceived difficulty between humans and AI systems is crucial to build trust. However, existing NLP calibration methods focus on a classifier’s awareness of predictive performance, measured against the human majority class, overlooking inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges’ vote distributions from the European Court of Human Rights (ECHR), and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as confidence- and human-calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the necessity for further research on measuring and enhancing model calibration considering HLV in legal decision tasks.

pdf bib abs
ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classification Tasks
Santosh T.y.s.s | Tuan-Quang Vuong | Matthias Grabmair
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This study investigates the challenges posed by the dynamic nature of legal multi-label text classification tasks, where legal concepts evolve over time. Existing models often overlook the temporal dimension in their training process, leading to suboptimal performance of those models over time, as they treat training data as a single homogeneous block. To address this, we introduce ChronosLex, an incremental training paradigm that trains models on chronological splits, preserving the temporal order of the data. However, this incremental approach raises concerns about overfitting to recent data, prompting an assessment of mitigation strategies using continual learning and temporal invariant methods. Our experimental results over six legal multi-label text classification datasets reveal that continual learning methods prove effective in preventing overfitting thereby enhancing temporal generalizability, while temporal invariant methods struggle to capture these dynamics of temporal shifts.

pdf bib abs
A Tale of Two Revisions: Summarizing Changes Across Document Versions
Santosh T.y.s.s | Natwar Modani | Apoorv Saxena
Findings of the Association for Computational Linguistics: ACL 2024

Document revision is a crucial aspect of the writing process, particularly in collaborative environments where multiple authors contribute simultaneously. However, current tools lack an efficient way to provide a comprehensive overview of changes between versions, leading to difficulties in understanding revisions. To address this, we propose a novel task of providing thematic summary of changes between document versions, organizing individual edits based on shared themes. We assess capabilities of LLMs on this task and further introduce three strategies to tackle this task: (i) representing the input of two documents along with edits in the ‘diff’ format (ii) a two-stage task decomposition with individual edit description generation as an intermediate task and (iii) clustering based chunking and subsequent merging techniques for handling longer documents. Our experiments demonstrate the effectiveness of our approach in improving the model’s capacity to handle this complex task. Additionally, we introduce ChangeSumm, a curated dataset comprising human-written thematic summaries for pairs of document versions, to facilitate evaluation and further research in this direction.

pdf bib abs
The Craft of Selective Prediction: Towards Reliable Case Outcome Classification - An Empirical Study on European Court of Human Rights Cases
Santosh T.y.s.s | Irtiza Chowdhury | Shanshan Xu | Matthias Grabmair
Findings of the Association for Computational Linguistics: EMNLP 2024

In high-stakes decision-making tasks within legal NLP, such as Case Outcome Classification (COC), quantifying a model’s predictive confidence is crucial. Confidence estimation enables humans to make more informed decisions, particularly when the model’s certainty is low, or where the consequences of a mistake are significant. However, most existing COC works prioritize high task performance over model reliability. This paper conducts an empirical investigation into how various design choices—including pre-training corpus, confidence estimator and fine-tuning loss—affect the reliability of COC models within the framework of selective prediction. Our experiments on the multi-label COC task, focusing on European Court of Human Rights (ECtHR) cases, highlight the importance of a diverse yet domain-specific pre-training corpus for better calibration. Additionally, we demonstrate that larger models tend to exhibit overconfidence, Monte Carlo dropout methods produce reliable confidence estimates, and confident error regularization effectively mitigates overconfidence. To our knowledge, this is the first systematic exploration of selective prediction in legal NLP. Our findings underscore the need for further research on enhancing confidence measurement and improving the trustworthiness of models in the legal domain.

pdf bib abs
Incorporating Precedents for Legal Judgement Prediction on European Court of Human Rights Cases
Santosh T.y.s.s | Mohamed Hesham Elganayni | Stanisław Sójka | Matthias Grabmair
Findings of the Association for Computational Linguistics: EMNLP 2024

Inspired by the legal doctrine of stare decisis, which leverages precedents (prior cases) for informed decision-making, we explore methods to integrate them into LJP models. To facilitate precedent retrieval, we train a retriever with a fine-grained relevance signal based on the overlap ratio of alleged articles between cases. We investigate two strategies to integrate precedents: direct incorporation at inference via label interpolation based on case proximity and during training via a precedent fusion module using a stacked-cross attention model. We employ joint training of the retriever and LJP models to address latent space divergence between them. Our experiments on LJP tasks from the ECHR jurisdiction reveal that integrating precedents during training coupled with joint training of the retriever and LJP model, outperforms models without precedents or with precedents incorporated only at inference, particularly benefiting sparser articles.

pdf bib abs
HiCuLR: Hierarchical Curriculum Learning for Rhetorical Role Labeling of Legal Documents
Santosh T.y.s.s | Apolline Isaia | Shiyu Hong | Matthias Grabmair
Findings of the Association for Computational Linguistics: EMNLP 2024

Rhetorical Role Labeling (RRL) of legal documents is pivotal for various downstream tasks such as summarization, semantic case search and argument mining. Existing approaches often overlook the varying difficulty levels inherent in legal document discourse styles and rhetorical roles. In this work, we propose HiCuLR, a hierarchical curriculum learning framework for RRL. It nests two curricula: Rhetorical Role-level Curriculum (RC) on the outer layer and Document-level Curriculum (DC) on the inner layer. DC categorizes documents based on their difficulty, utilizing metrics like deviation from a standard discourse structure and exposes the model to them in an easy-to-difficult fashion. RC progressively strengthens the model to discern coarse-to-fine-grained distinctions between rhetorical roles. Our experiments on four RRL datasets demonstrate the efficacy of HiCuLR, highlighting the complementary nature of DC and RC.

pdf bib abs
CuSINeS: Curriculum-driven Structure Induced Negative Sampling for Statutory Article Retrieval
Santosh T.y.s.s. | Kristina Kaiser | Matthias Grabmair
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we introduce CuSINeS, a negative sampling approach to enhance the performance of Statutory Article Retrieval (SAR). CuSINeS offers three key contributions. Firstly, it employs a curriculum-based negative sampling strategy guiding the model to focus on easier negatives initially and progressively tackle more difficult ones. Secondly, it leverages the hierarchical and sequential information derived from the structural organization of statutes to evaluate the difficulty of samples. Lastly, it introduces a dynamic semantic difficulty assessment using the being-trained model itself, surpassing conventional static methods like BM25, adapting the negatives to the model’s evolving competence. Experimental results on a real-world expert-annotated SAR dataset validate the effectiveness of CuSINeS across four different baselines, demonstrating its versatility.

pdf bib abs
ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights
Santosh T.y.s.s. | Rashid Haddad | Matthias Grabmair
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In common law jurisdictions, legal practitioners rely on precedents to construct arguments, in line with the doctrine of stare decisis. As the number of cases grow over the years, prior case retrieval (PCR) has garnered significant attention. Besides lacking real-world scale, existing PCR datasets do not simulate a realistic setting, because their queries use complete case documents while only masking references to prior cases. The query is thereby exposed to legal reasoning not yet available when constructing an argument for an undecided case as well as spurious patterns left behind by citation masks, potentially short-circuiting a comprehensive understanding of case facts and legal principles. To address these limitations, we introduce a PCR dataset based on judgements from the European Court of Human Rights (ECtHR), which explicitly separate facts from arguments and exhibit precedential practices, aiding us to develop this PCR dataset to foster systems’ comprehensive understanding. We benchmark different lexical and dense retrieval approaches with various negative sampling strategies, adapting them to deal with long text sequences using hierarchical variants. We found that difficulty-based negative sampling strategies were not effective for the PCR task, highlighting the need for investigation into domain-specific difficulty criteria. Furthermore, we observe performance of the dense models degrade with time and calls for further research into temporal adaptation of retrieval models. Additionally, we assess the influence of different views , Halsbury’s and Goodhart’s, in practice in ECtHR jurisdiction using PCR task.

pdf bib abs
LexAbSumm: Aspect-based Summarization of Legal Decisions
Santosh T.y.s.s. | Mahmoud Aly | Matthias Grabmair
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Legal professionals frequently encounter long legal judgments that hold critical insights for their work. While recent advances have led to automated summarization solutions for legal documents, they typically provide generic summaries, which may not meet the diverse information needs of users. To address this gap, we introduce LexAbSumm, a novel dataset designed for aspect-based summarization of legal case decisions, sourced from the European Court of Human Rights jurisdiction. We evaluate several abstractive summarization models tailored for longer documents on LexAbSumm, revealing a challenge in conditioning these models to produce aspect-specific summaries. We release LexAbSum to facilitate research in aspect-based summarization for legal domain.

pdf bib abs
Mind Your Neighbours: Leveraging Analogous Instances for Rhetorical Role Labeling for Legal Documents
Santosh T.y.s.s. | Hassan Sarwat | Ahmed Mohamed Abdelaal Abdou | Matthias Grabmair
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Rhetorical Role Labeling (RRL) of legal judgments is essential for various tasks, such as case summarization, semantic search and argument mining. However, it presents challenges such as inferring sentence roles from context, interrelated roles, limited annotated data, and label imbalance. This study introduces novel techniques to enhance RRL performance by leveraging knowledge from semantically similar instances (neighbours). We explore inference-based and training-based approaches, achieving remarkable improvements in challenging macro-F1 scores. For inference-based methods, we explore interpolation techniques that bolster label predictions without re-training. While in training-based methods, we integrate prototypical learning with our novel discourse-aware contrastive method that work directly on embedding spaces. Additionally, we assess the cross-domain applicability of our methods, demonstrating their effectiveness in transferring knowledge across diverse legal domains.

pdf bib abs
Query-driven Relevant Paragraph Extraction from Legal Judgments
Santosh T.y.s.s. | Elvin A. Quero Hernandez | Matthias Grabmair
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Legal professionals often grapple with navigating lengthy legal judgements to pinpoint information that directly address their queries. This paper focus on this task of extracting relevant paragraphs from legal judgements based on the query. We construct a specialized dataset for this task from the European Court of Human Rights (ECtHR) using the case law guides. We assess the performance of current retrieval models in a zero-shot way and also establish fine-tuning benchmarks using various models. The results highlight the significant gap between fine-tuned and zero-shot performance, emphasizing the challenge of handling distribution shift in the legal domain. We notice that the legal pre-training handles distribution shift on the corpus side but still struggles on query side distribution shift, with unseen legal queries. We also explore various Parameter Efficient Fine-Tuning (PEFT) methods to evaluate their practicality within the context of information retrieval, shedding light on the effectiveness of different PEFT methods across diverse configurations with pre-training and model architectures influencing the choice of PEFT method.

pdf bib abs
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset
Santosh T.y.s.s. | Nina Baumgartner | Matthias Stürmer | Matthias Grabmair | Joel Niklaus
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The assessment of explainability in Legal Judgement Prediction (LJP) systems is of paramount importance in building trustworthy and transparent systems, particularly considering the reliance of these systems on factors that may lack legal relevance or involve sensitive attributes. This study delves into the realm of explainability and fairness in LJP models, utilizing Swiss Judgement Prediction (SJP), the only available multilingual LJP dataset. We curate a comprehensive collection of rationales that ‘support’ and ‘oppose’ judgement from legal experts for 108 cases in German, French, and Italian. By employing an occlusion-based explainability approach, we evaluate the explainability performance of state-of-the-art monolingual and multilingual BERT-based LJP models, as well as models developed with techniques such as data augmentation and cross-lingual transfer, which demonstrated prediction performance improvement. Notably, our findings reveal that improved prediction performance does not necessarily correspond to enhanced explainability performance, underscoring the significance of evaluating models from an explainability perspective. Additionally, we introduce a novel evaluation framework, Lower Court Insertion (LCI), which allows us to quantify the influence of lower court information on model predictions, exposing current models’ biases.

pdf bib abs
Beyond Borders: Investigating Cross-Jurisdiction Transfer in Legal Case Summarization
Santosh T.y.s.s | Vatsal Venkatkrishna | Saptarshi Ghosh | Matthias Grabmair
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Legal professionals face the challenge of managing an overwhelming volume of lengthy judgments, making automated legal case summarization crucial. However, prior approaches mainly focused on training and evaluating these models within the same jurisdiction. In this study, we explore the cross-jurisdictional generalizability of legal case summarization models. Specifically, we explore how to effectively summarize legal cases of a target jurisdiction where reference summaries are not available. In particular, we investigate whether supplementing models with unlabeled target jurisdiction corpus and extractive silver summaries obtained from unsupervised algorithms on target data enhances transfer performance. Our comprehensive study on three datasets from different jurisdictions highlights the role of pre-training in improving transfer performance. We shed light on the pivotal influence of jurisdictional similarity in selecting optimal source datasets for effective transfer. Furthermore, our findings underscore that incorporating unlabeled target data yields improvements in general pre-trained models, with additional gains when silver summaries are introduced. This augmentation is especially valuable when dealing with extractive datasets and scenarios featuring limited alignment between source and target jurisdictions. Our study provides key insights for developing adaptable legal case summarization systems, transcending jurisdictional boundaries.

pdf bib abs
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English
Santosh T.y.s.s | Cornelius Weiss | Matthias Grabmair
Proceedings of the Natural Legal Language Processing Workshop 2024

In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing Legal NLP benchmarks only focus on predictive tasks, overlooking generative tasks. This work curates LexSumm, a benchmark designed for evaluating legal summarization tasks in English. It comprises eight English legal summarization datasets, from diverse jurisdictions, such as the US, UK, EU and India. Additionally, we release LexT5, legal oriented sequence-to-sequence model, addressing the limitation of the existing BERT-style encoder-only models in the legal domain. We assess its capabilities through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. Our analysis reveals abstraction and faithfulness errors even in summaries generated by zero-shot LLMs, indicating opportunities for further improvements. LexSumm benchmark and LexT5 model are available at https://github.com/TUMLegalTech/LexSumm-LexT5.

pdf bib abs
Towards Supporting Legal Argumentation with NLP: Is More Data Really All You Need?
Santosh T.y.s.s | Kevin Ashley | Katie Atkinson | Matthias Grabmair
Proceedings of the Natural Legal Language Processing Workshop 2024

Modeling legal reasoning and argumentation justifying decisions in cases has always been central to AI & Law, yet contemporary developments in legal NLP have increasingly focused on statistically classifying legal conclusions from text. While conceptually “simpler’, these approaches often fall short in providing usable justifications connecting to appropriate legal concepts. This paper reviews both traditional symbolic works in AI & Law and recent advances in legal NLP, and distills possibilities of integrating expert-informed knowledge to strike a balance between scalability and explanation in symbolic vs. data-driven approaches. We identify open challenges and discuss the potential of modern NLP models and methods that integrate conceptual legal knowledge.

pdf bib abs
PrivaT5: A Generative Language Model for Privacy Policies
Mohammad Zoubi | Santosh T.y.s.s | Edgar Rosas | Matthias Grabmair
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing

In the era of of digital privacy, users often neglect to read privacy policies due to their complexity. To bridge this gap, NLP models have emerged to assist in understanding privacy policies. While recent generative language models like BART and T5 have shown prowess in text generation and discriminative tasks being framed as generative ones, their application to privacy policy domain tasks remains unexplored. To address that, we introduce PrivaT5, a T5-based model that is further pre-trained on privacy policy text. We evaluate PrivaT5 over a diverse privacy policy related tasks and notice its superior performance over T5, showing the utility of continued domain-specific pre-training. Our results also highlight challenges faced by these generative models in complex structured output label space, especially in sequence tagging tasks, where they fall short compared to lighter encoder-only models.

2023

pdf bib abs
Leveraging Task Dependency and Contrastive Learning for Case Outcome Classification on European Court of Human Rights Cases
Santosh T.y.s.s | Marcel Perez San Blas | Phillip Kemper | Matthias Grabmair
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We report on an experiment in case outcome classification on European Court of Human Rights cases where our model first learns to identify the convention articles allegedly violated by the state from case facts descriptions, and subsequently uses that information to classify whether the court finds a violation of those articles. We assess the dependency between these two tasks at the feature and outcome level. Furthermore, we leverage a hierarchical contrastive loss to pull together article-specific representations of cases at the higher level, leading to distinctive article clusters. The cases in each article cluster are further pulled closer based on their outcome, leading to sub-clusters of cases with similar outcomes. Our experiment results demonstrate that, given a static pre-trained encoder, our models produce a small but consistent improvement in classification performance over single-task and joint models without contrastive loss.

pdf bib abs
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification
Shanshan Xu | Santosh T.y.s.s | Oana Ichim | Isabella Risini | Barbara Plank | Matthias Grabmair
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RaVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in the legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainablility of state-of-the-art COC models on RaVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case’s facts supposedly relevant for its outcome.

pdf bib abs
VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights
Shanshan Xu | Leon Staufer | Santosh T.y.s.s | Oana Ichim | Corina Heri | Matthias Grabmair
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recognizing vulnerability is crucial for understanding and implementing targeted support to empower individuals in need. This is especially important at the European Court of Human Rights (ECtHR), where the court adapts Convention standards to meet actual individual needs and thus to ensure effective human rights protection. However, the concept of vulnerability remains elusive at the ECtHR and no prior NLP research has dealt with it. To enable future research in this area, we present VECHR, a novel expert-annotated multi-label dataset comprising of vulnerability type classification and explanation rationale. We benchmark the performance of state-of-the-art models on VECHR from both prediction and explainability perspective. Our results demonstrate the challenging nature of task with lower prediction performance and limited agreement between models and experts. Further, we analyze the robustness of these models in dealing with out-of-domain (OOD) data and observe overall limited performance. Our dataset poses unique challenges offering a significant room for improvement regarding performance, explainability and robustness.

pdf bib abs
Zero-shot Transfer of Article-aware Legal Outcome Classification for European Court of Human Rights Cases
Santosh T.y.s.s | Oana Ichim | Matthias Grabmair
Findings of the Association for Computational Linguistics: EACL 2023

In this paper, we cast Legal Judgment Prediction on European Court of Human Rights cases into an article-aware classification task, where the case outcome is classified from a combined input of case facts and convention articles. This configuration facilitates the model learning some legal reasoning ability in mapping article text to specific case fact text. It also provides an opportunity to evaluate the model’s ability to generalize to zero-shot settings when asked to classify the case outcome with respect to articles not seen during training. We devise zero-shot experiments and apply domain adaptation methods based on domain discrimination and Wasserstein distance. Our results demonstrate that the article-aware architecture outperforms straightforward fact classification. We also find that domain adaptation methods improve zero-shot transfer performance, with article relatedness and encoder pre-training influencing the effect.

2022

pdf bib abs
Deconfounding Legal Judgment Prediction for European Court of Human Rights Cases Towards Better Alignment with Experts
Santosh T.y.s.s | Shanshan Xu | Oana Ichim | Matthias Grabmair
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This work demonstrates that Legal Judgement Prediction systems without expert-informed adjustments can be vulnerable to shallow, distracting surface signals that arise from corpus construction, case distribution, and confounding factors. To mitigate this, we use domain expertise to strategically identify statistically predictive but legally irrelevant information. We adopt adversarial training to prevent the system from relying on it. We evaluate our deconfounded models by employing interpretability techniques and comparing to expert annotations. Quantitative experiments and qualitative analysis show that our deconfounded model consistently aligns better with expert rationales than baselines trained for prediction only. We further contribute a set of reference expert annotations to the validation and testing partitions of an existing benchmark dataset of European Court of Human Rights cases.

2020

pdf bib abs
SaSAKE: Syntax and Semantics Aware Keyphrase Extraction from Research Papers
Santosh T.y.s.s | Debarshi Kumar Sanyal | Plaban Kumar Bhowmick | Partha Pratim Das
Proceedings of the 28th International Conference on Computational Linguistics

Keyphrases in a research paper succinctly capture the primary content of the paper and also assist in indexing the paper at a concept level. Given the huge rate at which scientific papers are published today, it is important to have effective ways of automatically extracting keyphrases from a research paper. In this paper, we present a novel method, Syntax and Semantics Aware Keyphrase Extraction (SaSAKE), to extract keyphrases from research papers. It uses a transformer architecture, stacking up sentence encoders to incorporate sequential information, and graph encoders to incorporate syntactic and semantic dependency graph information. Incorporation of these dependency graphs helps to alleviate long-range dependency problems and identify the boundaries of multi-word keyphrases effectively. Experimental results on three benchmark datasets show that our proposed method SaSAKE achieves state-of-the-art performance in keyphrase extraction from scientific papers.

2019

pdf bib abs
Incorporating Domain Knowledge into Medical NLI using Knowledge Graphs
Soumya Sharma | Bishal Santra | Abhik Jana | Santosh T.y.s.s | Niloy Ganguly | Pawan Goyal
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recently, biomedical version of embeddings obtained from language models such as BioELMo have shown state-of-the-art results for the textual inference task in the medical domain. In this paper, we explore how to incorporate structured domain knowledge, available in the form of a knowledge graph (UMLS), for the Medical NLI task. Specifically, we experiment with fusing embeddings obtained from knowledge graph with the state-of-the-art approaches for NLI task (ESIM model). We also experiment with fusing the domain-specific sentiment information for the task. Experiments conducted on MedNLI dataset clearly show that this strategy improves the baseline BioELMo architecture for the Medical NLI task.

pdf bib abs
AttentiveChecker: A Bi-Directional Attention Flow Mechanism for Fact Verification
Santosh T.y.s.s | Vishal G | Avirup Saha | Niloy Ganguly
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The recently released FEVER dataset provided benchmark results on a fact-checking task in which given a factual claim, the system must extract textual evidence (sets of sentences from Wikipedia pages) that support or refute the claim. In this paper, we present a completely task-agnostic pipelined system, AttentiveChecker, consisting of three homogeneous Bi-Directional Attention Flow (BIDAF) networks, which are multi-layer hierarchical networks that represent the context at different levels of granularity. We are the first to apply to this task a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. AttentiveChecker can be used to perform document retrieval, sentence selection, and claim verification. Experiments on the FEVER dataset indicate that AttentiveChecker is able to achieve the state-of-the-art results on the FEVER test set.

2018

pdf bib abs
Deep Learning for Social Media Health Text Classification
Santosh T.y.s.s | Santosh Tokala | Vaibhav Gambhir | Animesh Mukherjee
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper describes the systems developed for 1st and 2nd tasks of the 3rd Social Media Mining for Health Applications Shared Task at EMNLP 2018. The first task focuses on automatic detection of posts mentioning a drug name or dietary supplement, a binary classification. The second task is about distinguishing the tweets that present personal medication intake, possible medication intake and non-intake. We performed extensive experiments with various classifiers like Logistic Regression, Random Forest, SVMs, Gradient Boosted Decision Trees (GBDT) and deep learning architectures such as Long Short-Term Memory Networks (LSTM), jointed Convolutional Neural Networks (CNN) and LSTM architecture, and attention based LSTM architecture both at word and character level. We have also explored using various pre-trained embeddings like Global Vectors for Word Representation (GloVe), Word2Vec and task-specific embeddings learned using CNN-LSTM and LSTMs.