Masha Medvedeva


pdf bib
Legal Judgment Prediction: If You Are Going to Do It, Do It Right
Masha Medvedeva | Pauline Mcbride
Proceedings of the Natural Legal Language Processing Workshop 2023

The field of Legal Judgment Prediction (LJP) has witnessed significant growth in the past decade, with over 100 papers published in the past three years alone. Our comprehensive survey of over 150 papers reveals a stark reality: only ~7% of published papers are doing what they set out to do - predict court decisions. We delve into the reasons behind the flawed and unreliable nature of the remaining experiments, emphasising their limited utility in the legal domain. We examine the distinctions between predicting court decisions and the practices of legal professionals in their daily work. We explore how a lack of attention to the identity and needs of end-users has fostered the misconception that LJP is a near-solved challenge suitable for practical application, and contributed to the surge in academic research in the field. To address these issues, we examine three different dimensions of ‘doing LJP right’: using data appropriate for the task; tackling explainability; and adopting an application-centric approach to model reporting and evaluation. We formulate a practical checklist of recommendations, delineating the characteristics that are required if a judgment prediction system is to be a valuable addition to the legal field.


pdf bib
When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish
Martin Kroon | Masha Medvedeva | Barbara Plank
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the results of our participation in the Discriminating between Dutch and Flemish in Subtitles VarDial 2018 shared task. We try techniques proven to work well for discriminating between language varieties as well as explore the potential of using syntactic features, i.e. hierarchical syntactic subtrees. We experiment with different combinations of features. Discriminating between these two languages turned out to be a very hard task, not only for a machine: human performance is only around 0.51 F1 score; our best system is still a simple Naive Bayes model with word unigrams and bigrams. The system achieved an F1 score (macro) of 0.62, which ranked us 4th in the shared task.