Malek Yaich

2025

Improving Accessibility of SCOTUS Opinions: A Benchmark Study and a New Dataset for Generic Heading Prediction and Specific Heading Generation
Malek Yaich | Nicolas Hernandez
Proceedings of the 31st International Conference on Computational Linguistics

The opinions of the U.S. Supreme Court (SCOTUS) are known for their extensive length, complex legal language, and lack of titled sections, which pose significant challenges for accessibility and comprehension. This paper defines the task of automatic section titling, which proposes both a generic and a specific heading for each section. Given the scarcity of sections with headings in SCOTUS opinions, we study the possibility of using data from lower courts to train models. We compiled a dataset of sections with generic or specific headings covering three courts (SCOTUS and two lower courts), and manually annotated a supplementary SCOTUS set with both types of titles. To establish a benchmark, we report the performance of different systems trained for each subtask. For generic heading prediction, we compare the performance of fine-tuning non-contextual, general, and domain-oriented pretrained language models. For specific heading generation, we consider Transformer-based sequence-to-sequence models. Our results show that a fine-tuned LegalBERT can achieve an F1 score of about 0.90 in predicting generic headings. They also show that BART and T5 perform similarly in generating specific headings and that, although this performance is good, there is still room for improvement. In addition, we provide a human assessment to support the generation experiment and show a quasi-linear correlation between human degrees of agreement and the results of conventional measures such as ROUGE and BERTScore.


We present COLaF, a project dedicated to collecting and developing natural language processing (NLP) tools and resources for French and the other languages of France, with particular attention to less-resourced languages and varieties. The project covers text, audio, and video data, in order to provide corpora and tools for written, spoken, and signed language. It includes the collection, normalization, and documentation of pre-existing data, including data that are currently inaccessible or unusable for research purposes, as well as the development of NLP tools adapted to these languages, such as tools for linguistic annotation and machine translation. This article presents the main challenges posed by the project and some initial results.

Co-authors

Panagiotis Tsolakis 1

Emmanuel Vincent 1
