Mark Hansen


2024

LegalDiscourse: Interpreting When Laws Apply and To Whom
Alexander Spangher | Zihan Xue | Te-Lin Wu | Mark Hansen | Jonathan May
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

While legal AI has made strides in recent years, it still struggles with basic legal concepts: _when_ does a law apply? _Who_ does it apply to? _What_ does it do? We take a _discourse_ approach to addressing these problems and introduce a novel taxonomy for span-and-relation parsing of legal texts. We create a dataset, _LegalDiscourse_, of 602 state-level law paragraphs consisting of 3,715 discourse spans and 1,671 relations. Our trained annotators have an agreement rate of κ > .8, yet few-shot GPT3.5 performs poorly at span identification and relation classification. Although fine-tuning improves performance, GPT3.5 still lags far below human level. We demonstrate the usefulness of our schema by creating a web application with journalists. We collect over 100,000 laws for 52 U.S. states and territories using 20 scrapers we built, and apply our trained models to 6,000 laws using U.S. Census population numbers. We describe two journalistic outputs stemming from this application: (1) an investigation into the increase in liquor licenses following population growth and (2) a decrease in applicable laws under different under-count projections.

2023

Does BERT Exacerbate Gender or L1 Biases in Automated English Speaking Assessment?
Alexander Kwako | Yixin Wan | Jieyu Zhao | Mark Hansen | Kai-Wei Chang | Li Cai
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

In English speaking assessment, pretrained large language models (LLMs) such as BERT can score constructed response items as accurately as human raters. Less research has investigated whether LLMs perpetuate or exacerbate biases, which would pose problems for the fairness and validity of the test. This study examines gender and native language (L1) biases in human and automated scores, using an off-the-shelf (OOS) BERT model. Analyses focus on a specific type of bias known as differential item functioning (DIF), which compares examinees of similar English language proficiency. Results show that there is a moderate amount of DIF based on examinees’ L1 background in grade band 9-12. DIF is higher when scored by an OOS BERT model, indicating that BERT may exacerbate this bias; however, in practical terms, the degree to which BERT exacerbates DIF is very small. Additionally, there is more DIF for longer speaking items and for older examinees, but BERT does not exacerbate these patterns of DIF.

2022

Using Item Response Theory to Measure Gender and Racial Bias of a BERT-based Automated English Speech Assessment System
Alexander Kwako | Yixin Wan | Jieyu Zhao | Kai-Wei Chang | Li Cai | Mark Hansen
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

Recent advances in natural language processing and transformer-based models have made it easier to implement accurate, automated English speech assessments. Yet, without careful examination, applications of these models may exacerbate social prejudices based on gender and race. This study addresses the need to examine potential biases of transformer-based models in the context of automated English speech assessment. For this purpose, we developed a BERT-based automated speech assessment system and investigated gender and racial bias of examinees’ automated scores. Gender and racial bias was measured by examining differential item functioning (DIF) using an item response theory framework. Preliminary results, which focused on a single verbal-response item, showed no statistically significant DIF based on gender or race for automated scores.