Angelina Wang
2026
SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Laya Iyer | Angelina Wang | Sanmi Koyejo
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Advances in large language models (LLMs) have enabled significant audio processing capabilities, giving rise to state-of-the-art models known as Large Audio Language Models (LALMs). However, little work has measured audio understanding beyond automatic speech recognition (ASR). This paper closes that gap with SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), a benchmark suite that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. In addition to task performance, we also measure model latency. The suite assesses audio beyond just which words are said, covering how they are said as well as the non-speech components of the audio. To strengthen ecological validity, we include a small human-recorded evaluation split per category. Grounded in the needs articulated by two audio understanding use cases, accessibility technology and industrial noise monitoring, the benchmark reveals critical gaps in current LALMs. Performance varies widely across tasks: some fall far below random chance while others reach high accuracy. We also provide a structured error taxonomy that characterizes common failure modes across tasks. These results provide direction for targeted improvements in model capabilities.
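To illustrate the kind of evaluation the abstract describes (per-category accuracy plus model latency), here is a minimal sketch. The model interface (model.answer), the dataset fields, and the category keys are assumptions for illustration, not SCENEBench's actual API or data format.

```python
import time
from collections import defaultdict

# Hypothetical interface: `model.answer(audio, question)` returns a string.
# The example fields ("category", "audio", "question", "answer") are assumed
# for illustration; the real benchmark's schema may differ.
CATEGORIES = [
    "background_sound",      # background sound understanding
    "noise_localization",    # spatial noise localization
    "cross_lingual_speech",  # cross-linguistic speech understanding
    "vocal_characterizer",   # vocal characterizer recognition
]

def evaluate(model, dataset):
    """Aggregate per-category accuracy and mean latency (seconds)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    latency = defaultdict(float)
    for example in dataset:
        start = time.perf_counter()
        prediction = model.answer(example["audio"], example["question"])
        latency[example["category"]] += time.perf_counter() - start
        total[example["category"]] += 1
        correct[example["category"]] += int(
            prediction.strip().lower() == example["answer"].strip().lower()
        )
    return {
        cat: {
            "accuracy": correct[cat] / total[cat],
            "mean_latency_s": latency[cat] / total[cat],
        }
        for cat in total
    }
```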
2025
Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Angelina Wang | Michelle Phan | Daniel E. Ho | Sanmi Koyejo
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Algorithmic fairness has conventionally adopted the mathematically convenient perspective of racial color-blindness (i.e., difference-unaware treatment). However, we contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., referring to girls as “terrorists” may be less harmful than referring to Muslim people as such). Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently when it is contextually appropriate to do so. We first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios, totaling 16k questions, that enables us to assess difference awareness. Finally, results across ten models demonstrate that difference awareness is a distinct dimension of fairness where existing bias mitigation strategies may backfire.
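As a rough illustration of scoring a benchmark split by the three categories the abstract names, here is a minimal sketch. The question fields ("prompt", "type", "gold") and the model_fn interface are hypothetical, introduced only to make the idea concrete; they are not the paper's actual evaluation code.

```python
from collections import defaultdict

# Hypothetical scoring sketch: each question carries a benchmark type
# ("descriptive", "normative", or "correlation") and a gold answer that may
# legitimately differ across groups. Field names are illustrative only.
def difference_awareness_scores(model_fn, questions):
    """Fraction of questions answered with the contextually appropriate
    (possibly group-differentiated) gold answer, grouped by benchmark type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        answer = model_fn(q["prompt"])  # model_fn: str -> str, assumed
        total[q["type"]] += 1
        correct[q["type"]] += int(answer.strip() == q["gold"].strip())
    return {t: correct[t] / total[t] for t in total}
```

Reporting the three categories separately matches the abstract's point that each requires its own interpretation and mitigation, rather than a single aggregate fairness score.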