Kenny Tsu Wei Choo

2026

BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
Bryan Chen Zhengyu Tan | Weihua Zheng | Zhengyuan Liu | Nancy F. Chen | Hwaran Lee | Kenny Tsu Wei Choo | Roy Ka-Wei Lee
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce ‘BLEnD-Vis‘, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, ‘BLEnD-Vis‘ constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region → Entity, (ii) an inverted text-only variant (Entity → Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice questions (MCQ) instances, validated through human annotation. ‘BLEnD-Vis‘ reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenges of robustly integrating textual and visual understanding, particularly in lower-resource regions. ‘BLEnD-Vis‘ thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis.

2025

pdf bib abs

Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
Yuriel Ryan | Rui Yang Tan | Kenny Tsu Wei Choo | Roy Ka-Wei Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

2024

pdf bib abs

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations
Yunze Xiao | Yujia Hu | Kenny Tsu Wei Choo | Roy Ka-Wei Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.

pdf bib abs

SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore
Ri Chi Ng | Nirmalendu Prakash | Ming Shan Hee | Kenny Tsu Wei Choo | Roy Ka-wei Lee
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

To address the limitations of current hate speech detection models, we introduce SGHateCheck, a novel framework designed for the linguistic and cultural context of Singapore and Southeast Asia. It extends the functional testing approach of HateCheck and MHC, employing large language models for translation and paraphrasing into Singapore’s main languages, and refining these with native annotators. SGHateCheck reveals critical flaws in state-of-the-art models, highlighting their inadequacy in sensitive content moderation. This work aims to foster the development of more effective hate speech detection tools for diverse linguistic environments, particularly for Singapore and Southeast Asia contexts.

Co-authors

Bryan Chen Zhengyu Tan 1

Rui Yang Tan 1

Yunze Xiao 1

Weihua Zheng 1

Venues

Fix author