Wang Yutong


2025

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

2021

Open Relation Extraction (OpenRE) aiming to extract relational facts from open-domain cor-pora is a sub-task of Relation Extraction and a crucial upstream process for many other NLPtasks. However various previous clustering-based OpenRE strategies either confine themselves to unsupervised paradigms or can not directly build a unified relational semantic space henceimpacting down-stream clustering. In this paper we propose a novel supervised learning frame-work named MORE-RLL (Metric learning-based Open Relation Extraction with Ranked ListLoss) to construct a semantic metric space by utilizing Ranked List Loss to discover new rela-tional facts. Experiments on real-world datasets show that MORE-RLL can achieve excellent performance compared with previous state-of-the-art methods demonstrating the capability of MORE-RLL in unified semantic representation learning and novel relational fact detection.