Xiaowen Lin


2026

Not Safe for Work (NSFW) image classifiers play a critical role in safeguarding text-to-image (T2I) systems. However, a concerning phenomenon, dubbed "*context shifts*," has emerged in T2I systems: changes in text prompts that manipulate benign image elements can cause NSFW classifiers to fail to detect the resulting images. For instance, while an NSFW image of "*a nude person in an empty scene*" can be easily blocked by most NSFW classifiers, a stealthier one that depicts "*a nude person blending in a group of dressed people*" may evade detection. We ask: how can we systematically reveal NSFW image classifiers’ failures under such context shifts? Toward this end, we present an automated red-teaming framework that leverages a set of generative AI tools. We propose an **exploration-exploitation** approach: **First**, in the *exploration* stage, we synthesize a large and diverse dataset of 36K NSFW images that facilitates our study of context shifts. We find that varying fractions (e.g., 4.1% to 36% nude and sexual content) of the dataset are misclassified by NSFW image classifiers like GPT-4o and Gemini. **Second**, in the *exploitation* stage, we leverage these failure cases to train a specialized LLM that rewrites unseen seed prompts into more evasive versions, increasing the likelihood of detection evasion by up to 6 times. Alarmingly, we show **these failures translate to real-world T2I and even T2V systems** like DALL-E 3, Sora, Nano Banana, and Veo 3 – beyond the open-weight image generators in our main study. For example, querying DALL-E 3 with prompts rewritten by our approach increases the chance of obtaining NSFW images from 0% to over 50%.

2019

It is challenging for current one-step retrieve-and-read question answering (QA) systems to answer questions like “Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?” because the question seldom contains retrievable clues about the missing entity (here, the author). Answering such a question requires multi-hop reasoning where one must gather information about the missing entity (or facts) to proceed with further reasoning. We present GoldEn (Gold Entity) Retriever, which iterates between reading context and retrieving more supporting documents to answer open-domain multi-hop questions. Instead of using opaque and computationally expensive neural retrieval models, GoldEn Retriever generates natural language search queries given the question and available context, and leverages off-the-shelf information retrieval systems to query for missing entities. This allows GoldEn Retriever to scale up efficiently for open-domain multi-hop reasoning while maintaining interpretability. We evaluate GoldEn Retriever on the recently proposed open-domain multi-hop QA dataset, HotpotQA, and demonstrate that it outperforms the best previously published model despite not using pretrained language models such as BERT.
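The iterative retrieve-and-read loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-document toy corpus, the term-overlap `search` function standing in for an off-the-shelf IR system, and the hand-written `generate_query` heuristic standing in for the learned query generator. It is not the GoldEn Retriever implementation.

```python
import re
from collections import Counter

# Toy corpus standing in for an off-the-shelf IR index (illustrative only).
CORPUS = {
    "Armada (novel)": "Armada is a science fiction novel by Ernest Cline.",
    "Ernest Cline": "Ernest Cline wrote Ready Player One, adapted as a feature film by Steven Spielberg.",
    "Jaws (film)": "Jaws is a 1975 thriller film directed by Steven Spielberg.",
}

QUESTION = ("Which novel by the author of 'Armada' will be adapted "
            "as a feature film by Steven Spielberg?")


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def search(query, exclude):
    """Return the title of the best unseen document by term overlap, or None."""
    q = Counter(tokenize(query))
    best_title, best_score = None, 0
    for title, body in CORPUS.items():
        if title in exclude:
            continue
        d = Counter(tokenize(title + " " + body))
        score = sum(min(count, d[word]) for word, count in q.items())
        if score > best_score:
            best_title, best_score = title, score
    return best_title


def generate_query(question, context):
    """Hypothetical stand-in for a learned query generator.

    First hop: use the quoted span in the question.
    Later hops: use the capitalized name following "by" in the latest document.
    Falls back to the full question if neither pattern matches.
    """
    if not context:
        match = re.search(r"'([^']+)'", question)
    else:
        match = re.search(r"\bby ([A-Z]\w+(?: [A-Z]\w+)*)", CORPUS[context[-1]])
    return match.group(1) if match else question


def retrieve_multi_hop(question, hops=2):
    """Alternate between generating a query from the context and retrieving."""
    context = []
    for _ in range(hops):
        query = generate_query(question, context)
        title = search(query, exclude=context)
        if title is None:
            break
        context.append(title)
    return context


if __name__ == "__main__":
    print(retrieve_multi_hop(QUESTION))
```

The first hop queries for the entity named in the question, and the second hop queries for the missing entity (the author) found in the first retrieved document. Because each query is plain text, the retrieval path can be inspected at every hop.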