Xinyi Xu
2026
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
Yanlin Wang | Ziyao Zhang | Chong Wang | Xinyi Xu | Mingwei Liu | Yong Wang | Jiachi Chen | Zibin Zheng
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and exhibiting a wide diversity of data flow complexities, including vulnerabilities with up to 34-hop inter-procedural dependencies. Using RealSec-bench, we conduct an extensive empirical study on 5 popular LLMs. We introduce a novel composite metric, SecurePass@K, to assess both functional correctness and security simultaneously. We find that while Retrieval-Augmented Generation (RAG) techniques can improve functional correctness, they provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs. Our code and data are available at https://github.com/DeepSoftwareAnalytics/Realsec-code-Bench.
JanusMM: A Benchmark for Self-Deprecation Understanding in Real-World Multimodal Conversations
Xinyi Xu | Bingguang Hao | Yongyi Xiong | Zimo Chen | Xinchen Liu | Hongxin Guo | Xuelong Wang | Silin Zhou | Shihan Dou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Self-deprecation is a prevalent communicative strategy in human society, often using image-text interplay to express emotions and intentions. Although self-deprecation is widespread in real-world conversations, the ability of multimodal large language models (MLLMs) to understand it remains underexplored. To fill this gap, we introduce **JanusMM**, the first benchmark designed to evaluate MLLMs’ understanding of self-deprecation in real-world conversations. JanusMM contains 2,016 bilingual memes from three types of social interactions and provides a dual-task evaluation framework with six new metrics. The first task assesses MLLMs’ abilities in self-deprecation recognition and reasoning, while the second task evaluates the consistency of their understanding by simulating the perspectives of the initiator and responder. We evaluate ten frontier MLLMs and find that they exhibit weak recognition and reasoning abilities, with their understanding of self-deprecation remaining inconsistent across both perspectives.