Hongyu Sun


2025

pdf bib
Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?
Adam Nohejl | Frederikus Hudi | Eunike Andriani Kardinata | Shintaro Ozaki | Maria Angelica Riera Machin | Hongyu Sun | Justin Vasselli | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. We publicly release our code, the frequency lists, fastText word embeddings, and statistical language models.

2024

pdf bib
Applying Contrastive Learning to Code Vulnerability Type Classification
Chen Ji | Su Yang | Hongyu Sun | Yuqing Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Vulnerability classification is a crucial task in software security analysis, essential for identifying and mitigating potential security risks. Learning-based methods often perform poorly due to the long-tail distribution of vulnerability classification datasets. Recent approaches try to address the problem but treat each CWE class in isolation, ignoring their relationships. This results in non-scalable code vector representations, causing significant performance drops when handling complex real-world vulnerabilities. We propose a hierarchical contrastive learning framework for code vulnerability type classification to bring vector representations of related CWEs closer together. To address the issue of class collapse and enhance model robustness, we mix self-supervised contrastive learning loss into our loss function. Additionally, we employ max-pooling to enable the model to handle longer vulnerability code inputs. Extensive experiments demonstrate that our proposed framework outperforms state-of-the-art methods by 2.97%-17.90% on accuracy and 0.98%-22.27% on weighted-F1, with even better performance on higher-quality datasets. We also utilize an ablation study to prove each component’s contribution. These findings underscore the potential and advantages of our approach in the multi-class vulnerability classification task.