Wu Minghui


2025

Addressing the limitations of the Skip-gram with Negative Sampling (SGNS) model related to negative sampling, subsampling, and its fixed context window mechanism, this paper first presents an in-depth statistical analysis of the optimal solution for SGNS matrix factorization, deriving the theoretically optimal distribution for negative sampling. Building upon this analysis, we propose the concept of Global Semantic Weight (GSW), derived from Pointwise Mutual Information (PMI). We integrate GSW with word frequency information to improve the effectiveness of both negative sampling and subsampling. Furthermore, we design dynamic adjustment mechanisms for the context window size and the number of negative samples based on GSW, enabling the model to adaptively capture contextual information commensurate with the semantic importance of the center word. Notably, our optimized model maintains the same time complexity as the original SGNS implementation. Experimental results demonstrate that our proposed model achieves competitive performance against state-of-the-art word embedding models, including SGNS, CBOW, and GloVe, across multiple benchmark tasks. Compared with current mainstream dynamic word vector models, this work emphasizes achieving a balance between efficiency and performance within a static embedding framework, and provides a potential complement and support for complex models such as LLMs.
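The Global Semantic Weight above is derived from Pointwise Mutual Information. As a point of reference, a minimal sketch of the underlying PMI computation over co-occurrence counts is given below; the function and argument names are hypothetical and this is not the paper's implementation.

```python
import math

def pmi(pair_counts, word_counts, total_pairs, w, c):
    """Pointwise Mutual Information: PMI(w, c) = log( P(w, c) / (P(w) P(c)) ).

    pair_counts maps (center, context) pairs to co-occurrence counts;
    word_counts maps each word to its total count over all pairs.
    """
    p_wc = pair_counts[(w, c)] / total_pairs  # joint probability estimate
    p_w = word_counts[w] / total_pairs        # marginal for the center word
    p_c = word_counts[c] / total_pairs        # marginal for the context word
    return math.log(p_wc / (p_w * p_c))
```

A positive PMI indicates that the pair co-occurs more often than chance, which is the kind of signal a PMI-derived weight can exploit when reweighting sampling distributions.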

2021

Open Relation Extraction (OpenRE), which aims to extract relational facts from open-domain corpora, is a sub-task of Relation Extraction and a crucial upstream process for many other NLP tasks. However, various previous clustering-based OpenRE strategies either confine themselves to unsupervised paradigms or cannot directly build a unified relational semantic space, hence impacting downstream clustering. In this paper, we propose a novel supervised learning framework named MORE-RLL (Metric learning-based Open Relation Extraction with Ranked List Loss) to construct a semantic metric space by utilizing Ranked List Loss to discover new relational facts. Experiments on real-world datasets show that MORE-RLL achieves excellent performance compared with previous state-of-the-art methods, demonstrating the capability of MORE-RLL in unified semantic representation learning and novel relational fact detection.
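The metric space in MORE-RLL is shaped by Ranked List Loss, which pulls positives within a boundary and pushes negatives beyond a margin. A hedged, simplified pairwise sketch over precomputed distances is shown below; `alpha` and `margin` are assumed hyperparameter names, and this is not the paper's exact formulation.

```python
def ranked_list_loss(pos_dists, neg_dists, alpha=1.2, margin=0.4):
    """Simplified pairwise Ranked List Loss.

    Positives are penalized when farther than (alpha - margin) from the
    anchor; negatives are penalized when closer than alpha.
    """
    pos_term = sum(max(0.0, d - (alpha - margin)) for d in pos_dists)
    neg_term = sum(max(0.0, alpha - d) for d in neg_dists)
    return pos_term + neg_term
```

With distances satisfying both boundaries the loss is zero; violations on either side contribute linearly, which is what carves out the separated relational clusters described above.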