Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss

Galim Turumtaev


Abstract
Neural language models often struggle with low-resource languages because limited training data makes tokens from those languages rare in the training set. This paper addresses a specific challenge during training: rare tokens appear as negatives in the cross-entropy loss far more often than as targets, so they are disproportionately marginalized, which prevents their representations from being learned effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
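The abstract describes the mechanism only at a high level, so the sketch below is a hypothetical PyTorch illustration of one way such thresholded negative sampling could be implemented: negative logits already below a chosen threshold are masked out of the softmax normalization, so tokens the model has already suppressed receive no further downward gradient. The function name, the fixed threshold (the paper's version is adaptive), and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def thresholded_cross_entropy(logits, targets, threshold=0.0):
    # Keep the target logit plus any negative logit still above the
    # threshold; negatives already below it are excluded from the
    # softmax normalization, so they get no further downward
    # ("jostling") gradient from this training step.
    keep = logits > threshold                            # (batch, vocab) bool mask
    keep[torch.arange(targets.size(0)), targets] = True  # always keep the target
    masked = logits.masked_fill(~keep, float("-inf"))    # remove suppressed negatives
    return F.cross_entropy(masked, targets)

# Toy usage: 4 positions over a 100-token vocabulary.
logits = torch.randn(4, 100, requires_grad=True)
targets = torch.randint(0, 100, (4,))
loss = thresholded_cross_entropy(logits, targets, threshold=-1.0)
loss.backward()  # masked-out negatives receive zero gradient

Under this reading, a rare token that is almost never a target stops being pushed down once its logit falls below the threshold, which is one plausible way to realize the reduced marginalization the abstract describes.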
Anthology ID: 2025.loreslm-1.28
Volume: Proceedings of the First Workshop on Language Models for Low-Resource Languages
Month: January
Year: 2025
Address: Abu Dhabi, United Arab Emirates
Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venues: LoResLM | WS
Publisher: Association for Computational Linguistics
Pages: 373–386
URL: https://aclanthology.org/2025.loreslm-1.28/
Cite (ACL): Galim Turumtaev. 2025. Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 373–386, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss (Turumtaev, LoResLM 2025)
PDF: https://aclanthology.org/2025.loreslm-1.28.pdf