How Should We Model the Probability of a Language?

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot


Abstract
Of the more than 7,000 languages spoken in the world, commercial language identification (LID) systems reliably identify only a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
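To make the role of the prior concrete, the following is a minimal sketch of Bayes' rule applied to a two-language decision. It is an illustration only, not the authors' method: the labels `head` and `tail`, the function name `posterior`, and every number below are hypothetical values chosen to show how the same text likelihoods can yield different decisions under a global prior and under a locally informed one.

```python
def posterior(likelihoods, priors):
    """Return P(lang | text), proportional to P(text | lang) * P(lang)."""
    joint = {lang: likelihoods[lang] * priors.get(lang, 0.0) for lang in likelihoods}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("no language has both a nonzero prior and likelihood")
    return {lang: p / total for lang, p in joint.items()}


# Hypothetical likelihoods P(text | lang) for one short snippet:
# the text fits the tail language slightly better than the head language.
likelihoods = {"head": 0.020, "tail": 0.030}

# Hypothetical global, fixed prior: the head language dominates worldwide.
global_prior = {"head": 0.99, "tail": 0.01}

# Hypothetical local prior: environmental cues make the tail language plausible.
local_prior = {"head": 0.40, "tail": 0.60}

global_post = posterior(likelihoods, global_prior)
local_post = posterior(likelihoods, local_prior)

print("global prior ->", max(global_post, key=global_post.get))
print("local prior  ->", max(local_post, key=local_post.get))
```

Under these assumed numbers, the global prior outweighs the likelihood evidence and the head language is selected, while the local prior lets the same evidence select the tail language.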
Anthology ID:
2026.vardial-1.18
Volume:
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
223–233
URL:
https://aclanthology.org/2026.vardial-1.18/
Cite (ACL):
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2026. How Should We Model the Probability of a Language?. In Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 223–233, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
How Should We Model the Probability of a Language? (Dent et al., VarDial 2026)
PDF:
https://aclanthology.org/2026.vardial-1.18.pdf