How to Distinguish Languages and Dialects

Søren Wichmann


Abstract
The terms “language” and “dialect” are ingrained, but linguists nevertheless tend to agree that it is impossible to apply a non-arbitrary distinction such that two speech varieties can be identified as either distinct languages or two dialects of one and the same language. A database of lexical information for more than 7,500 speech varieties, however, reveals a strong tendency for linguistic distances to be bimodally distributed. For a given language group, the linguistic distances pertaining to either cluster can be teased apart by identifying a mixture of normal distributions within the data and then separating them by fitting curves and finding the point where they cross. The thresholds identified are remarkably consistent across data sets, qualifying their mean as a universal criterion for distinguishing between language and dialect pairs. The mean of the thresholds identified translates into a temporal distance of around one to one-and-a-half millennia (1,075–1,635 years).
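The thresholding step described in the abstract (fit a mixture of two normal distributions to pairwise distances, then find where the two fitted curves cross) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the distances are synthetic, the function names `fit_two_gaussians` and `crossing_point` are hypothetical, and a plain expectation-maximization loop stands in for whatever fitting procedure the author actually used.

```python
import math
import random


def fit_two_gaussians(data, iterations=200):
    """Fit a two-component 1-D Gaussian mixture with a basic EM loop.

    Returns two (weight, mean, sigma) tuples sorted by mean.
    """
    xs = list(data)
    n = len(xs)
    ordered = sorted(xs)
    mean = sum(xs) / n
    spread = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1e-3
    # Initial guess: means at the 10th and 90th percentiles, shared spread.
    comps = [[0.5, ordered[n // 10], spread],
             [0.5, ordered[(9 * n) // 10], spread]]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        # The 1/sqrt(2*pi) factor cancels in the ratio and is omitted.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * ((x - mu) / s) ** 2) / s
                    for w, mu, s in comps]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and sigma of each component.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nk
            comps[k] = [nk / n, mu, max(math.sqrt(var), 1e-6)]
    return sorted((tuple(c) for c in comps), key=lambda c: c[1])


def crossing_point(comps, tol=1e-9):
    """Find where the two weighted normal curves cross between their means."""
    (w1, m1, s1), (w2, m2, s2) = comps

    def log_ratio(x):
        # log of (weighted density 1 / weighted density 2) at x.
        a = math.log(w1) - math.log(s1) - 0.5 * ((x - m1) / s1) ** 2
        b = math.log(w2) - math.log(s2) - 0.5 * ((x - m2) / s2) ** 2
        return a - b

    lo, hi = m1, m2
    if log_ratio(lo) <= 0 or log_ratio(hi) >= 0:
        raise ValueError("weighted curves do not cross between the two means")
    # The log ratio decreases monotonically between the means, so bisect.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_ratio(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Synthetic, made-up distances: a lower "dialect-like" cluster and a
# higher "language-like" cluster. These are NOT values from the paper.
random.seed(0)
distances = ([random.gauss(0.45, 0.08) for _ in range(400)]
             + [random.gauss(0.85, 0.05) for _ in range(600)])
components = fit_two_gaussians(distances)
threshold = crossing_point(components)
print(f"threshold = {threshold:.3f}")
```

Pairs with a distance below the returned threshold would fall in the lower cluster and pairs above it in the higher one. Bisecting on the log of the density ratio, rather than on the densities themselves, avoids numerical underflow far from either mean.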
Anthology ID:
J19-4007
Volume:
Computational Linguistics, Volume 45, Issue 4 - December 2019
Month:
December
Year:
2019
Address:
Cambridge, MA
Venue:
CL
Publisher:
MIT Press
Pages:
823–831
URL:
https://aclanthology.org/J19-4007/
DOI:
10.1162/coli_a_00366
Cite (ACL):
Søren Wichmann. 2019. How to Distinguish Languages and Dialects. Computational Linguistics, 45(4):823–831.
Cite (Informal):
How to Distinguish Languages and Dialects (Wichmann, CL 2019)
PDF:
https://aclanthology.org/J19-4007.pdf