Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect

Yo Sato, Kevin Heffernan


Abstract
We present a universal, character-based method for representing sentences so that the distance between any two sentences can be calculated. With a small alphabet, the representation can function as a proxy for phonemes. As one of its main uses, we carry out dialect clustering: we cluster a corpus that mixes dialects or sub-languages into sub-groups and check whether these coincide with the conventional boundaries of dialects and sub-languages. Using data from multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, as a partial response to the question of what separates languages from dialects.
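The abstract does not say which character-based representation, distance, or clustering algorithm is used, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes character bigram counts as the sentence representation, cosine distance as the metric, and a naive average-linkage agglomerative clustering; the function names (`char_ngrams`, `cosine_distance`, `sentence_distance`, `cluster`) and the toy strings are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence, n=2):
    """Count overlapping character n-grams (an assumed representation)."""
    text = sentence.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def sentence_distance(s1, s2, n=2):
    """Distance between two sentences under the assumed representation."""
    return cosine_distance(char_ngrams(s1, n), char_ngrams(s2, n))


def cluster(sentences, k, n=2):
    """Naive average-linkage agglomerative clustering into k groups.

    Returns a list of clusters, each a sorted list of sentence indices.
    """
    vectors = [char_ngrams(s, n) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy strings only; these are not data from the paper.
toy = ["abcabc", "abcab", "xyzxyz", "xyzxy"]
groups = cluster(toy, k=2)
```

In a setting like the one the abstract describes, the resulting groups would then be compared against the conventional dialect or language labels of the sentences.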
Anthology ID:
2020.lrec-1.124
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
985–990
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.124
Cite (ACL):
Yo Sato and Kevin Heffernan. 2020. Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 985–990, Marseille, France. European Language Resources Association.
Cite (Informal):
Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect (Sato & Heffernan, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.124.pdf