LSTM Autoencoders for Dialect Analysis

Taraka Rama, Çağrı Çöltekin


Abstract
Computational approaches to dialectometry have employed Levenshtein distance to compute an aggregate similarity between two dialects belonging to a single language group. In this paper, we apply a sequence-to-sequence autoencoder to learn a deep representation of words that can be used for meaningful comparison across dialects. In contrast to alignment-based methods, our method does not require explicit alignments. We apply our architectures to three different datasets and show that the learned representations yield results highly similar to those of analyses based on Levenshtein distance and capture the traditional dialectal differences identified by dialectologists.
Anthology ID: W16-4803
Volume: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue: VarDial
Publisher: The COLING 2016 Organizing Committee
Pages: 25–32
URL: https://aclanthology.org/W16-4803/
Cite (ACL): Taraka Rama and Çağrı Çöltekin. 2016. LSTM Autoencoders for Dialect Analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): LSTM Autoencoders for Dialect Analysis (Rama & Çöltekin, VarDial 2016)
PDF: https://aclanthology.org/W16-4803.pdf