Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering

E. Margaret Perkoff, Josh Daniels, Alexis Palmer


Abstract
This paper presents two systems for unsupervised clustering of morphological paradigms, in the context of the SIGMORPHON 2021 Shared Task 2. The goal of this task is to correctly cluster words in a given language by their inflectional paradigm, without any previous knowledge of the language and without supervision from labeled data of any sort. The words in a single morphological paradigm are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also usually show a high degree of orthographic similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographic similarity and the other focusing on semantic similarity. Additionally, we discuss the merits of randomly initialized centroids versus pre-defined centroids for clustering. Pre-defined centroids are identified using either a standard longest common substring algorithm or a connected graph method built on longest common substring. For all development languages, the character-based embeddings perform similarly to the baseline, and the semantic embeddings perform well below the baseline. Analysis of the systems’ errors suggests that clustering based on orthographic representations is suitable for a wide range of morphological mechanisms, particularly as part of a larger system.
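The idea of seeding KMeans with centroids derived from longest common substrings can be illustrated with a minimal sketch. This is not the authors' implementation: the character-bigram count vectors stand in for the paper's character-based embeddings, the seed pairs are hand-picked for the toy example, and all function names (`longest_common_substring`, `bigram_counts`, `kmeans_with_seeds`) are hypothetical.

```python
from collections import Counter


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (first one found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def bigram_counts(word: str) -> Counter:
    """Assumed orthographic representation: counts of padded character bigrams."""
    padded = f"^{word}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def sq_dist(u, v) -> float:
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in set(u) | set(v))


def mean_vector(vectors) -> dict:
    """Component-wise mean of a non-empty list of sparse count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans_with_seeds(words, seeds, max_iters=10):
    """KMeans over bigram vectors, with centroids initialized from seed strings."""
    vecs = {w: bigram_counts(w) for w in words}
    centroids = [dict(bigram_counts(s)) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each word goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for w in words:
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(vecs[w], centroids[c]))
            clusters[nearest].append(w)
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        updated = [mean_vector([vecs[w] for w in members]) if members else centroids[i]
                   for i, members in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters


words = ["walk", "walked", "walking", "jump", "jumped", "jumping"]
# Hand-picked pairs (an assumption for this toy example) supply the seed strings.
seeds = [longest_common_substring("walked", "walking"),
         longest_common_substring("jumped", "jumping")]
clusters = kmeans_with_seeds(words, seeds)
print(seeds, clusters)
```

On this toy input the seeds are the shared stems, and each inflected form lands in the cluster of its own stem. Randomly initialized centroids would replace the `seeds` argument with randomly chosen starting vectors; the rest of the loop is unchanged.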
Anthology ID:
2021.sigmorphon-1.10
Volume:
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
August
Year:
2021
Address:
Online
Editors:
Garrett Nicolai, Kyle Gorman, Ryan Cotterell
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
90–97
URL:
https://aclanthology.org/2021.sigmorphon-1.10
DOI:
10.18653/v1/2021.sigmorphon-1.10
Cite (ACL):
E. Margaret Perkoff, Josh Daniels, and Alexis Palmer. 2021. Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 90–97, Online. Association for Computational Linguistics.
Cite (Informal):
Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering (Perkoff et al., SIGMORPHON 2021)
PDF:
https://aclanthology.org/2021.sigmorphon-1.10.pdf