One Wug, Two Wug+s Transformer Inflection Models Hallucinate Affixes

Farhan Samir, Miikka Silfverberg


Abstract
Data augmentation strategies are increasingly important in NLP pipelines for low-resourced and endangered languages, and in neural morphological inflection, augmentation by so-called data hallucination is a popular technique. This paper presents a detailed analysis of inflection models trained with and without data hallucination for the low-resourced Canadian Indigenous language Gitksan. Our analysis reveals evidence for a concatenative inductive bias in augmented models: in contrast to models trained without hallucination, they strongly prefer affixing inflection patterns over suppletive ones. We find that preference for affixation in general improves inflection performance in “wug test”-like settings, where the model is asked to inflect lexemes missing from the training set. However, data hallucination dramatically reduces prediction accuracy for reduplicative forms due to a misanalysis of reduplication as affixation. While the overall impact of data hallucination for unseen lexemes remains positive, our findings call for greater qualitative analysis and more varied evaluation conditions in testing automatic inflection systems. Our results indicate that further innovations in data augmentation for computational morphology are desirable.
Anthology ID:
2022.computel-1.5
Volume:
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz
Venue:
ComputEL
Publisher:
Association for Computational Linguistics
Pages:
31–40
URL:
https://aclanthology.org/2022.computel-1.5
DOI:
10.18653/v1/2022.computel-1.5
Cite (ACL):
Farhan Samir and Miikka Silfverberg. 2022. One Wug, Two Wug+s Transformer Inflection Models Hallucinate Affixes. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 31–40, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
One Wug, Two Wug+s Transformer Inflection Models Hallucinate Affixes (Samir & Silfverberg, ComputEL 2022)
PDF:
https://aclanthology.org/2022.computel-1.5.pdf
Video:
https://aclanthology.org/2022.computel-1.5.mp4