A Mansi FST and spellchecker

Jack Rueter, Csilla Horváth, Trond Trosterud


Abstract
The article presents a finite-state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9 % of a large (700k) newspaper corpus. As part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels. For the 1000 most common words containing long vowels, the spellchecker gave a correct suggestion among the top five in 98.3 % of the cases, and as the first suggestion in 91.3 % of the cases.
Anthology ID:
2025.cgmta-1.5
Volume:
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Trond Trosterud, Linda Wiechetek, Flammie Pirinen
Venues:
cgmta | WS
Publisher:
University of Tartu Library
Pages:
32–37
URL:
https://aclanthology.org/2025.cgmta-1.5/
Cite (ACL):
Jack Rueter, Csilla Horváth, and Trond Trosterud. 2025. A Mansi FST and spellchecker. In Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP, pages 32–37, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
A Mansi FST and spellchecker (Rueter et al., cgmta 2025)
PDF:
https://aclanthology.org/2025.cgmta-1.5.pdf