Enhancing Turkish Word Segmentation: A Focus on Borrowed Words and Invalid Morpheme

Soheila Behrooznia, Ebrahim Ansari, Zdenek Zabokrtsky


Abstract
This study addresses a challenge in morphological segmentation: accurately segmenting words in morphologically rich languages. Current probabilistic methods, such as Morfessor, often produce segmentations that are inconsistent with human-segmented words. We add steps to the Morfessor segmentation process that account for invalid morphemes and for words borrowed from other languages, which significantly improves morphological segmentation. Comparing our approach with the results obtained from Morfessor demonstrates its effectiveness: it yields more accurate morphological segmentation. The improvement is particularly evident for Turkish, highlighting the potential for further advances in morpheme segmentation for morphologically rich languages.
Anthology ID:
2024.loresmt-1.9
Volume:
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jade Abbott, Jonathan Washington, Nathaniel Oco, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
Pages:
85–93
URL:
https://aclanthology.org/2024.loresmt-1.9
Cite (ACL):
Soheila Behrooznia, Ebrahim Ansari, and Zdenek Zabokrtsky. 2024. Enhancing Turkish Word Segmentation: A Focus on Borrowed Words and Invalid Morpheme. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 85–93, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Enhancing Turkish Word Segmentation: A Focus on Borrowed Words and Invalid Morpheme (Behrooznia et al., LoResMT-WS 2024)
PDF:
https://aclanthology.org/2024.loresmt-1.9.pdf