A Characterwise Windowed Approach to Hebrew Morphological Segmentation

Amir Zeldes


Abstract
This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, improvements of approximately 4% and 5%, respectively, over the previous state of the art.
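To make the character-wise framing concrete, the following is a minimal sketch, not the RFTokenizer implementation. It assumes a gold format in which segments are separated by "|", a padding symbol "_", a window of two characters on each side, and two simple lexicon-lookup features. The function names, feature names, and toy lexicon are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

PAD = "_"  # assumed padding symbol for window positions outside the word


def chars_and_labels(segmented: str, sep: str = "|") -> Tuple[str, List[int]]:
    """Turn a gold form with separators into (surface form, binary labels).

    labels[i] == 1 means a segment boundary follows character i.
    """
    surface: List[str] = []
    labels: List[int] = []
    for ch in segmented:
        if ch == sep:
            if labels:  # mark the character just before the separator
                labels[-1] = 1
        else:
            surface.append(ch)
            labels.append(0)
    return "".join(surface), labels


def window_features(word: str, i: int, lexicon: Set[str], size: int = 2) -> Dict[str, object]:
    """Features for the decision 'is there a boundary after character i?'."""
    feats: Dict[str, object] = {}
    # Adjacent characters in a fixed window around position i.
    for off in range(-size, size + 1):
        j = i + off
        feats[f"char[{off}]"] = word[j] if 0 <= j < len(word) else PAD
    # Lexicon lookups: would splitting here leave known forms on either side?
    feats["left_in_lex"] = word[: i + 1] in lexicon
    feats["right_in_lex"] = word[i + 1:] in lexicon
    return feats


def featurize(word: str, lexicon: Set[str], size: int = 2) -> List[Dict[str, object]]:
    """One feature dictionary per character, ready for any binary classifier."""
    return [window_features(word, i, lexicon, size) for i in range(len(word))]
```

For example, calling chars_and_labels on the form ב|בית ("in a house") yields the four-character surface form and one label per character, with only the first label marking a boundary. Passing that surface form to featurize with a toy lexicon containing ב and בית yields one feature dictionary per character, which could then be fed to any off-the-shelf binary classifier.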
Anthology ID:
W18-5811
Volume:
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Sandra Kuebler, Garrett Nicolai
Venue:
EMNLP
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
101–110
URL:
https://aclanthology.org/W18-5811
DOI:
10.18653/v1/W18-5811
Cite (ACL):
Amir Zeldes. 2018. A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 101–110, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
A Characterwise Windowed Approach to Hebrew Morphological Segmentation (Zeldes, EMNLP 2018)
PDF:
https://aclanthology.org/W18-5811.pdf
Code:
amir-zeldes/RFTokenizer
Data:
Wiki5K Hebrew segmentation, SPMRL Hebrew segmentation data