Farasa: A New Fast and Accurate Arabic Word Segmenter

Kareem Darwish, Hamdy Mubarak


Abstract
In this paper, we present Farasa (meaning "insight" in Arabic), a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features we use account for: likelihood of stems, prefixes, suffixes, and their combination; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa matches or outperforms state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA. Farasa is also nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.
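The ranking idea in the abstract can be illustrated with a minimal sketch: enumerate candidate prefix/stem/suffix splits of a word, score each with a linear model, and keep the top-ranked one. This is not Farasa's implementation. The clitic inventories, log-likelihoods, lexicon, weights, and function names below are hypothetical values chosen for illustration, words are written in Buckwalter-style transliteration, and compound prefixes such as "wAl" are treated as a single unit for brevity. The weights are fixed by hand here, whereas the paper learns them with SVMrank.

```python
# Toy resources; every value below is an illustrative assumption,
# not one of Farasa's actual lexicons or statistics.
PREFIX_LOGP = {"": -0.5, "w": -1.5, "b": -2.0, "Al": -1.0, "wAl": -2.0}
SUFFIX_LOGP = {"": -0.3, "h": -1.5, "hm": -2.0, "At": -1.8}
STEM_LEXICON = {"ktAb", "byt"}

# Hand-picked linear weights for (stem in lexicon, prefix log-likelihood,
# suffix log-likelihood). A trained ranker would learn these instead.
WEIGHTS = (3.0, 1.0, 1.0)


def candidate_segmentations(word):
    """Enumerate (prefix, stem, suffix) splits allowed by the toy inventories."""
    candidates = []
    for prefix in PREFIX_LOGP:
        for suffix in SUFFIX_LOGP:
            if len(prefix) + len(suffix) >= len(word):
                continue  # the stem must be non-empty
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                candidates.append((prefix, stem, suffix))
    return candidates


def features(candidate):
    """Map a candidate split to a small feature vector."""
    prefix, stem, suffix = candidate
    in_lexicon = 1.0 if stem in STEM_LEXICON else 0.0
    return (in_lexicon, PREFIX_LOGP[prefix], SUFFIX_LOGP[suffix])


def score(candidate):
    """Linear score: dot product of the weights and the feature vector."""
    return sum(w * f for w, f in zip(WEIGHTS, features(candidate)))


def segment(word):
    """Return the highest-scoring split, joining non-empty parts with '+'."""
    best = max(candidate_segmentations(word), key=score)
    return "+".join(part for part in best if part)


print(segment("wAlktAb"))  # wAl+ktAb
print(segment("ktAbhm"))   # ktAb+hm
```

Because the score is a plain dot product, ranking the candidates for a word costs only a handful of multiplications per candidate, which fits the emphasis on speed in the abstract. The sketch omits the stem-template and named-entity features that the abstract also lists.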
Anthology ID:
L16-1170
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1070–1074
URL:
https://aclanthology.org/L16-1170/
Cite (ACL):
Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A New Fast and Accurate Arabic Word Segmenter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1070–1074, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Farasa: A New Fast and Accurate Arabic Word Segmenter (Darwish & Mubarak, LREC 2016)
PDF:
https://aclanthology.org/L16-1170.pdf