A Framework for Spelling Correction in Persian Language Using Noisy Channel Model

Mohammad Hoseyn Sheykholeslam, Behrouz Minaei-Bidgoli, Hossein Juzi


Abstract
Several methods have been proposed for spelling correction in Farsi (Persian). Unfortunately, no powerful framework has been implemented, because no large Farsi training set has been available from which to build an accurate model. We obtained a training set of erroneous strings paired with their corrections from a large number of books, each of which was typed twice at the Computer Research Center of Islamic Sciences, and trained our error model on this large set. In the testing phase, after finding the erroneous words in a sample text, our program proposes candidate corrections for each of them. The paper focuses on the method used to rank these candidates, which is a version of noisy channel spelling correction customized for Farsi. Given a typo t, the ranking method seeks the intended correction c that maximizes P(c) P(t | c). Several other methods are also described and analyzed to give a broad overview of the field. Our evaluation results show that the noisy channel model, using our corpus and training set within this framework, is more accurate and more efficient than the other methods.
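The ranking rule stated in the abstract, choosing the correction c that maximizes P(c) P(t | c), can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name rank_candidates, the add-one smoothing of the prior, the toy Latin-script word counts, and the hand-set channel probabilities are all assumptions made for the example. In the paper, the error model P(t | c) is trained from the erroneous/correct string pairs described above.

```python
from collections import Counter

def rank_candidates(typo, candidates, word_counts, channel_prob):
    """Order candidate corrections c for a typo t by P(c) * P(t | c), best first.

    word_counts  -- corpus frequencies used to estimate the prior P(c)
    channel_prob -- callable (typo, candidate) -> P(t | c), the error model
    """
    total = sum(word_counts.values())
    vocab = len(word_counts)
    scored = []
    for c in candidates:
        # Assumption: add-one smoothing so unseen candidates keep a nonzero prior.
        prior = (word_counts.get(c, 0) + 1) / (total + vocab)
        scored.append((prior * channel_prob(typo, c), c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

# Hypothetical toy data; a real system would estimate both tables from a corpus.
toy_counts = Counter({"the": 50, "then": 10, "ten": 5})
toy_channel = {("teh", "the"): 0.6, ("teh", "then"): 0.1, ("teh", "ten"): 0.2}

def toy_channel_prob(typo, candidate):
    # Small floor probability for pairs missing from the toy table.
    return toy_channel.get((typo, candidate), 1e-6)

ranking = rank_candidates("teh", ["then", "ten", "the"], toy_counts, toy_channel_prob)
print(ranking)
```

With these toy numbers the frequent word "the" ranks first because both its prior and its channel probability are the largest. "ten" narrowly outranks "then" because its higher channel probability outweighs its lower prior.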
Anthology ID:
L12-1194
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
706–710
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/384_Paper.pdf
Cite (ACL):
Mohammad Hoseyn Sheykholeslam, Behrouz Minaei-Bidgoli, and Hossein Juzi. 2012. A Framework for Spelling Correction in Persian Language Using Noisy Channel Model. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 706–710, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A Framework for Spelling Correction in Persian Language Using Noisy Channel Model (Sheykholeslam et al., LREC 2012)