Improving K-Nearest Neighbor Efficacy for Farsi Text Classification

Mohammad Hossein Elahimanesh, Behrouz Minaei, Hossein Malekinezhad


Abstract
Text classification is one of the most common tasks in text mining. Because of the complex nature of the Farsi language, with its multi-part words and compound verbs, most text classification systems are not applicable to Farsi texts. K-Nearest Neighbors (KNN) is one of the most widely used methods for text classification and performs well in experiments on different datasets. This paper proposes a method to improve the classification performance of KNN. The effects of removing or retaining stop words and of applying N-grams of different lengths are also studied. For this study, a portion of a standard Farsi corpus called Hamshahri1 and articles from several archived newspapers are used. The results indicate that classification efficiency improves with this approach, especially when eight-gram indexing and stop-word removal are applied. N-grams longer than 3 characters gave very encouraging results for Farsi text classification. The classification results of our method are compared with those reported in related work.
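The ingredients named in the abstract (character N-gram indexing, stop-word removal, and KNN voting) can be illustrated with a minimal sketch. This is a plain KNN baseline with cosine similarity, not the improved method the paper proposes, which the abstract does not describe. The function names, the toy English training data, and the parameter defaults are all assumptions made for illustration; the paper reports its best results with eight-grams on Farsi text, while the toy example uses shorter grams because the sample strings are short.

```python
from collections import Counter
import math


def char_ngrams(text, n, stop_words=frozenset()):
    """Count character n-grams of `text` after dropping stop words (hypothetical helper)."""
    tokens = [t for t in text.split() if t not in stop_words]
    cleaned = " ".join(tokens)
    if len(cleaned) < n:
        # Text shorter than n: keep it as a single feature, or nothing if empty.
        return Counter([cleaned]) if cleaned else Counter()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training, k=3, n=4, stop_words=frozenset()):
    """Label `text` by majority vote among its k most similar training documents."""
    query = char_ngrams(text, n, stop_words)
    scored = sorted(
        ((cosine(query, char_ngrams(doc, n, stop_words)), label)
         for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, assumed for illustration only (not from the paper's corpus).
training = [
    ("football match goal team", "sport"),
    ("team wins football league", "sport"),
    ("stock market bank shares", "economy"),
    ("bank raises market rates", "economy"),
]
stop_words = frozenset({"the", "a", "and"})

label = knn_classify("the football team scored a goal", training,
                     k=3, n=4, stop_words=stop_words)
print(label)
```

Changing `n` and passing an empty `stop_words` set gives a simple way to compare N-gram lengths and the effect of stop-word removal, which are the two factors the abstract says were studied.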
Anthology ID:
L12-1538
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1618–1621
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/903_Paper.pdf
Cite (ACL):
Mohammad Hossein Elahimanesh, Behrouz Minaei, and Hossein Malekinezhad. 2012. Improving K-Nearest Neighbor Efficacy for Farsi Text Classification. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1618–1621, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Improving K-Nearest Neighbor Efficacy for Farsi Text Classification (Elahimanesh et al., LREC 2012)