Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization

Muhammad Humayoun, Hwanjo Yu


Abstract
Preprocessing is a preliminary step in many fields, including IR and NLP. The effect of basic preprocessing settings on text summarization is well studied for English. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings on single-document text summarization for Urdu, using various experiments on a benchmark corpus. The analysis uses state-of-the-art algorithms for extractive summarization and examines the effect of stopword removal, lemmatization, and stemming. The results show that these preprocessing settings improve summarization results.
Anthology ID:
L16-1585
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3686–3693
URL:
https://aclanthology.org/L16-1585
Cite (ACL):
Muhammad Humayoun and Hwanjo Yu. 2016. Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3686–3693, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization (Humayoun & Yu, LREC 2016)
PDF:
https://aclanthology.org/L16-1585.pdf
Code
 humsha/USCorpus
Data
CC-News