Urdu Summary Corpus

Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam, Omer Farzand


Abstract
Language resources such as corpora are important for many natural language processing tasks. Urdu has millions of speakers around the world, but it is under-resourced in terms of standard evaluation resources. This paper reports the construction of a benchmark corpus of Urdu summaries (abstracts) to facilitate the development and evaluation of single-document summarization systems for the Urdu language. In Urdu, a space does not always mark a word boundary. Therefore, we created two versions of the same corpus. In the first version, words are separated by spaces. In the second version, proper word boundaries are manually tagged. We further apply normalization, part-of-speech tagging, morphological analysis, lemmatization, and stemming to the articles and their summaries in both versions. To apply these annotations, we re-implemented some NLP tools for Urdu. We provide the Urdu Summary Corpus, all of these annotations, and the required software tools (as open source) so that researchers can run experiments and evaluate their work, including but not limited to the single-document summarization task.
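
The difference between the two corpus versions can be illustrated with a minimal sketch. It assumes a hypothetical `|` marker for manually tagged word boundaries; the corpus's actual annotation format is not described in this abstract, and the function names and sample phrase below are illustrative only.

```python
# Minimal sketch: whitespace tokenization vs. explicitly tagged word boundaries.
# ASSUMPTION: "|" is a hypothetical boundary marker chosen for illustration,
# not the format used by the Urdu Summary Corpus.

def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, as in a space-separated corpus version."""
    return text.split()


def tagged_tokens(text: str, boundary: str = "|") -> list[str]:
    """Split on an explicit boundary marker, keeping spaces inside a word."""
    return [tok.strip() for tok in text.split(boundary) if tok.strip()]


# "خوب صورت" (beautiful) is a single word that is often written with a space.
raw = "خوب صورت شہر"
tagged = "خوب صورت|شہر"

print(len(whitespace_tokens(raw)))   # number of tokens when splitting on spaces
print(len(tagged_tokens(tagged)))    # number of tokens when using tagged boundaries
```

In this sketch, splitting on spaces breaks the single word into two tokens, while the tagged version keeps it whole, which is the kind of mismatch that motivates maintaining both versions.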
Anthology ID:
L16-1128
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
796–800
URL:
https://aclanthology.org/L16-1128
Cite (ACL):
Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam, and Omer Farzand. 2016. Urdu Summary Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 796–800, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Urdu Summary Corpus (Humayoun et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1128.pdf
Code:
humsha/USCorpus