Muhammad Humayoun


2016

Urdu Summary Corpus
Muhammad Humayoun | Rao Muhammad Adeel Nawab | Muhammad Uzair | Saba Aslam | Omer Farzand
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Language resources, such as corpora, are important for various natural language processing tasks. Urdu has millions of speakers around the world, but it is under-resourced in terms of standard evaluation resources. This paper reports the construction of a benchmark corpus of Urdu summaries (abstracts) to facilitate the development and evaluation of single-document summarization systems for the Urdu language. In Urdu, a space does not always mark a word boundary. Therefore, we created two versions of the same corpus. In the first version, words are separated by spaces. In the second version, proper word boundaries are manually tagged. We further apply normalization, part-of-speech tagging, morphological analysis, lemmatization, and stemming to the articles and their summaries in both versions. In order to apply these annotations, we re-implemented some NLP tools for Urdu. We provide the Urdu Summary Corpus, all of these annotations, and the necessary software tools (as open source) so that researchers can run experiments and evaluate their work, including but not limited to the single-document summarization task.

Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization
Muhammad Humayoun | Hwanjo Yu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Preprocessing is a preliminary step in many fields, including IR and NLP. The effect of basic preprocessing settings on text summarization is well studied for English. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings on single-document text summarization for Urdu through various experiments on a benchmark corpus. The analysis is performed using state-of-the-art algorithms for extractive summarization, and it examines the effect of stopword removal, lemmatization, and stemming. The experiments showed that these preprocessing settings improve the summarization results.

2011

An Open Source Punjabi Resource Grammar
Shafqat Mumtaz Virk | Muhammad Humayoun | Aarne Ranta
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Developing Punjabi Morphology, Corpus and Lexicon
Muhammad Humayoun | Aarne Ranta
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

An Open Source Urdu Resource Grammar
Shafqat Mumtaz Virk | Muhammad Humayoun | Aarne Ranta
Proceedings of the Eighth Workshop on Asian Language Resources