Ghulam Raza
2010
Transliterating Urdu for a Broad-Coverage Urdu/Hindi LFG Grammar
Muhammad Kamran Malik
|
Tafseer Ahmed
|
Sebastian Sulger
|
Tina Bögel
|
Atif Gulzar
|
Ghulam Raza
|
Sarmad Hussain
|
Miriam Butt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present a system for transliterating the Arabic-based script of Urdu to a Roman transliteration scheme. The system is integrated into a larger system consisting of a morphology module, implemented via finite state technologies, and a computational LFG grammar of Urdu that was developed with the grammar development platform XLE (Crouch et al. 2008). Our long-term goal is to handle Hindi alongside Urdu; the two languages are very similar with respect to syntax and lexicon and hence, one grammar can be used to cover both languages. However, they are not similar concerning the script -- Hindi is written in Devanagari, while Urdu uses an Arabic-based script. By abstracting away to a common Roman transliteration scheme in the respective transliterators, our system can be enabled to handle both languages in parallel. In this paper, we discuss the pipeline architecture of the Urdu-Roman transliterator, mention several linguistic and orthographic issues and present the integration of the transliterator into the LFG parsing system.
Inferring Subcat Frames of Verbs in Urdu
Ghulam Raza
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes an approach for inferring syntactic frames of verbs in Urdu from an untagged corpus. Urdu, like many other South Asian languages, is a free word order and case-rich language. Separable lexical units mark different constituents for case in phrases and clauses and are called case clitics. There is not always a one to one correspondence between case clitic form and case, and case and grammatical function in Urdu. Case clitics, therefore, can not serve as direct clues for extracting the syntactic frames of verbs. So a two-step approach has been implemented. In a first step, all case clitic combinations for a verb are extracted and the unreliable ones are filtered out by applying the inferential statistics. In a second step, the information of occurrences of case clitic forms in different combinations as a whole and on individual level is processed to infer all possible syntactic frames of the verb.
Search
Co-authors
- Muhammad Kamran Malik 1
- Tafseer Ahmed 1
- Sebastian Sulger 1
- Tina Bögel 1
- Atif Gulzar 1
- show all...
Venues
- lrec2