2016
pdf
bib
abs
Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform
Sharmin Muzaffar
|
Pitambar Behera
|
Girish Jha
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In South-Asian languages such as Hindi and Urdu, action verbs having compound constructions and serial verbs constructions pose serious problems for natural language processing and other linguistic tasks. Urdu is an Indo-Aryan language spoken by 51, 500, 0001 speakers in India. Action verbs that occur spontaneously in day-to-day communication are highly ambiguous in nature semantically and as a consequence cause disambiguation issues that are relevant and applicable to Language Technologies (LT) like Machine Translation (MT) and Natural Language Processing (NLP). IMAGACT4ALL is an ontology-driven web-based platform developed by the University of Florence for storing action verbs and their inter-relations. This group is currently collaborating with Jawaharlal Nehru University (JNU) in India to connect Indian languages on this platform. Action verbs are frequently used in both written and spoken discourses and refer to various meanings because of their polysemic nature. The IMAGACT4ALL platform stores each 3d animation image, each one of them referring to a variety of possible ontological types, which in turn makes the annotation task for the annotator quite challenging with regard to selecting verb argument structure having a range of probability distribution. The authors, in this paper, discuss the issues and challenges such as complex predicates (compound and conjunct verbs), ambiguously animated video illustrations, semantic discrepancies, and the factors of verb-selection preferences that have produced significant problems in annotating Urdu verbs on the IMAGACT ontology.
pdf
bib
abs
The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages
Pitambar Behera
|
Sharmin Muzaffar
|
Atul Ku. Ojha
|
Girish Jha
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
Action verbs are one of the frequently occurring linguistic elements in any given natural language as the speakers use them during every linguistic intercourse. However, each language expresses action verbs in its own inherently unique manner by categorization. One verb can refer to several interpretations of actions and one action can be expressed by more than one verb. The inter-language and intra-language variations create ambiguity for the translation of languages from the source language to target language with respect to action verbs. IMAGACT is a corpus-based ontological platform of action verbs translated from prototypic animated images explained in English and Italian as meta-languages. In this paper, we are presenting the issues and challenges in translating action verbs of Indian languages as target and English as source language by observing the animated images. Among the ten Indian languages which have been annotated so far on the platform are Sanskrit, Hindi, Urdu, Odia (Oriya), Bengali, Manipuri, Tamil, Assamese, Magahi and Marathi. Out of them, Manipuri belongs to the Sino-Tibetan, Tamil comes off the Dravidian and the rest owe their genesis to the Indo-Aryan language family. One of the issues is that the one-word morphological English verbs are translated into most of the Indian languages as verbs having more than one-word form; for instance as in the case of conjunct, compound, serial verbs and so on. We are further presenting a cross-lingual comparison of action verbs among Indian languages. In addition, we are also dealing with the issues in disambiguating animated images by the L1 native speakers using competence-based judgements and the theoretical and machine translation implications they bear.
pdf
bib
abs
Dealing with Linguistic Divergences in English-Bhojpuri Machine Translation
Pitambar Behera
|
Neha Mourya
|
Vandana Pandey
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
In Machine Translation, divergence is one of the major barriers which plays a deciding role in determining the efficiency of the system at hand. Translation divergences originate when there is structural discrepancies between the input and the output languages. It can be of various types based on the issues we are addressing to such as linguistic, cultural, communicative and so on. Owing to the fact that two languages owe their origin to different language families, linguistic divergences emerge. The present study attempts at categorizing different types of linguistic divergences: the lexical-semantic and syntactic. In addition, it also helps identify and resolve the divergent linguistic features between English as source language and Bhojpuri as target language pair. Dorr’s theoretical framework (1994, 1994a) has been followed in the classification and resolution procedure. Furthermore, so far as the methodology is concerned, we have adhered to the Dorr’s Lexical Conceptual Structure for the resolution of divergences. This research will prove to be beneficial for developing efficient MT systems if the mentioned factors are incorporated considering the inherent structural constraints between source and target languages.ated considering the inherent structural constraints between SL and TL pairs.