The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Stefanie Dipper | Melanie Seiss | Heike Zinsmeister
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Parallel corpora ― original texts aligned with their translations ― are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a """"shining through"""" of the original language preferences in the translated text. Translators also tend to make their translations most comprehensible with the effect that translated texts can be more explicit than their source texts. Motivated by the need to use a parallel resource for cross-linguistic feature induction in abstract anaphora resolution, this paper investigates properties of English and German texts in the Europarl corpus, taking into account both general features such as sentence length as well as task-dependent features such as the distribution of demonstrative noun phrases. The investigation is based on the entire Europarl corpus as well as on a small subset thereof, which has been manually annotated. The results indicate English translated texts are sufficiently """"authentic"""" to be used as training data for anaphora resolution; results for German texts are less conclusive, though.
Resource development mainly focuses on well-described languages with a large amount of speakers. However, smaller languages may also profit from language resources which can then be used in applications such as electronic dictionaries or computer-assisted language learning materials. The development of resources for such languages may face various challenges. Often, not enough data is available for a successful statistical approach and the methods developed for other languages may not be suitable for this specific language. This paper presents a morphological analyzer for Murrinh-Patha, a polysynthetic language spoken in the Northern Territory of Australia. While nouns in Murrinh-Patha only show minimal inflection, verbs in this language are very complex. The complexity makes it very difficult if not impossible to handle data in Murrinh-Patha with statistical, surface-oriented methods. I therefore present a rule-based morphological analyzer built in XFST and LEXC (Beesley and Karttunen, 2003) which can handle the inflection on nouns and adjectives as well as the complexities of the Murrinh-Patha verb.