Jirka Hana

2017

Understanding Non-Native Writings: Can a Parser Help?
Jirka Hana | Barbora Hladká
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

We present a pilot study on parsing non-native texts written by learners of Czech. We performed experiments that have shown that at least high-level syntactic functions, like subject, predicate, and object, can be assigned based on a parser trained on standard native language.

2014

Sentence diagrams: their evaluation and combination
Jirka Hana | Barbora Hladká | Ivana Lukšová
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

pdf bib abs

The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provides a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform.

2013

Automatic Identification of Learners’ Language Background Based on Their Writing in Czech
Katsiaryna Aharodnik | Marco Chang | Anna Feldman | Jirka Hana
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Building a learner corpus
Jirka Hana | Alexandr Rosen | Barbora Štindlová | Petr Jäger
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore options for applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation.

Getting more data – Schoolkids as annotators
Jirka Hana | Barbora Hladká
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a new way to get more morphologically and syntactically annotated data. We have developed an annotation editor tailored to school children to involve them in text annotation. Using this editor, they practice morphology and dependency-based syntax in the same way as they normally do at (Czech) schools, without any special training. Their annotation is then automatically transformed into the target annotation scheme. The editor is designed to be language independent; however, the subsequent transformation is driven by the annotation framework we are heading for. In our case, the object language is Czech and the target annotation scheme corresponds to the Prague Dependency Treebank annotation framework.

Prague Markup Language Framework
Jirka Hana | Jan Štěpánek
Proceedings of the Sixth Linguistic Annotation Workshop

2011

A low-budget tagger for Old Czech
Jirka Hana | Anna Feldman | Katsiaryna Aharodnik
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2010

A Positional Tagset for Russian
Jirka Hana | Anna Feldman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Fusional languages have rich inflection. As a consequence, tagsets capturing their morphological features are necessarily large. A natural way to make a tagset manageable is to use a structured system. In this paper, we present a positional tagset for describing morphological properties of Russian. The tagset was inspired by the Czech positional system (Hajic, 2004). We have used preliminary versions of this tagset in our previous work (e.g., Hana et al. (2004, 2006); Feldman (2006); Feldman and Hana (2010)). Here, we systematize and extend these preliminary versions (by adding information about animacy, aspect and reflexivity), give a more detailed description of the tagset, and provide a comparison with the Czech system. Each tag of the tagset consists of 16 positions, each encoding one morphological feature (part-of-speech, detailed part-of-speech, gender, animacy, number, case, possessor's gender and number, person, reflexivity, tense, aspect, degree of comparison, negation, voice, variant). The tagset contains approximately 2,000 tags.
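
As an illustration of what a positional tag looks like in practice, the following minimal Python sketch splits a 16-character tag into named features. The order of positions follows the list in the abstract above; the function name `decode_tag`, the use of `-` for an unset position, and the example tag values are assumptions made for illustration and are not taken from the published tagset.

```python
# Feature names in the order listed in the abstract (16 positions).
POSITIONS = [
    "pos", "detailed_pos", "gender", "animacy", "number", "case",
    "possessor_gender", "possessor_number", "person", "reflexivity",
    "tense", "aspect", "degree", "negation", "voice", "variant",
]


def decode_tag(tag, unset="-"):
    """Map each character of a positional tag to its feature name.

    Positions holding the `unset` character (an assumed convention)
    are left out of the result.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} positions, got {len(tag)}"
        )
    return {name: value for name, value in zip(POSITIONS, tag) if value != unset}


# Hypothetical tag: the letters are placeholders, not real tagset values.
example = "NNMAS1----------"
print(decode_tag(example))
```

Because every feature has a fixed position, a tagger or a search tool can test a single feature (say, case) by looking at one character, without parsing the whole tag.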

Error-Tagged Learner Corpus of Czech
Jirka Hana | Alexandr Rosen | Svatava Škodová | Barbora Štindlová
Proceedings of the Fourth Linguistic Annotation Workshop

Challenges of Cheap Resource Creation
Jirka Hana | Anna Feldman
Proceedings of the Fourth Linguistic Annotation Workshop

2006

Tagging Portuguese with a Spanish Tagger
Jirka Hana | Anna Feldman | Luiz Amaral | Chris Brew
Proceedings of the Cross-Language Knowledge Induction Workshop

A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
Anna Feldman | Jirka Hana | Chris Brew
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (the noun homonymy is challenging for morphological analysis and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high precision annotated resource.