Maria Khokhlova
2024
A Corpus of Liturgical Texts in German: Towards Multilevel Text Annotation
Maria Khokhlova | Mikhail Koryshev
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
The aim of the study is to create a “documented” literary and theological history of German Catholic hymnography. The paper focuses on the creation of a corpus of liturgical texts in German and describes the first stage of annotation, which deals with the metatextual markup of Catholic hymns. The authors describe in detail the parameters of the multi-level classification of hymn texts they developed, which makes it possible to differentiate hymns on several grounds. The parameters include not only characteristics of whole hymns (the period and source of their origin, rubrics, musical accompaniment) but also characteristics of individual strophes. Based on the resulting markup, it is possible to trace general trends in texts grouped by particular meta-features. The annotation scheme is illustrated using the hymnbook Gotteslob (1975). The results present statistics on the different parameters used for hymn description.
2020
Collocations in Russian Lexicography and Russian Collocations Database
Maria Khokhlova
Proceedings of the Twelfth Language Resources and Evaluation Conference
The paper addresses collocability and collocations in Russian and surveys a wide range of dictionaries, both printed and online, that describe collocations. Our project deals with building a database that will include dictionary and statistical collocations. The former are described in various lexicographic resources, whereas the latter can be extracted automatically from corpora. Dictionaries differ from one another and present information in various ways, which makes it hard for language learners and researchers to acquire data. A number of dictionaries were analyzed and processed to retrieve verified collocations; however, the overlap between the lists of collocations extracted from them is still rather small. This indicates a need for a unified resource that takes collocability into account and provides more examples. The proposed resource will also be useful for linguists and for studying Russian as a foreign language. The obtained results can be important for machine learning and for other NLP tasks, for instance, automatic clustering of word combinations and disambiguation.
2010
Studying Word Sketches for Russian
Maria Khokhlova | Victor Zakharov
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Corpora are undoubtedly vital tools for linguistic studies and for solving applied tasks. Although corpora offer many useful possibilities, other kinds of software are needed to advance linguistic research, since it is impossible to process huge amounts of linguistic data manually. The Sketch Engine is a corpus tool that takes as input a corpus of any language and the corresponding grammar patterns. The paper describes the writing of a Sketch grammar for the Russian language as part of the Sketch Engine system. The system gives information about a word's collocability based on specific dependency models and generates lists of the most frequent phrases for a given word according to the appropriate models. The paper deals with two different approaches to writing rules for the grammar, both based on morphological information, and with applying word sketches to the Russian language. The data suggest that such results may find extensive use in various fields of linguistics, such as dictionary compiling, language learning and teaching, translation (including machine translation), phraseology, information retrieval, etc.