Lucie Poláková
Also published as: Lucie Polakova
2024
Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank
Jiří Mírovský | Pavlína Synková | Lucie Polakova | Marie Paclíková
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a cost-effective method for obtaining a high-quality annotation of explicit discourse relations in the Czech part of the Prague Czech–English Dependency Treebank, a corpus of almost 50 thousand sentences from the Czech translation of the Wall Street Journal part of the Penn Treebank. We combine three different sources of information to obtain the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) the manual tectogrammatical (deep syntax) representation of the sentences of the corpus, and (iii) the Lexicon of Czech Discourse Connectives CzeDLex. After as many discrepancies as possible have been resolved automatically, the final discourse annotation is achieved by manual inspection of the remaining problematic cases. The discourse annotation of the corpus will be available both in the Prague format (on top of tectogrammatical trees) with the Prague taxonomy of discourse types, and in the Penn format (on plain texts) with the Penn Discourse Treebank 3.0 sense taxonomy.
Developing a Rhetorical Structure Theory Treebank for Czech
Lucie Poláková | Jiří Mírovský | Šárka Zikánová | Eva Hajičová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce the first version of the Czech RST Discourse Treebank, a collection of Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST), a global coherence model proposed by Mann and Thompson (1988). Each document in the corpus is represented as a single tree-like structure, where discourse units are interconnected through hierarchical rhetorical relations and their relative importance for the main purpose of a text is modeled by the nuclearity principle. The treebank is freely available in the LINDAT/CLARIAH-CZ repository under the Creative Commons license; for some documents, it includes two gold annotations representing divergent yet relevant interpretations. The paper outlines the annotation process, provides corpus statistics and evaluation, and discusses the issue of consistency associated with the global level of textual interpretation. In general, good agreement on the structure and labeling could be achieved on the lowest, local tree level and on the identification of the most central (nuclear) elementary discourse units. Disagreements mostly concerned segmentation and, in the structure, differences in the stepwise process of linking the largest text blocks. The project contributes to the advancement of RST research and its application to real-world text analysis challenges.
Announcing the Prague Discourse Treebank 3.0
Pavlína Synková | Jiří Mírovský | Lucie Poláková | Magdaléna Rysová
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the Prague Discourse Treebank 3.0, a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared to the previous version (PDiT 2.0), version 3.0 comes with three major updates: (i) it brings a largely revised annotation of discourse relations: pragmatic relations have been thoroughly reworked, many inconsistencies across all discourse types have been fixed, and previously unclear cases marked in annotators’ comments have been resolved; (ii) it achieves consistency with the Lexicon of Czech Discourse Connectives (CzeDLex); and (iii) it provides the data not only in its native format (Prague Markup Language, discourse relations annotated on top of the dependency trees), but also in the Penn Discourse Treebank 3.0 format (plain text plus a stand-off discourse annotation) and sense taxonomy. PDiT 3.0 contains 21,662 discourse relations (plus 445 list relations) in 49 thousand sentences.
Charles Translator: A Machine Translation System between Ukrainian and Czech
Martin Popel | Lucie Polakova | Michal Novák | Jindřich Helcl | Jindřich Libovický | Pavel Straňák | Tomas Krabac | Jaroslava Hlavacova | Mariia Anisimova | Tereza Chlanova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. The system translates directly, in contrast to other available systems that use English as a pivot, and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection, implementation, and evaluation, mentions several use cases, and outlines possibilities for further development of the system for educational purposes.
2021
Discourse Relations and Connectives in Higher Text Structure
Lucie Polakova | Jiří Mírovský | Šárka Zikánová | Eva Hajičová
Dialogue Discourse Volume 12
The present article investigates the possibilities and limits of local (shallow) analysis of discourse coherence with respect to the phenomena of global coherence and higher composition of texts. We study corpora annotated with local discourse relations in Czech and partly in English to find clues in the local annotation indicating a higher discourse structure. First, we classify patterns of subsequent or overlapping pairs of local relations, and hierarchies formed by nested local relations. Special attention is then given to relations crossing paragraph boundaries and their semantic types, and to paragraph-initial discourse connectives. In the third part, we examine situations in which annotators tend to mark a large argument (larger than one sentence) of a discourse relation even with a minimality principle annotation rule in place. Our analyses bring (i) new linguistic insights regarding coherence signals in local and higher contexts, e.g. detection and description of hierarchies of local discourse relations up to 5 levels in Czech and English, description of distribution differences in semantic types in cross-paragraph and other settings, identification of Czech connectives typical only of higher structures, or the detection of a prevalence of large left-sided arguments in locally annotated data; and (ii) as another type of contribution, some new reflections on the methodologies of the approaches under scrutiny.
2020
GeCzLex: Lexicon of Czech and German Anaphoric Connectives
Lucie Poláková | Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce the first version of GeCzLex, an online electronic resource for translation equivalents of Czech and German discourse connectives. The lexicon is one of the outcomes of research on anaphoricity and long-distance relations in discourse. At present, it contains Czech and German anaphoric connectives (ACs) and their possible translations documented in bilingual parallel corpora (the translations are not necessarily anaphoric). As a basis, we use two existing monolingual lexicons of connectives, the Lexicon of Czech Discourse Connectives (CzeDLex) and the Lexicon of Discourse Markers (DiMLex) for German, and interlink their relevant entries via semantic annotation of the connectives (according to the PDTB 3 sense taxonomy) and statistical information on translation possibilities from the Czech and German parallel data of the InterCorp project. The lexicon is, as far as we know, the first bilingual inventory of connectives with linkage on the level of individual entries, and a first attempt to systematically describe devices engaged in long-distance, non-local discourse coherence. The lexicon is freely available under the Creative Commons License.
CzeDLex 0.6 and its Representation in the PML-TQ
Jiří Mírovský | Lucie Poláková | Pavlína Synková
Proceedings of the Twelfth Language Resources and Evaluation Conference
CzeDLex is an electronic lexicon of Czech discourse connectives with its data coming from a large treebank annotated with discourse relations. Its new version CzeDLex 0.6 (as compared with the previous version 0.5, which was published in 2017) is significantly larger with respect to manually processed entries. Also, its structure has been modified to allow for primary connectives to appear with multiple entries for a single discourse sense. The lexicon comes in several formats, being both human and machine readable, and is available for searching in PML Tree Query, a user-friendly and powerful search tool for all kinds of linguistically annotated treebanks. The main purpose of this paper/demo is to present the new version of the lexicon and to demonstrate possibilities of mining various types of information from the lexicon using PML Tree Query; we present several examples of search queries over the lexicon data along with their results. The new version of the lexicon, CzeDLex 0.6, is available on-line and was officially released in December 2019 under the Creative Commons License.
2019
A Test Suite and Manual Evaluation of Document-Level NMT at WMT19
Kateřina Rysová | Magdaléna Rysová | Tomáš Musil | Lucie Poláková | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
As the quality of machine translation rises and neural machine translation (NMT) moves from sentence-level to document-level translation, it is becoming increasingly difficult to evaluate the output of translation systems. We provide a test suite for WMT19 aimed at assessing discourse phenomena in the output of MT systems participating in the News Translation Task. We have manually checked the outputs and identified types of translation errors that are relevant to document-level translation.
2017
Signalling Implicit Relations: A PDTB - RST Comparison
Lucie Poláková | Jiří Mírovský | Pavlína Synková
Dialogue Discourse Volume 8
Describing implicit phenomena in discourse is known to be a problematic task, from both theoretical and empirical perspectives. The present article contributes to this topic with a novel comparative analysis of two prominent annotation approaches to discourse relations (coherence relations) that were applied to the same texts. We compare the annotation of implicit relations in the Penn Discourse Treebank 2.0, i.e. discourse relations not signaled by an explicit discourse connective, to the recently released analysis of signals of rhetorical relations in the RST Signalling Corpus (RST-SC). The intersection of corresponding pairs of relations is rather small, but it shows a clear tendency: unlike the overall signal distribution in the RST-SC, more than half of the signals in the studied intersection are of the semantic type, formed mostly by loosely defined lexical chains. Our data transformation allows for a simultaneous depiction and detailed study of the two resources.
Extracting a Lexicon of Discourse Connectives in Czech from an Annotated Corpus
Pavlína Synková | Magdaléna Rysová | Lucie Poláková | Jiří Mírovský
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
2016
Searching in the Penn Discourse Treebank Using the PML-Tree Query
Jiří Mírovský | Lucie Poláková | Jan Štěpánek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The PML-Tree Query is a general, powerful and user-friendly system for querying richly linguistically annotated treebanks. The paper shows how the PML-Tree Query can be used for searching for discourse relations in the Penn Discourse Treebank 2.0 mapped onto the syntactic annotation of the Penn Treebank.
Designing CzeDLex – A Lexicon of Czech Discourse Connectives
Jiří Mírovský | Pavlína Jínová | Magdaléna Rysová | Lucie Poláková
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters
2014
Discourse Relations in the Prague Dependency Treebank 3.0
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations
Genres in the Prague Discourse Treebank
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a project classifying Prague Discourse Treebank documents (Czech journalistic texts) by genre. Our main interest lies in opening the possibility to observe how text coherence is realized in different types (in the genre sense) of language data and, in the future, in exploring ways of using genres as a feature for multi-sentence-level language technologies. In the paper, we first describe the motivation and the concept of the genre annotation, and briefly introduce the Prague Discourse Treebank. Then, we elaborate on the process of manual annotation of genres in the treebank, from the annotators’ manual work to post-annotation checks and inter-annotator agreement measurements. The annotated genres are subsequently analyzed together with discourse relations (already annotated in the treebank): we present distributions of the annotated genres and the results of studying differences in the distributions of discourse relations across the individual genres.
2013
Machine Translation with Many Manually Labeled Discourse Connectives
Thomas Meyer | Lucie Poláková
Proceedings of the Workshop on Discourse in Machine Translation
Introducing the Prague Discourse Treebank 1.0
Lucie Poláková | Jiří Mírovský | Anna Nedoluzhko | Pavlína Jínová | Šárka Zikánová | Eva Hajičová
Proceedings of the Sixth International Joint Conference on Natural Language Processing
Subordinators with Elaborative Meanings in Czech and English
Pavlína Jínová | Lucie Poláková | Jiří Mírovský
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
2012
Does Tectogrammatics Help the Annotation of Discourse?
Jiří Mírovský | Pavlína Jínová | Lucie Poláková
Proceedings of COLING 2012: Posters
Semi-Automatic Annotation of Intra-Sentential Discourse Relations in PDT
Pavlína Jínová | Jiří Mírovský | Lucie Poláková
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects
Interplay of Coreference and Discourse Relations: Discourse Connectives with a Referential Component
Lucie Poláková | Pavlína Jínová | Jiří Mírovský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This contribution explores the subgroup of text-structuring expressions with the form preposition + demonstrative pronoun; it is thus devoted to an aspect of the interaction between coreference relations and relations signaled by discourse connectives (DCs) in a text. The demonstrative pronoun typically signals a referential link to an antecedent, whereas the whole expression can, but does not have to, carry a discourse meaning in the sense of discourse connectives. We describe the properties of these expressions with regard to their antecedents, their position among text-structuring language means, and the features typical of their connective function as compared to their non-connective function. The analysis is carried out on Czech data from the approx. 50,000 sentences of the Prague Dependency Treebank 2.0, directly on the syntactic trees. We explore the characteristics of these expressions discovered during two projects: the manual annotation of (1) coreference relations (Nedoluzhko et al. 2011) and (2) discourse connectives, their scopes and meanings (Mladová et al. 2008).