Jan Štěpánek

2025

From Form to Meaning: The Case of Particles within the Prague Dependency Treebank Annotation Scheme
Marie Mikulová | Barbora Štěpánková | Jan Štěpánek
Proceedings of the 31st International Conference on Computational Linguistics

In recent decades, computational linguistics has become increasingly interested in annotation schemes that aim at an adequate description of the meaning of sentences and texts. Discussions are ongoing on an appropriate annotation scheme for a large and complex amount of diverse information. In this contribution devoted to the description of polyfunctional uninflected words (namely particles), i.e. words which, although having only one paradigmatic form, can have several different syntactic functions and even express relatively different semantic distinctions, we argue that it is the multi-layer system (linked from meaning to text) that allows a comprehensive description of the relations between morphological properties, syntactic function and expressed meaning, and thus contributes to greater accuracy in the description of the phenomena concerned and to the overall consistency of the annotated data. These aspects are demonstrated within the Prague Dependency Treebank annotation scheme, whose pioneering proposal can be found in the first COLING proceedings from 1965 (Sgall 1965). To this day, the concept has proved to be sound and serves very well for complex annotation.

Comparing Manual and Automatic UMRs for Czech and Latin
Jan Štěpánek | Daniel Zeman | Markéta Lopatková | Federica Gamba | Hana Hledíková
Proceedings of the Sixth International Workshop on Designing Meaning Representations

Uniform Meaning Representation (UMR) is a semantic framework designed to represent the meaning of texts in a structured and interpretable manner. In this paper, we evaluate the results of the automatic conversion of existing resources to UMR, focusing on Czech (PDT-C treebank) and Latin (LDT treebank). We present both quantitative and qualitative evaluations based on a comparison between manually and automatically generated UMR structures for a sample of Czech and Latin sentences. The findings indicate comparable results of the automatic conversion for both languages. The key challenges prove to be the higher level of semantic abstraction required by UMR and the fact that UMR allows for capturing semantic structure in multiple ways, potentially with varying levels of granularity.

Label Bias in Symbolic Representation of Meaning
Marie Mikulová | Jan Štěpánek | Jan Hajič
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

This paper contributes to the trend of building semantic representations and exploring the relations between a language and the world it represents. We analyze alternative approaches to semantic representation, focusing on the methodology for determining meaning categories, their arrangement and granularity, and annotation consistency and reliability. Using the task of semantic classification of circumstantial meanings within the Prague Dependency Treebank framework, we present our principles for analyzing meaning categories. Compared with the discussed projects, the unique aspect of our approach is its focus on how a language, in its structure, reflects reality. We employ a two-level classification: a higher, coarse-grained set of general semantic concepts (defined by questions: where, how, why, etc.) and a fine-grained set of circumstantial meanings based on data-driven analysis, reflecting meanings fixed in the language. We highlight that the inherent vagueness of linguistic meaning is crucial for capturing the limitless variety of the world, but it can lead to label biases in datasets. Therefore, besides semantically clear categories, we also use fuzzy meaning categories.

2022

Quality and Efficiency of Manual Annotation: Pre-annotation Bias
Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková | Jan Hajič
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents an analysis of annotation using automatic pre-annotation for a task of mid-level annotation complexity: dependency syntax annotation. It compares the annotation efforts made by annotators using a pre-annotated version (with a high-accuracy parser) and those made by fully manual annotation. The aim of the experiment is to judge the final annotation quality when pre-annotation is used. In addition, it evaluates the effect of automatic linguistically based (rule-formulated) checks and another annotation on the same data available to the annotators, and their influence on annotation quality and efficiency. The experiment confirmed that pre-annotation is an efficient tool for faster manual syntactic annotation which increases the consistency of the resulting annotation without reducing its quality.

2020

We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is, as has always been the case for the family of the Prague Dependency Treebanks, to serve both as training data for various types of NLP tasks and as a resource for linguistically oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (albeit not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, Czech translation of the Wall Street Journal, transcribed dialogs and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation. The diversity of the texts and annotations should serve NLP applications well, and it is also an invaluable resource for linguistic research, including comparative studies regarding texts of different genres. The corpus is publicly and freely available.

2016

Searching in the Penn Discourse Treebank Using the PML-Tree Query
Jiří Mírovský | Lucie Poláková | Jan Štěpánek
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The PML-Tree Query is a general, powerful and user-friendly system for querying richly linguistically annotated treebanks. The paper shows how the PML-Tree Query can be used for searching for discourse relations in the Penn Discourse Treebank 2.0 mapped onto the syntactic annotation of the Penn Treebank.

2013

Coordination Structures in Dependency Treebanks
Martin Popel | David Mareček | Jan Štěpánek | Daniel Zeman | Zdeněk Žabokrtský
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

We propose HamleDT (HArmonized Multi-LanguagE Dependency Treebank). HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easily acquirable for research purposes. What we provide instead is the software that normalizes tree structures in the data obtained by the user from their original providers.

We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high-level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features, such as valency annotation, ellipsis reconstruction and coreference.

Prague Markup Language Framework
Jirka Hana | Jan Štěpánek
Proceedings of the Sixth Linguistic Annotation Workshop

2010

Querying Diverse Treebanks in a Uniform Way
Jan Štěpánek | Petr Pajas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents a system for querying treebanks in a uniform way. The system is able to work with both dependency and constituency based treebanks in any language. We demonstrate its abilities on 11 different treebanks. The query language used by the system provides many features not available in other existing systems while still keeping the performance efficient. The paper also describes the conversion of ten treebanks into a common XML-based format used by the system, touching on the question of standards and formats. The paper then shows several examples of linguistically interesting questions that the system is able to answer, for example browsing verbal clauses without subjects or extraposed relative clauses, generating the underlying grammar in a constituency treebank, searching for non-projective edges in a dependency treebank, or determining the word-order typology of a language based on the treebank. The performance of several implementations of the system is also discussed by measuring the time requirements of some of the queries.

Ways of Evaluation of the Annotators in Building the Prague Czech-English Dependency Treebank
Marie Mikulová | Jan Štěpánek
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present several ways to measure and evaluate the annotation and annotators, proposed and used during the building of the Czech part of the Prague Czech-English Dependency Treebank. First, the basic principles of the treebank annotation project are introduced (division into three layers: morphological, analytical and tectogrammatical). The main part of the paper describes in detail one of the important phases of the annotation process: three ways of evaluating the annotators, namely inter-annotator agreement, error rate and performance. The measuring of the inter-annotator agreement is complicated by the fact that the data contain added and deleted nodes, making the alignment between annotations non-trivial. The error rate is measured by a set of automatic checking procedures that guard the validity of some invariants in the data. The performance of the annotators is measured by a booking web application. All three measures are later compared and related to each other.