Zdeněk Žabokrtský

Also published as: Zdenek Zabokrtsky, Zdenĕk Žabokrtský, Zdenek Žabokrtsky

2024

Enhancing Turkish Word Segmentation: A Focus on Borrowed Words and Invalid Morpheme
Soheila Behrooznia | Ebrahim Ansari | Zdenek Zabokrtsky
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

This study addresses a challenge in morphological segmentation: accurately segmenting words in languages with rich morphology. Current probabilistic methods, such as Morfessor, often produce results that lack consistency with human-segmented words. Our study adds some steps to the Morfessor segmentation process to consider invalid morphemes and borrowed words from other languages to improve morphological segmentation significantly. Comparing our idea to the results obtained from Morfessor demonstrates its efficiency, leading to more accurate morphology segmentation. This is particularly evident in the case of Turkish, highlighting the potential for further advancements in morpheme segmentation for morphologically rich languages.

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models that carry out this type of interpretation. Although several papers on aspects of the initiative have appeared, no overall description of the initiative’s goals, proposals and achievements has been published yet except as an online draft. This paper aims to fill this gap, as well as to discuss its progress so far.

Universal Feature-based Morphological Trees
Federica Gamba | Abishek Stephen | Zdeněk Žabokrtský
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

The paper proposes a novel data representation inspired by Universal Dependencies (UD) syntactic trees, which are extended to capture the internal morphological structure of word forms. As a result, morphological segmentation is incorporated within the UD representation of syntactic dependencies. To derive the proposed data structure we leverage existing annotation of UD treebanks as well as available resources for segmentation, and we select 10 languages to work with in the presented case study. Additionally, statistical analysis reveals a robust correlation between morphs and sets of morphological features of words. We thus align the morphs to the observed feature inventories capturing the morphological meaning of morphs. Through the beneficial exploitation of cross-lingual correspondence of morphs, the proposed syntactic representation based on morphological segmentation proves to enhance the comparability of sentence structures across languages.

2023

Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution
Zdeněk Žabokrtský | Maciej Ogrodniczuk
Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution

This paper summarizes the second edition of the shared task on multilingual coreference resolution, held with the CRAC 2023 workshop. Just like last year, participants of the shared task were to create trainable systems that detect mentions and group them based on identity coreference; however, this year’s edition uses a slightly different primary evaluation score, and is also broader in terms of covered languages: version 1.1 of the multilingual collection of harmonized coreference resources CorefUD was used as the source of training and evaluation data this time, with 17 datasets for 12 languages. 7 systems competed in this shared task.

2022

Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum
Niyati Bafna | Josef van Genabith | Cristina España-Bonet | Zdeněk Žabokrtský
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora designed for low and extremely low resource scenarios, based on combining noisy semantic signals from joint bilingual spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, containing several dozens of dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource and therefore natural language processing for them is non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character, lexical and subword cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method on the data, and show that our method outperforms both traditional orthography baselines as well as EM-style learnt edit distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.

Proceedings of the CRAC 2022 Shared Task on Multilingual Coreference Resolution
Zdeněk Žabokrtský | Maciej Ogrodniczuk
Proceedings of the CRAC 2022 Shared Task on Multilingual Coreference Resolution

This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).

Our work aims at developing a multilingual data resource for morphological segmentation. We present a survey of 17 existing data resources relevant for segmentation in 32 languages, and analyze diversity of how individual linguistic phenomena are captured across them. Inspired by the success of Universal Dependencies, we propose a harmonized scheme for segmentation representation, and convert the data from the studied resources into this common scheme. Harmonized versions of resources available under free licenses are published as a collection called UniSegments 1.0.

Constructing a Lexical Resource of Russian Derivational Morphology
Lukáš Kyjánek | Olga Lyashevskaya | Anna Nedoluzhko | Daniil Vodolazsky | Zdeněk Žabokrtský
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Words of any language are to some extent related through the ways they are formed. For instance, the verb ‘exempl-ify’ and the noun ‘example-s’ are both based on the word ‘example’, but the verb is derived from it, while the noun is inflected. In Natural Language Processing of Russian, the inflection is satisfactorily processed; however, there are only a few machine-trackable resources that capture derivations even though Russian has both of these morphological processes very rich. Therefore, we devote this paper to improving one of the methods of constructing such resources and to the application of the method to a Russian lexicon, which results in the creation of the largest lexical resource of Russian derivational relations. The resulting database dubbed DeriNet.RU includes more than 300 thousand lexemes connected with more than 164 thousand binary derivational relations. To create such data, we combined the existing machine-learning methods that we improved to manage this goal. The whole approach is evaluated on our newly created data set of manual, parallel annotation. The resulting DeriNet.RU is freely available under an open license agreement.

Recent advances in standardization for annotated language resources have led to successful large scale efforts, such as the Universal Dependencies (UD) project for multilingual syntactically annotated data. By comparison, the important task of coreference resolution, which clusters multiple mentions of entities in a text, has yet to be standardized in terms of data formats or annotation guidelines. In this paper we present CorefUD, a multilingual collection of corpora and a standardized format for coreference resolution, compatible with morphosyntactic annotations in the UD framework and including facilities for related tasks such as named entity recognition, which forms a first step in the direction of convergence for coreference resolution across languages.

Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi and Nepali
Niyati Bafna | Zdeněk Žabokrtský
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Word embeddings are growing to be a crucial resource in the field of NLP for any language. This work introduces a novel technique for static subword embeddings transfer for Indic languages from a relatively higher resource language to a genealogically related low resource language. We primarily work with Hindi-Marathi, simulating a low-resource scenario for Marathi, and confirm observed trends on Nepali. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both source and target sides over the treatment performed by fastText. Our best-performing approach uses an EM-style approach to learning bilingual subword embeddings; we also show, for the first time, that a trivial “copy-and-paste” embeddings transfer based on even perfect bilingual lexicons is inadequate in capturing language-specific relationships. We find that our approach substantially outperforms the fastText baselines for both Marathi and Nepali on the Word Similarity task as well as WordNet-Based Synonymy Tests; on the former task, its performance for Marathi is close to that of pretrained fastText embeddings that use three orders of magnitude more Marathi data.

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams; the best system averaged 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian), received 10 system submissions from 3 teams, and the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we released all system predictions, the evaluation script, and all gold standard datasets.

2021

Is one head enough? Mention heads in coreference annotations compared with UD-style heads
Anna Nedoluzhko | Michal Novák | Martin Popel | Zdeněk Žabokrtský | Daniel Zeman
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)

Do UD Trees Match Mention Spans in Coreference Annotations?
Martin Popel | Zdeněk Žabokrtský | Anna Nedoluzhko | Michal Novák | Daniel Zeman
Findings of the Association for Computational Linguistics: EMNLP 2021

One can find dozens of data resources for various languages in which coreference - a relation between two or more expressions that refer to the same real-world entity - is manually annotated. One could also assume that such expressions usually constitute syntactically meaningful units; however, mention spans have been annotated simply by delimiting token intervals in most coreference projects, i.e., independently of any syntactic representation. We argue that it could be advantageous to make syntactic and coreference annotations convergent in the long term. We present a pilot empirical study focused on matches and mismatches between hand-annotated linear mention spans and automatically parsed syntactic trees that follow Universal Dependencies conventions. The study covers 9 datasets for 8 different languages.

2020

Sentence Meaning Representations Across Languages: What Can We Learn from Existing Frameworks?
Zdeněk Žabokrtský | Daniel Zeman | Magda Ševčíková
Computational Linguistics, Volume 46, Issue 3 - September 2020

This article gives an overview of how sentence meaning is represented in eleven deep-syntactic frameworks, ranging from those based on linguistic theories elaborated for decades to rather lightweight NLP-motivated approaches. We outline the most important characteristics of each framework and then discuss how particular language phenomena are treated across those frameworks, while trying to shed light on commonalities as well as differences.

2019

Supervised Morphological Segmentation Using Rich Annotated Lexicon
Ebrahim Ansari | Zdeněk Žabokrtský | Mohammad Mahmoudi | Hamid Haghdoost | Jonáš Vidra
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Morphological segmentation of words is the process of dividing a word into smaller units called morphemes; it is tricky especially when a morphologically rich or polysynthetic language is under question. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models as well as various other machine learning based approaches for the morphological segmentation task. We trained our models using annotated segmentation lexicons. To evaluate the effect of the training data size on our models, we decided to create a large hand-annotated morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and the only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller similar lexicons for Czech and Finnish languages, we evaluated the effect of the training data size, different hyper-parameters settings as well as different RNN-based models.

Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology
Magda Ševčíková | Zdeněk Žabokrtský | Eleonora Litta | Marco Passarotti
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Attempting to separate inflection and derivation using vector space representations
Rudolf Rosa | Zdeněk Žabokrtský
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

DeriNet 2.0: Towards an All-in-One Word-Formation Resource
Jonáš Vidra | Zdeněk Žabokrtský | Magda Ševčíková | Lukáš Kyjánek
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Building a Morphological Network for Persian on Top of a Morpheme-Segmented Lexicon
Hamid Haghdoost | Ebrahim Ansari | Zdeněk Žabokrtský | Mahshid Nikravesh
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Universal Derivations Kickoff: A Collection of Harmonized Derivational Resources for Eleven Languages
Lukáš Kyjánek | Zdeněk Žabokrtský | Magda Ševčíková | Jonáš Vidra
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

2018

Semi-Automatic Construction of Word-Formation Networks (for Polish and Spanish)
Mateusz Lango | Magda Ševčíková | Zdeněk Žabokrtský
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

SumeCzech: Large Czech News-Based Summarization Dataset
Milan Straka | Nikita Mediankin | Tom Kocmi | Zdeněk Žabokrtský | Vojtěch Hudeček | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Using Adversarial Examples in Natural Language Processing
Petr Bělohlávek | Ondřej Plátek | Zdeněk Žabokrtský | Milan Straka
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Udapi: Universal API for Universal Dependencies
Martin Popel | Zdeněk Žabokrtský | Martin Vojtek
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Slavic Forest, Norwegian Wood
Rudolf Rosa | Daniel Zeman | David Mareček | Zdeněk Žabokrtský
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We once had a corp, or should we say, it once had us
They showed us its tags, isn’t it great, unified tags
They asked us to parse and they told us to use everything
So we looked around and we noticed there was near nothing
We took other langs, bitext aligned: words one-to-one
We played for two weeks, and then they said, here is the test
The parser kept training till morning, just until deadline
So we had to wait and hope what we get would be just fine
And, when we awoke, the results were done, we saw we’d won
So, we wrote this paper, isn’t it good, Norwegian wood.

Projection-based Coreference Resolution Using Deep Syntax
Michal Novák | Anna Nedoluzhko | Zdeněk Žabokrtský
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)

The paper describes the system for coreference resolution in German and Russian, trained exclusively on coreference relations projected through a parallel corpus from English. The resolver operates on the level of deep syntax and makes use of multiple specialized models. It achieves 32 and 22 points in terms of CoNLL score for Russian and German, respectively. Analysis of the evaluation results shows that the resolver for Russian is able to preserve 66% of the English resolver’s quality in terms of CoNLL score. The system was submitted to the Closed track of the CORBON 2017 Shared task.

Error Analysis of Cross-lingual Tagging and Parsing
Rudolf Rosa | Zdeněk Žabokrtský
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

If You Even Don’t Have a Bit of Bible: Learning Delexicalized POS Taggers
Zhiwei Yu | David Mareček | Zdeněk Žabokrtský | Daniel Zeman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Various unsupervised and semi-supervised methods have been proposed to tag an unseen language. However, many of them require some partial understanding of the target language because they rely on dictionaries or parallel corpora such as the Bible. In this paper, we propose a different method named delexicalized tagging, for which we only need a raw corpus of the target language. We transfer tagging models trained on annotated corpora of one or more resource-rich languages. We employ language-independent features such as word length, frequency, neighborhood entropy, character classes (alphabetic vs. numeric vs. punctuation) etc. We demonstrate that such features can, to a certain extent, serve as predictors of the part of speech, represented by the universal POS tag.

Merging Data Resources for Inflectional and Derivational Morphology in Czech
Zdeněk Žabokrtský | Magda Ševčíková | Milan Straka | Jonáš Vidra | Adéla Limburská
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper deals with merging two complementary resources of morphological data previously existing for Czech, namely the inflectional dictionary MorfFlex CZ and the recently developed lexical network DeriNet. The MorfFlex CZ dictionary has been used by a morphological analyzer capable of analyzing/generating several million Czech word forms according to the rules of Czech inflection. The DeriNet network contains several hundred thousand Czech lemmas interconnected with links corresponding to derivational relations (relations between base words and words derived from them). After summarizing basic characteristics of both resources, the process of merging is described, focusing on both rather technical aspects (growth of the data, measuring the quality of newly added derivational relations) and linguistic issues (treating lexical homonymy and vowel/consonant alternations). The resulting resource contains 970 thousand lemmas connected with 715 thousand derivational relations and is publicly available on the web under the CC-BY-NC-SA license. The data were incorporated in the MorphoDiTa library version 2.0 (which provides morphological analysis, generation, tagging and lemmatization for Czech) and can be browsed and searched by two web tools (DeriNet Viewer and DeriNet Search tool).

Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined
Daniel Zeman | David Mareček | Zhiwei Yu | Zdeněk Žabokrtský
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

2015

KLcpos3 - a Language Similarity Measure for Delexicalized Parser Transfer
Rudolf Rosa | Zdeněk Žabokrtský
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

MSTParser Model Interpolation for Multi-Source Delexicalized Transfer
Rudolf Rosa | Zdeněk Žabokrtský
Proceedings of the 14th International Conference on Parsing Technologies

2014

Cross-lingual Coreference Resolution of Pronouns
Michal Novák | Zdeněk Žabokrtský
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Word-Formation Network for Czech
Magda Ševčíková | Zdeněk Žabokrtský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In the present paper, we describe the development of the lexical network DeriNet, which captures core word-formation relations on the set of around 266 thousand Czech lexemes. The network is currently limited to derivational relations because derivation is the most frequent and most productive word-formation process in Czech. This limitation is reflected in the architecture of the network: each lexeme is allowed to be linked up with just a single base word; composition as well as combined processes (composition with derivation) are thus not included. After a brief summarization of theoretical descriptions of Czech derivation and the state of the art of NLP approaches to Czech derivation, we discuss the linguistic background of the network and introduce the formal structure of the network and the semi-automatic annotation procedure. The network was initialized with a set of lexemes whose existence was supported by corpus evidence. Derivational links were created using three sources of information: links delivered by a tool for morphological analysis, links based on an automatically discovered set of derivation rules, and links based on a grammar-based set of rules. Finally, we propose some research topics which could profit from the existence of such a lexical network.

We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular in recent years. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both of the annotation styles, including adjustments that were necessary to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future. We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.

2013

Two Case Studies on Translating Pronouns in a Deep Syntax Framework
Michal Novák | Zdeněk Žabokrtský | Anna Nedoluzhko
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Coordination Structures in Dependency Treebanks
Martin Popel | David Mareček | Jan Štěpánek | Daniel Zeman | Zdeněk Žabokrtský
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Improvements to Syntax-based Machine Translation using Ensemble Dependency Parsers
Nathan Green | Zdeněk Žabokrtský
Proceedings of the Second Workshop on Hybrid Approaches to Translation

Translation of “It” in a Deep Syntax Framework
Michal Novák | Anna Nedoluzhko | Zdeněk Žabokrtský
Proceedings of the Workshop on Discourse in Machine Translation

2012

Exploiting Reducibility in Unsupervised Dependency Parsing
David Mareček | Zdeněk Žabokrtský
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Language Richness of the Web
Martin Majliš | Zdeněk Žabokrtský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We have built a corpus containing texts in 106 languages from texts available on the Internet and on Wikipedia. The W2C Web Corpus contains 54.7 GB of text and the W2C Wiki Corpus contains 8.5 GB of text. The W2C Web Corpus contains more than 100 MB of text available for 75 languages. At least 10 MB of text is available for 100 languages. These corpora are a unique data source for linguists, since they outclass all published works both in the size of the material collected and the number of languages covered. This language data resource can be of use particularly to researchers specialized in multilingual technologies development. We also developed software that greatly simplifies the creation of a new text corpus for a given language, using text materials freely available on the Internet. Special attention was given to components for filtering and de-duplication that allow the material quality to be kept very high.

We propose HamleDT ― HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easily acquirable for research purposes. What we provide instead is the software that normalizes tree structures in the data obtained by the user from their original providers.

Prague Dependency Style Treebank for Tamil
Loganathan Ramasamy | Zdeněk Žabokrtský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Annotated corpora such as treebanks are important for the development of parsers, language applications as well as understanding of the language itself. Only very few languages possess these scarce resources. In this paper, we describe our efforts in syntactically annotating a small corpus (600 sentences) of the Tamil language. Our annotation is similar to Prague Dependency Treebank (PDT) and consists of annotation at 2 levels or layers: (i) morphological layer (m-layer) and (ii) analytical layer (a-layer). For both layers, we introduce annotation schemes, i.e., positional tagging for the m-layer and dependency relations for the a-layer. Finally, we discuss some of the issues in treebank development for Tamil.

We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features like valency annotation, ellipsis reconstruction or coreference.

CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

Hybrid Combination of Constituency and Dependency Trees into an Ensemble Dependency Parser
Nathan Green | Zdeněk Žabokrtský
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

Unsupervised Dependency Parsing using Reducibility and Fertility features
David Mareček | Zdeněk Žabokrtský
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure

Using an SVM Ensemble System for Improved Tamil Dependency Parsing
Nathan Green | Loganathan Ramasamy | Zdeněk Žabokrtský
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

The Study of Effect of Length in Morphological Segmentation of Agglutinative Languages
Loganathan Ramasamy | Zdeněk Žabokrtský | Sowmya Vajjala
Proceedings of the First Workshop on Multilingual Modeling

Morphological Processing for English-Tamil Statistical Machine Translation
Loganathan Ramasamy | Ondřej Bojar | Zdeněk Žabokrtský
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

Indonesian Dependency Treebank: Annotation and Parsing
Nathan Green | Septina Dian Larasati | Zdenek Zabokrtsky
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

Influence of Parser Choice on Dependency-Based MT
Martin Popel | David Mareček | Nathan Green | Zdeněk Žabokrtský
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Gibbs Sampling with Treeness Constraint in Unsupervised Dependency Parsing
David Mareček | Zdeněk Žabokrtský
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing

2010

pdf bib
Annotation Tool for Discourse in PDT
Jiří Mírovský | Lucie Mladová | Zdeněk Žabokrtský
Coling 2010: Demonstrations

pdf bib abs
Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Ondřej Bojar | Adam Liška | Zdeněk Žabokrtský
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

CzEng 0.9 is the third release of a large parallel corpus of Czech and English. For the current release, CzEng was extended by a significant amount of text from various types of sources, including parallel web pages, electronically available books and subtitles. This paper describes and evaluates filtering techniques employed in the process in order to avoid misaligned or otherwise damaged parallel sentences in the collection. We estimate the precision and recall of two sets of filters. The first set was used to process the data before their inclusion into CzEng. The filters from the second set were newly created to improve the filtering process for future releases of CzEng. Given the overall amount and variance of sources of the data, our experiments illustrate the utility of parallel data sources with respect to extractable parallel segments. As a similar behaviour can be expected for other language pairs, our results can be interpreted as guidelines indicating which sources other researchers should exploit first.

pdf bib
Maximum Entropy Translation Model in Dependency-Based MT Framework
Zdeněk Žabokrtský | Martin Popel | David Mareček
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf bib
Hidden Markov Tree Model in Dependency-based Machine Translation
Zdeněk Žabokrtský | Martin Popel
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
Czech Named Entity Corpus and SVM-based Recognizer
Jana Kravalová | Zdeněk Žabokrtský
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

pdf bib
Comparison of Classification and Ranking Approaches to Pronominal Anaphora Resolution in Czech
Giang Linh Ngụy | Václav Novák | Zdeněk Žabokrtský
Proceedings of the SIGDIAL 2009 Conference

2008

pdf bib
Automatic alignment of Czech and English deep syntactic dependency trees
David Mareček | Zdeněk Žabokrtský | Václav Novák
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf bib abs
CzEng 0.7: Parallel Corpus with Community-Supplied Translations
Ondřej Bojar | Miroslav Janíček | Zdeněk Žabokrtský | Pavel Češka | Peter Beňa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes CzEng 0.7, a new release of the Czech-English parallel corpus freely available for research and educational purposes. We provide basic statistics of the corpus and focus on data produced by a community of volunteers. Anonymous contributors manually correct the output of a machine translation (MT) system, generating on average 2000 sentences a month, 70% of which are indeed correct translations. We compare the utility of community-supplied and of professionally translated training data for a baseline English-to-Czech MT system.

pdf bib
TectoMT: Highly Modular MT System with Tectogrammatics Used as Transfer Layer
Zdeněk Žabokrtský | Jan Ptáček | Petr Pajas
Proceedings of the Third Workshop on Statistical Machine Translation

2006

The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank, which serves as an excellent model due to the similarity of the languages, the existence of a detailed annotation guide and an annotation editor. The initial treebank contains a portion of the MULTEXT-East parallel word-level annotated corpus, namely the first part of the Slovene translation of Orwell's 1984. This corpus was first parsed automatically, to arrive at the initial analytic level dependency trees. These were then hand corrected using the tree editor TrEd; simultaneously, the Czech annotation manual was modified for Slovene. The current version is available in XML/TEI, as well as derived formats, and has been used in a comparative evaluation using the MALT parser, and as one of the languages present in the CoNLL-X shared task on dependency parsing. The paper also discusses further work, in the first instance the composition of the corpus to be annotated next.

pdf bib abs
Valency Lexicon of Czech Verbs: Alternation-Based Model
Markéta Lopatková | Zdeněk Žabokrtský | Karolina Skwarska
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The main objective of this paper is to introduce an alternation-based model of the valency lexicon of Czech verbs, VALLEX. Alternations describe regular changes in the valency structure of verbs: they are seen as transformations that take one lexical unit and return a modified lexical unit as a result. We characterize and exemplify syntactically-based and semantically-based alternations and their effects on verb argument structure. The alternation-based model allows us to distinguish a minimal form of the lexicon, which provides a compact characterization of the valency structure of Czech verbs, and an expanded form of the lexicon useful for some applications.