Jan Hajic

Also published as: J. Hajič, Jan Hajič

Other people with similar names: Jan Hajič jr.


2023

Corpus-Based Multilingual Event-type Ontology: Annotation Tools and Principles
Eva Fučíková | Jan Hajič | Zdeňka Urešová
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

In the course of building a multilingual Event-type Ontology resource called SynSemClass, it was necessary to provide the maintainers and the annotators with a set of tools to facilitate their job, achieve data format consistency, and in general obtain high-quality data. We have adapted a previously existing tool (Urešová et al., 2018b), developed to assist the work in capturing bilingual synonymy. This tool needed to be both substantially expanded with some new features and fundamentally changed in the context of developing the resource for more languages, which necessarily is to be done in parallel. We are thus presenting here the tool, the new data structure design which had to change at the same time, and the associated workflow.

Spanish Verbal Synonyms in the SynSemClass Ontology
Cristina Fernández-Alcaina | Eva Fučíková | Jan Hajič | Zdeňka Urešová
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

This paper presents ongoing work in the expansion of the multilingual semantic event-type ontology SynSemClass (Czech-English-German) to include Spanish. As in previous versions of the lexicon, Spanish verbal synonyms have been collected from a sentence-aligned parallel corpus and classified into classes based on their syntactic-semantic properties. Each class member is linked to a number of syntactic and/or semantic resources specific to each language, thus enriching the annotation and enabling interoperability. This paper describes the procedure for the data extraction and annotation of Spanish verbal synonyms in the lexicon.

Mapping AMR to UMR: Resources for Adapting Existing Corpora for Cross-Lingual Compatibility
Julia Bonn | Skatje Myers | Jens E. L. Van Gysel | Lukas Denk | Meagan Vigus | Jin Zhao | Andrew Cowell | William Croft | Jan Hajič | James H. Martin | Alexis Palmer | Martha Palmer | James Pustejovsky | Zdenka Urešová | Rosa Vallejos | Nianwen Xue
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

This paper presents detailed mappings between the structures used in Abstract Meaning Representation (AMR) and those used in Uniform Meaning Representation (UMR). These structures include general semantic roles, rolesets, and concepts that are largely shared between AMR and UMR, but with crucial differences. While UMR annotation of new low-resource languages is ongoing, AMR-annotated corpora already exist for many languages, and these AMR corpora are ripe for conversion to UMR format. Rather than focusing on semantic coverage that is new to UMR (which will likely need to be dealt with manually), this paper serves as a resource (with illustrated mappings) for users looking to understand the fine-grained adjustments that have been made to the representation techniques for semantic categories present in both AMR and UMR.

Extending an Event-type Ontology: Adding Verbs and Classes Using Fine-tuned LLMs Suggestions
Jana Straková | Eva Fučíková | Jan Hajič | Zdeňka Urešová
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

In this project, we have investigated the use of advanced machine learning methods, specifically fine-tuned large language models, for pre-annotating data for a lexical extension task, namely adding descriptive words (verbs) to an existing (but incomplete, as of yet) ontology of event types. Several research questions have been focused on, from the investigation of possible heuristics to provide at least hints to annotators which verbs to include and which are outside the current version of the ontology, to the possible use of the automatic scores to help the annotators to be more efficient in finding a threshold for identifying verbs that cannot be assigned to any existing class and are therefore to be used as seeds for a new class. We have also carefully examined the correlation of the automatic scores with the human annotation. While the correlation turned out to be strong, its influence on the annotation proper is modest due to its near linearity, even though the mere fact of such pre-annotation leads to relatively short annotation times.

UMR Annotation of Multiword Expressions
Julia Bonn | Andrew Cowell | Jan Hajič | Alexis Palmer | Martha Palmer | James Pustejovsky | Haibo Sun | Zdenka Uresova | Shira Wein | Nianwen Xue | Jin Zhao
Proceedings of the Fourth International Workshop on Designing Meaning Representations

Rooted in AMR, Uniform Meaning Representation (UMR) is a graph-based formalism with nodes as concepts and edges as relations between them. When used to represent natural language semantics, UMR maps words in a sentence to concepts in the UMR graph. Multiword expressions (MWEs) pose a particular challenge to UMR annotation because they deviate from the default one-to-one mapping between words and concepts. There are different types of MWEs which require different kinds of annotation that must be specified in guidelines. This paper discusses the specific treatment for each type of MWE in UMR.

What’s the Meaning of Superhuman Performance in Today’s NLU?
Simone Tedeschi | Johan Bos | Thierry Declerck | Jan Hajič | Daniel Hershcovich | Eduard Hovy | Alexander Koller | Simon Krek | Steven Schockaert | Rico Sennrich | Ekaterina Shutova | Roberto Navigli
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases. This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating. We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.

2022

Making a Semantic Event-type Ontology Multilingual
Zdenka Uresova | Karolina Zaczynska | Peter Bourgonje | Eva Fučíková | Georg Rehm | Jan Hajic
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present an extension of the SynSemClass Event-type Ontology, originally conceived as a bilingual Czech-English resource. We added German entries to the classes representing the concepts of the ontology. Having a different starting point than the original work (unannotated parallel corpus without links to a valency lexicon and, of course, different existing lexical resources), it was a challenge to adapt the annotation guidelines, the data model and the tools used for the original version. We describe the process and results of working in such a setup. We also show the next steps to adapt the annotation process, data structures and formats and tools necessary to make the addition of a new language in the future smoother and more efficient, and possibly to allow for various teams to work on SynSemClass extensions to many languages concurrently. We also present the latest release which contains the results of adding German, freely available for download as well as for online access.

Quality and Efficiency of Manual Annotation: Pre-annotation Bias
Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková | Jan Hajic
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents an analysis of annotation using an automatic pre-annotation for a mid-level annotation complexity task - dependency syntax annotation. It compares the annotation efforts made by annotators using a pre-annotated version (with a high-accuracy parser) and those made by fully manual annotation. The aim of the experiment is to judge the final annotation quality when pre-annotation is used. In addition, it evaluates the effect of automatic linguistically-based (rule-formulated) checks and another annotation on the same data available to the annotators, and their influence on annotation quality and efficiency. The experiment confirmed that the pre-annotation is an efficient tool for faster manual syntactic annotation which increases the consistency of the resulting annotation without reducing its quality.

Overview of the ELE Project
Itziar Aldabe | Jane Dunne | Aritz Farwell | Owen Gallagher | Federico Gaspari | Maria Giagkou | Jan Hajic | Jens Peter Kückens | Teresa Lynn | Georg Rehm | German Rigau | Katrin Marheinecke | Stelios Piperidis | Natalia Resende | Tea Vojtěchová | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper provides an overview of the ongoing European Language Equality (ELE) project, an 18-month action funded by the European Commission which involves 52 partners. The primary goal of ELE is to prepare the European Language Equality Programme, in the form of a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality (DLE) in Europe by 2030.

2021

European Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm | Stelios Piperidis | Kalina Bontcheva | Jan Hajic | Victoria Arranz | Andrejs Vasiļjevs | Gerhard Backfried | Jose Manuel Gomez-Perez | Ulrich Germann | Rémi Calizzano | Nils Feldhus | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Julian Moreno-Schneider | Dimitris Galanis | Penny Labropoulou | Miltos Deligiannis | Katerina Gkirtzou | Athanasia Kolovou | Dimitris Gkoumas | Leon Voukoutis | Ian Roberts | Jana Hamrlova | Dusan Varis | Lukas Kacena | Khalid Choukri | Valérie Mapelli | Mickaël Rigault | Julija Melnika | Miro Janosik | Katja Prinz | Andres Garcia-Silva | Cristian Berrio | Ondrej Klejch | Steve Renals
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.

2020

SynSemClass Linked Lexicon: Mapping Synonymy between Languages
Zdenka Uresova | Eva Fucikova | Eva Hajicova | Jan Hajic
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

This paper reports on an extended version of a synonym verb class lexicon, newly called SynSemClass (formerly CzEngClass). This lexicon stores cross-lingual semantically similar verb senses in synonym classes extracted from a richly annotated parallel corpus, the Prague Czech-English Dependency Treebank. When building the lexicon, we make use of predicate-argument relations (valency) and link them to semantic roles; in addition, each entry is linked to several external lexicons of more or less “semantic” nature, namely FrameNet, WordNet, VerbNet, OntoNotes and PropBank, and Czech VALLEX. The aim is to provide a linguistic resource that can be used to compare semantic roles and their syntactic properties and features across languages within and across synonym groups (classes, or ‘synsets’), as well as gold standard data for automatic NLP experiments with such synonyms, such as synonym discovery, feature mapping, etc. However, perhaps the most important goal is to eventually build an event type ontology that can be referenced and used as a human-readable and human-understandable “database” for all types of events, processes and states. While the current paper describes primarily the content of the lexicon, we are also presenting a preliminary design of a format compatible with Linked Data, on which we are hoping to get feedback during discussions at the workshop. Once the resource (in whichever form) is applied to corpus annotation, deep analysis will be possible using such combined resources as training data.

Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O'Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajic | Daniel Hershcovich | Bin Li | Tim O’Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

FGD at MRP 2020: Prague Tectogrammatical Graphs
Daniel Zeman | Jan Hajic
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

Prague Tectogrammatical Graphs (PTG) is a meaning representation framework that originates in the tectogrammatical layer of the Prague Dependency Treebank (PDT) and is theoretically founded in Functional Generative Description of language (FGD). PTG in its present form has been prepared for the CoNLL 2020 shared task on Cross-Framework Meaning Representation Parsing (MRP). It is generated automatically from the Prague treebanks and stored in the JSON-based MRP graph interchange format. The conversion is partially lossy; in this paper we describe what part of annotation was included and how it is represented in PTG.

Proceedings of the 1st International Workshop on Language Technology Platforms
Georg Rehm | Kalina Bontcheva | Khalid Choukri | Jan Hajič | Stelios Piperidis | Andrejs Vasiļjevs
Proceedings of the 1st International Workshop on Language Technology Platforms

CLARIN: Distributed Language Resources and Technology in a European Infrastructure
Maria Eskevich | Franciska de Jong | Alexander König | Darja Fišer | Dieter Van Uytvanck | Tero Aalto | Lars Borin | Olga Gerassimenko | Jan Hajic | Henk van den Heuvel | Neeme Kahusk | Krista Liin | Martin Matthiesen | Stelios Piperidis | Kadri Vider
Proceedings of the 1st International Workshop on Language Technology Platforms

CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources is serving the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies which has the potential of becoming a powerful set of interoperable facilities to a variety of communities of use.

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

European Language Grid: An Overview
Georg Rehm | Maria Berger | Ela Elsholz | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Stelios Piperidis | Miltos Deligiannis | Dimitris Galanis | Katerina Gkirtzou | Penny Labropoulou | Kalina Bontcheva | David Jones | Ian Roberts | Jan Hajič | Jana Hamrlová | Lukáš Kačena | Khalid Choukri | Victoria Arranz | Andrejs Vasiļjevs | Orians Anvari | Andis Lagzdiņš | Jūlija Meļņika | Gerhard Backfried | Erinç Dikici | Miroslav Janosik | Katja Prinz | Christoph Prinz | Severin Stampler | Dorothea Thomas-Aniola | José Manuel Gómez-Pérez | Andres Garcia Silva | Christian Berrío | Ulrich Germann | Steve Renals | Ondrej Klejch
Proceedings of the Twelfth Language Resources and Evaluation Conference

With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Jan Hajič | Christopher D. Manning | Sampo Pyysalo | Sebastian Schuster | Francis Tyers | Daniel Zeman
Proceedings of the Twelfth Language Resources and Evaluation Conference

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
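As an illustration of the three annotation layers named in the abstract above, here is a minimal sketch that reads one word line of the tab-separated CoNLL-U format used for UD treebanks and groups its columns by layer. The function name `parse_word_line` and the grouping keys are hypothetical, chosen only for this example, and the sketch assumes a plain word line with an integer head (not a comment, multiword-token range, or empty node).

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_word_line(line):
    """Split one CoNLL-U word line into segmentation, morphology and syntax.

    Hypothetical helper for illustration; assumes a regular word line
    whose HEAD column is an integer.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    tok = dict(zip(FIELDS, cols))

    # FEATS is "_" when empty, otherwise "Name=Value" pairs joined by "|".
    feats = {}
    if tok["feats"] != "_":
        feats = dict(pair.split("=", 1) for pair in tok["feats"].split("|"))

    return {
        "segmentation": {"id": tok["id"], "form": tok["form"]},
        "morphology": {"lemma": tok["lemma"], "upos": tok["upos"], "feats": feats},
        "syntax": {"head": int(tok["head"]), "deprel": tok["deprel"]},
    }


example = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"
parsed = parse_word_line(example)
```

Here `parsed["morphology"]` carries the lemma, universal part-of-speech tag and features, while `parsed["syntax"]` carries the head index and dependency relation.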

Prague Dependency Treebank - Consolidated 1.0
Jan Hajič | Eduard Bejček | Jaroslava Hlavacova | Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is - as has always been the case for the family of the Prague Dependency Treebanks - to serve both as training data for various types of NLP tasks and for linguistically-oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (albeit not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, Czech translation of the Wall Street Journal, transcribed dialogs and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation. The diversity of the texts and annotations should serve NLP applications well and makes the corpus an invaluable resource for linguistic research, including comparative studies regarding texts of different genres. The corpus is publicly and freely available.

Proceedings of the Second International Workshop on Designing Meaning Representations
Nianwen Xue | Johan Bos | William Croft | Jan Hajič | Chu-Ren Huang | Stephan Oepen | Martha Palmer | James Pustejovsky
Proceedings of the Second International Workshop on Designing Meaning Representations

2019

Neural Architectures for Nested NER through Linearization
Jana Straková | Milan Straka | Jan Hajic
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and also be labeled with more than one label. We encode the nested labels using a linearized scheme. In our first proposed approach, the nested labels are modeled as multilabels corresponding to the Cartesian product of the nested labels in a standard LSTM-CRF architecture. In the second one, the nested NER is viewed as a sequence-to-sequence problem, in which the input sequence consists of the tokens and output sequence of the labels, using hard attention on the word whose label is being predicted. The proposed methods outperform the nested NER state of the art on four corpora: ACE-2004, ACE-2005, GENIA and Czech CNEC. We also enrich our architectures with the recently published contextual embeddings: ELMo, BERT and Flair, reaching further improvements for the four nested entity corpora. In addition, we report flat NER state-of-the-art results for CoNLL-2002 Dutch and Spanish and for CoNLL-2003 English.
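The idea of encoding nested labels as one multilabel per token can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `linearize`, the BILOU prefixes, the `|` separator, and the outermost-first ordering are all choices made for this example.

```python
def linearize(n_tokens, entities):
    """Turn nested entity spans into one multilabel string per token.

    entities: list of (start, end, label) with inclusive token indices.
    Assumed conventions (hypothetical): BILOU prefixes, labels joined
    by "|", outermost (longest) entity first, "O" for tokens outside
    any entity.
    """
    # Longest spans first, so outer entities precede the ones nested in them.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    tags = [[] for _ in range(n_tokens)]
    for start, end, label in ordered:
        for i in range(start, end + 1):
            if start == end:
                prefix = "U"  # single-token (unit) entity
            elif i == start:
                prefix = "B"  # beginning of a multi-token entity
            elif i == end:
                prefix = "L"  # last token of a multi-token entity
            else:
                prefix = "I"  # inside a multi-token entity
            tags[i].append(f"{prefix}-{label}")
    return ["|".join(t) if t else "O" for t in tags]


# "Bank of China opened": an ORG spanning tokens 0-2 with a GPE nested at token 2.
labels = linearize(4, [(2, 2, "GPE"), (0, 2, "ORG")])
# labels == ["B-ORG", "I-ORG", "L-ORG|U-GPE", "O"]
```

Each resulting string can then be treated as a single class by a standard sequence tagger, which is what makes the nested problem look like flat tagging.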

Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

MRP 2019: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue | Jayeol Chun | Milan Straka | Zdenka Uresova
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

Proceedings of the First International Workshop on Designing Meaning Representations
Nianwen Xue | William Croft | Jan Hajic | Chu-Ren Huang | Stephan Oepen | Martha Palmer | James Pustejovsky
Proceedings of the First International Workshop on Designing Meaning Representations

UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging
Milan Straka | Jana Straková | Jan Hajic
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and an overall winner of the 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use the pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features as regularization; and finally, we merge the selected corpora of the same language. In the lemmatization task, our system exceeds all the submitted systems by a wide margin with lemmatization accuracy 95.78 (second best was 95.00, third 94.46). In the morphological analysis, our system placed a close second: our morphological analysis accuracy was 93.19, the winning system’s 93.23.

Parallel Dependency Treebank Annotated with Interlinked Verbal Synonym Classes and Roles
Zdeňka Urešová | Eva Fučíková | Eva Hajičová | Jan Hajič
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Jan Hajič
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Jan Hajič | Martin Popel | Martin Potthast | Milan Straka | Filip Ginter | Joakim Nivre | Slav Petrov
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2018, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on test input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. This shared task constitutes a 2nd edition—the first one took place in 2017 (Zeman et al., 2017); the main metric from 2017 has been kept, allowing for easy comparison, also in 2018, and two new main metrics have been used. New datasets added to the Universal Dependencies collection between mid-2017 and the spring of 2018 have contributed to increased difficulty of the task this year. In this overview paper, we define the task and the updated evaluation methodology, describe data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

Expletives in Universal Dependency Treebanks
Gosse Bouma | Jan Hajic | Dag Haug | Joakim Nivre | Per Erik Solberg | Lilja Øvrelid
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Although treebanks annotated according to the guidelines of Universal Dependencies (UD) now exist for many languages, the goal of annotating the same phenomena in a cross-linguistically consistent fashion is not always met. In this paper, we investigate one phenomenon where we believe such consistency is lacking, namely expletive elements. Such elements occupy a position that is structurally associated with a core argument (or sometimes an oblique dependent), yet are non-referential and semantically void. Many UD treebanks identify at least some elements as expletive, but the range of phenomena differs between treebanks, even for closely related languages, and sometimes even for different treebanks for the same language. In this paper, we present criteria for identifying expletives that are applicable across languages and compatible with the goals of UD, give an overview of expletives as found in current UD treebanks, and present recommendations for the annotation of expletives so that more consistent annotation can be achieved in future releases.

Synonymy in Bilingual Context: The CzEngClass Lexicon
Zdeňka Urešová | Eva Fučíková | Eva Hajičová | Jan Hajič
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes CzEngClass, a bilingual lexical resource being built to investigate verbal synonymy in bilingual context and to relate semantic roles common to one synonym class to verb arguments (verb valency). In addition, the resource is linked to existing resources with the same or a similar aim: English and Czech WordNet, FrameNet, PropBank, VerbNet (SemLink), and valency lexicons for Czech and English (PDT-Vallex, Vallex, and EngVallex). There are several goals of this work and resource: (a) to provide gold standard data for automatic experiments in the future (such as automatic discovery of synonym classes, word sense disambiguation, assignment of classes to occurrences of verbs in text, coreferential linking of verb and event arguments in text, etc.), (b) to build a core (bilingual) lexicon linked to existing resources, for comparative studies and possibly for training automatic tools, and (c) to enrich the annotation of a parallel treebank, the Prague Czech English Dependency Treebank, which so far has contained valency annotation but has not linked synonymous senses of verbs together. The method used for extracting the synonym classes is a semi-automatic process with a substantial amount of manual work during filtering, role assignment to classes and individual class members’ arguments, and linking to the external lexical resources. We present the first version with 200 classes (about 1800 verbs) and evaluate interannotator agreement using several metrics.

pdf bib
Tools for Building an Interlinked Synonym Lexicon Network
Zdeňka Urešová | Eva Fučíková | Eva Hajičová | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Bridging the LAPPS Grid and CLARIN
Erhard Hinrichs | Nancy Ide | James Pustejovsky | Jan Hajič | Marie Hinrichs | Mohammad Fazleh Elahi | Keith Suderman | Marc Verhagen | Kyeongmin Rim | Pavel Straňák | Jozef Mišutka
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Creating a Verb Synonym Lexicon Based on a Parallel Corpus
Zdeňka Urešová | Eva Fučíková | Eva Hajičová | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Diacritics Restoration Using Neural Networks
Jakub Náplava | Milan Straka | Pavel Straňák | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
SumeCzech: Large Czech News-Based Summarization Dataset
Milan Straka | Nikita Mediankin | Tom Kocmi | Zdeněk Žabokrtský | Vojtěch Hudeček | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs
Daniel Kondratyuk | Tomáš Gavenčiak | Milan Straka | Jan Hajič
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, predicting tag subcategories, and using the tagger output as an input to the lemmatizer. We evaluate our model across several languages with complex morphology, where it surpasses state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.

2017

pdf bib
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
Jan Hajič
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

pdf bib
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Jan Hajič | Dan Zeman
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

pdf bib
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

2016

pdf bib
Fostering the Next Generation of European Language Technology: Recent Developments ― Emerging Initiatives ― Challenges and Opportunities
Georg Rehm | Jan Hajič | Josef van Genabith | Andrejs Vasiljevs
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

META-NET is a European network of excellence, founded in 2010, that consists of 60 research centres in 34 European countries. One of the key visions and goals of META-NET is a truly multilingual Europe, which is substantially supported and realised through language technologies. In this article we provide an overview of recent developments around the multilingual Europe topic; we also describe recent and upcoming events as well as recent and upcoming strategy papers. Furthermore, we provide overviews of two new emerging initiatives, the CEF.AT and ELRC activity on the one hand and the Cracking the Language Barrier federation on the other. The paper closes with several suggested next steps in order to address the current challenges and to open up new opportunities.

pdf bib
Universal Dependencies v1: A Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Yoav Goldberg | Jan Hajič | Christopher D. Manning | Ryan McDonald | Slav Petrov | Sampo Pyysalo | Natalia Silveira | Reut Tsarfaty | Daniel Zeman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments. It is also useful for multilingual system development and comparative linguistic studies. Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. In this paper, we describe v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages.

pdf bib
QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
Arantxa Otegi | Nora Aranberri | Antonio Branco | Jan Hajič | Martin Popel | Kiril Simov | Eneko Agirre | Petya Osenova | Rita Pereira | João Silva | Steven Neale
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus on the IT domain. English is common in all parallel corpora, with translations in five languages, namely, Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

pdf bib
Towards Comparability of Linguistic Graph Banks for Semantic Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Silvie Cinková | Dan Flickinger | Jan Hajič | Angelina Ivanova | Zdeňka Urešová
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We announce a new language resource for research on semantic parsing, a large, carefully curated collection of semantic dependency graphs representing multiple linguistic traditions. This resource is called SDP 2016 and provides an update and extension to previous versions used as Semantic Dependency Parsing target representations in the 2014 and 2015 Semantic Evaluation Exercises. For a common core of English text, this third edition comprises semantic dependency graphs from four distinct frameworks, packaged in a unified abstract format and aligned at the sentence and token levels. SDP 2016 is the first general release of this resource and is available for licensing from the Linguistic Data Consortium in May 2016. The data is accompanied by an open-source SDP utility toolkit and system results from previous contrastive parsing evaluations against these target representations.

pdf bib
UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing
Milan Straka | Jan Hajič | Jana Straková
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic natural language processing of large texts often presents recurring challenges in multiple languages: even for the most advanced tasks, the texts are first processed by basic processing steps – from tokenization to parsing. We present an extremely simple-to-use tool consisting of one binary and one model (per language), which performs these tasks for multiple languages without the need for any other external data. UDPipe, a pipeline processing CoNLL-U-formatted files, performs tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing for nearly all treebanks of Universal Dependencies 1.2 (namely, the whole pipeline is currently available for 32 out of 37 treebanks). In addition, the pipeline is easily trainable with training data in CoNLL-U format (and in some cases also with additional raw corpora) and requires minimal linguistic knowledge on the users’ part. The training code is also released.

pdf bib
Non-projectivity and valency
Zdenka Uresova | Eva Fucikova | Jan Hajic
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing

pdf bib
Inherently Pronominal Verbs in Czech: Description and Conversion Based on Treebank Annotation
Zdeňka Urešová | Eduard Bejček | Jan Hajič
Proceedings of the 12th Workshop on Multiword Expressions

pdf bib
Enriching a Valency Lexicon by Deverbative Nouns
Eva Fučíková | Jan Hajič | Zdeňka Urešová
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

We present an attempt to automatically identify Czech deverbative nouns using several methods that use large corpora as well as existing lexical resources. The motivation for the task is to extend a verbal valency (i.e., predicate-argument) lexicon by adding nouns that share the valency properties with the base verb, assuming their properties can be derived (even if not trivially) from the underlying verb by deterministic grammatical rules. At the same time, even in inflective languages, not all deverbatives are simply created from their underlying base verb by regular lexical derivation processes. We have thus developed hybrid techniques that use both large parallel corpora and several standard lexical resources. Thanks to the use of parallel corpora, the resulting sets also contain synonyms, which the lexical derivation rules cannot produce. For evaluation, we have manually created a small, 100-verb gold dataset, since no such dataset was initially available for Czech.

pdf bib
Proceedings of the 2nd Deep Machine Translation Workshop
Jan Hajič | Gertjan van Noord | António Branco
Proceedings of the 2nd Deep Machine Translation Workshop

pdf bib
TectoMT – a deep linguistic core of the combined Cimera MT system
Martin Popel | Roman Sudarikov | Ondřej Bojar | Rudolf Rosa | Jan Hajič
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib
Joint search in a bilingual valency lexicon and an annotated corpus
Eva Fučíková | Jan Hajič | Zdeňka Urešová
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper and the associated system demo, we present an advanced search system that allows users to perform a joint search over a (bilingual) valency lexicon and a correspondingly annotated linked parallel corpus. This search tool has been developed on the basis of the Prague Czech-English Dependency Treebank, but its ideas are applicable in principle to any bilingual parallel corpus that is annotated for dependencies and valency (i.e., predicate-argument structure), and where verbs are linked to appropriate entries in an associated valency lexicon. Our online search tool consolidates several search interfaces into one, providing expanded structured search capability and a more efficient advanced way to search, allowing users to search for verb pairs, verbal argument pairs, their surface realization as recorded in the lexicon, or for their surface form actually appearing in the linked parallel corpus. The search system is currently under development, and is replacing our current search tool available at http://lindat.mff.cuni.cz/services/CzEngVallex, which can search the lexicon, but whose queries cannot take advantage of the underlying corpus or use the additional surface form information from the lexicon(s). The system is available as open source.

2015

pdf bib
SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Silvie Cinková | Dan Flickinger | Jan Hajič | Zdeňka Urešová
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Bilingual English-Czech Valency Lexicon Linked to a Parallel Corpus
Zdeňka Urešová | Ondřej Dušek | Eva Fučíková | Jan Hajič | Jana Šindlerová
Proceedings of the 9th Linguistic Annotation Workshop

pdf bib
Using Parallel Texts and Lexicons for Verbal Word Sense Disambiguation
Ondřej Dušek | Eva Fučíková | Jan Hajič | Martin Popel | Jana Šindlerová | Zdeňka Urešová
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Proceedings of the 1st Deep Machine Translation Workshop
Jan Hajič | António Branco
Proceedings of the 1st Deep Machine Translation Workshop

2014

pdf bib
SemEval 2014 Task 8: Broad-Coverage Semantic Dependency Parsing
Stephan Oepen | Marco Kuhlmann | Yusuke Miyao | Daniel Zeman | Dan Flickinger | Jan Hajič | Angelina Ivanova | Yi Zhang
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech
Nianwen Xue | Ondřej Bojar | Jan Hajič | Martha Palmer | Zdeňka Urešová | Xiuhong Zhang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Abstract Meaning Representations (AMRs) are rooted, directional and labeled graphs that abstract away from morpho-syntactic idiosyncrasies such as word category (verbs and nouns), word order, and function words (determiners, some prepositions). Because these syntactic idiosyncrasies account for many of the cross-lingual differences, it would be interesting to see if this representation can serve, e.g., as a useful, minimally divergent transfer layer in machine translation. To answer this question, we have translated 100 English sentences that have existing AMRs into Chinese and Czech to create AMRs for them. A cross-linguistic comparison of English to Chinese and Czech AMRs reveals both cases where the AMRs for the language pairs align well structurally and cases of linguistic divergence. We found that the level of compatibility of AMR between English and Chinese is higher than between English and Czech. We believe this kind of comparison is beneficial to further refining the annotation standards for each of the three languages and will lead to more compatible annotation guidelines between the languages.

pdf bib
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

pdf bib
CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
Koenraad De Smedt | Erhard Hinrichs | Detmar Meurers | Inguna Skadiņa | Bolette Pedersen | Costanza Navarretta | Núria Bel | Krister Lindén | Markéta Lopatková | Jan Hajič | Gisle Andersen | Przemyslaw Lenkiewicz
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

pdf bib
Multilingual Test Sets for Machine Translation of Search Queries for Cross-Lingual Information Retrieval in the Medical Domain
Zdeňka Urešová | Jan Hajič | Pavel Pecina | Ondřej Dušek
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents development and test sets for machine translation of search queries in cross-lingual information retrieval in the medical domain. The data consists of a total of 1,508 real user queries in English translated to Czech, German, and French. We describe the translation and review process involving medical professionals and present a baseline experiment where our data sets are used for tuning and evaluation of a machine translation system.

pdf bib
Verbal Valency Frame Detection and Selection in Czech and English
Ondřej Dušek | Jan Hajič | Zdeňka Urešová
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
Machine Translation of Medical Texts in the Khresmoi Project
Ondřej Dušek | Jan Hajič | Jaroslava Hlaváčová | Michal Novák | Pavel Pecina | Rudolf Rosa | Aleš Tamchyna | Zdeňka Urešová | Daniel Zeman
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Comparing Czech and English AMRs
Zdeňka Urešová | Jan Hajič | Ondřej Bojar
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

pdf bib
Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition
Jana Straková | Milan Straka | Jan Hajič
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
Junichi Tsujii | Jan Hajic
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
An Analysis of Annotation of Verb-Noun Idiomatic Combinations in a Parallel Dependency Corpus
Zdenka Uresova | Jan Hajic | Eva Fucikova | Jana Sindlerova
Proceedings of the 9th Workshop on Multiword Expressions

pdf bib
Joint Morphological and Syntactic Analysis for Richly Inflected Languages
Bernd Bohnet | Joakim Nivre | Igor Boguslavsky | Richárd Farkas | Filip Ginter | Jan Hajič
Transactions of the Association for Computational Linguistics, Volume 1

Joint morphological and syntactic analysis has been proposed as a way of improving parsing accuracy for richly inflected languages. Starting from a transition-based model for joint part-of-speech tagging and dependency parsing, we explore different ways of integrating morphological features into the model. We also investigate the use of rule-based morphological analyzers to provide hard or soft lexical constraints and the use of word clusters to tackle the sparsity of lexical features. Evaluation on five morphologically rich languages (Czech, Finnish, German, Hungarian, and Russian) shows consistent improvements in both morphological and syntactic accuracy for joint prediction over a pipeline model, with further improvements thanks to lexical constraints and word clusters. The final results improve the state of the art in dependency parsing for all languages.

2012

pdf bib
HamleDT: To Parse or Not to Parse?
Daniel Zeman | David Mareček | Martin Popel | Loganathan Ramasamy | Jan Štěpánek | Zdeněk Žabokrtský | Jan Hajič
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We propose HamleDT ― HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easily acquirable for research purposes. What we provide instead is the software that normalizes tree structures in the data obtained by the user from their original providers.

pdf bib
Announcing Prague Czech-English Dependency Treebank 2.0
Jan Hajič | Eva Hajičová | Jarmila Panevová | Petr Sgall | Ondřej Bojar | Silvie Cinková | Eva Fučíková | Marie Mikulová | Petr Pajas | Jan Popelka | Jiří Semecký | Jana Šindlerová | Jan Štěpánek | Josef Toman | Zdeňka Urešová | Zdeněk Žabokrtský
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high-level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features, such as valency annotation, ellipsis reconstruction, and coreference.

2010

pdf bib
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Jan Hajič | Sandra Carberry | Stephen Clark | Joakim Nivre
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Proceedings of the ACL 2010 Conference Short Papers
Jan Hajič | Sandra Carberry | Stephen Clark | Joakim Nivre
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Resources for adding semantics to machine translation
Jan Hajič
Proceedings of the 7th International Workshop on Spoken Language Translation: Plenaries

2009

pdf bib
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task
Jan Hajič
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

pdf bib
The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages
Jan Hajič | Massimiliano Ciaramita | Richard Johansson | Daisuke Kawahara | Maria Antònia Martí | Lluís Màrquez | Adam Meyers | Joakim Nivre | Sebastian Padó | Jan Štěpánek | Pavel Straňák | Mihai Surdeanu | Nianwen Xue | Yi Zhang
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

pdf bib
Semi-Supervised Training for the Averaged Perceptron POS Tagger
Drahomíra “johanka” Spoustová | Jan Hajič | Jan Raab | Miroslav Spousta
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

pdf bib
Validating the Quality of Full Morphological Annotation
Drahomíra „johanka“ Spoustová | Pavel Pecina | Jan Hajič | Miroslav Spousta
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In our paper we present a methodology for low-cost validation of the quality of part-of-speech annotation of the Prague Dependency Treebank, based on multiple re-annotation of data samples carefully selected with the help of several different part-of-speech taggers.

pdf bib
Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation
Ondřej Bojar | Jan Hajič
Proceedings of the Third Workshop on Statistical Machine Translation

2007

pdf bib
The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech
Drahomíra “johanka” Spoustová | Jan Hajič | Jan Votrubec | Pavel Krbec | Pavel Květoň
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing

2006

pdf bib
Perspectives of Turning Prague Dependency Treebank into a Knowledge Base
Václav Novák | Jan Hajič
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Recently, the Prague Dependency Treebank 2.0 (PDT 2.0) has emerged as the largest text corpus annotated on the level of tectogrammatical representation (“linguistic meaning”) described in Sgall et al. (2004) and containing about 0.8 million words (see Hajic (2004)). We hope that this level of annotation is so close to the meaning of the utterances contained in the corpus that it should enable us to automatically transform texts contained in the corpus to the form of a knowledge base, usable for information extraction, question answering, summarization, etc. We can use Multilayered Extended Semantic Networks (MultiNet) described in Helbig (2006) as the target formalism. In this paper we discuss the suitability of such an approach and some of the main issues that will arise in the process. In section 1, we introduce formalisms underlying PDT 2.0 and MultiNet; in section 2, we describe the role MultiNet can play in the system of Functional Generative Description (FGD); section 3 discusses issues of automatic conversion to MultiNet; and section 4 gives some conclusions.

pdf bib
Leveraging Reusability: Cost-Effective Lexical Acquisition for Large-Scale Ontology Translation
G. Craig Murray | Bonnie J. Dorr | Jimmy Lin | Jan Hajič | Pavel Pecina
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Leveraging Recurrent Phrase Structure in Large-scale Ontology Translation
G. Craig Murray | Bonnie J. Dorr | Jimmy Lin | Jan Hajič | Pavel Pecina
Proceedings of the 11th Annual Conference of the European Association for Machine Translation

2005

pdf bib
Prague Czech-English dependency treebank: resource for structure-based MT
Martin Čmejrek | Jan Cuřín | Jan Hajič | Jiří Havelka
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
Non-Projective Dependency Parsing using Spanning Tree Algorithms
Ryan McDonald | Fernando Pereira | Kiril Ribarov | Jan Hajič
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
Issues in Annotation of the Czech Spontaneous Speech Corpus in the MALACH project
Josef Psutka | Pavel Ircing | Jan Hajič | Vlasta Radová | Josef V. Psutka | William J. Byrne | Samuel Gustman
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation
Martin Čmejrek | Jan Cuřín | Jiří Havelka | Jan Hajič | Vladislav Kuboň
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
A simple multilingual machine translation system
Jan Hajič | Petr Homola | Vladislav Kuboň
Proceedings of Machine Translation Summit IX: Papers

The multilingual machine translation system described in the first part of this paper demonstrates that the translation memory (TM) can be used in a creative way for making the translation process more automatic (in a way which in fact does not depend on the languages used). The MT system is based upon exploitation of syntactic similarities between more or less related natural languages. It currently covers the translation from Czech to Slovak, Polish and Lithuanian. The second part of the paper also shows that one of the most popular TM based commercial systems, TRADOS, can be used not only for the translation itself, but also for a relatively fast and natural method of evaluation of the translation quality of MT systems.

bib
10th Conference of the European Chapter of the Association for Computational Linguistics
Ann Copestake | Jan Hajič
10th Conference of the European Chapter of the Association for Computational Linguistics

2002

pdf bib
Tectogrammatical representation: towards a minimal transfer in machine translation
Jan Hajič
Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6)

2001

pdf bib
Serial Combination of Rules and Statistics: A Case Study in Czech Tagging
Jan Hajic | Pavel Krbec | Pavel Kveton | Karel Oliva | Vladimir Petkevic
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

2000

pdf bib
Machine Translation of Very Close Languages
Jan Hajic
Sixth Applied Natural Language Processing Conference

pdf bib
Morphological Tagging: Data vs. Dictionaries
Jan Hajic
1st Meeting of the North American Chapter of the Association for Computational Linguistics

1999

bib
Machine translation of very closely related languages
Jan Hajič | Jan Hric | Vladislav Kuboň
EAMT Workshop: EU and the new languages

pdf bib
A Statistical Parser for Czech
Michael Collins | Jan Hajic | Lance Ramshaw | Christoph Tillmann
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf bib
Tagging Inflective Languages: Prediction of Morphological Categories for a Rich Structured Tagset
Jan Hajič | Barbora Hladká
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset
Jan Hajic | Barbora Hladka
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

1997

pdf bib
Probabilistic and Rule-Based Tagger of an Inflective Language - a Comparison
Jan Hajic | Barbora Hladka
Fifth Conference on Applied Natural Language Processing

1995

pdf bib
Machine Translation in the Czech Republic: history, methods, systems
Jan Hajič
Proceedings of Machine Translation Summit V

1993

pdf bib
But Dictionaries Are Data Too
Peter F. Brown | Stephen A. Della Pietra | Vincent J. Della Pietra | Meredith J. Goldsmith | Jan Hajic | Robert L. Mercer | Surya Mohanty
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993

1992

pdf bib
Derivation of Underlying Valency Frames From a Learner’s Dictionary
Alexandr Rosen | Eva Hajicova | Jan Hajic
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics

pdf bib
Tagging and Alignment of Parallel Texts: Current Status of BCP
A. Winarske | S. Warwick-Armstrong | J. Hajič
Third Conference on Applied Natural Language Processing

1990

pdf bib
Spelling-checking for Highly Inflective Languages
Jan Hajic | Janus Drozd
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics

1988

pdf bib
Formal Morphology
Jan Hajic
Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics

1987

pdf bib
RUSLAN - An MT System Between Closely Related Languages
Jan Hajic
Third Conference of the European Chapter of the Association for Computational Linguistics

1982

pdf bib
Inferencing and Search for an Answer in TIBAQ
Petr Jirku | Jan Hajic
Coling 1982 Abstracts: Proceedings of the Ninth International Conference on Computational Linguistics Abstracts
