Pavel Pecina


2020

pdf bib
Document Translation vs. Query Translation for Cross-Lingual Information Retrieval in the Medical Domain
Shadi Saleh | Pavel Pecina
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a thorough comparison of two principal approaches to Cross-Lingual Information Retrieval: document translation (DT) and query translation (QT). Our experiments are conducted using the cross-lingual test collection produced within the CLEF eHealth information retrieval tasks in 2013–2015 containing English documents and queries in several European languages. We exploit the Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) paradigms and train several domain-specific and task-specific machine translation systems to translate the non-English queries into English (for the QT approach) and the English documents to all the query languages (for the DT approach). The results show that the quality of QT by SMT is sufficient enough to outperform the retrieval results of the DT approach for all the languages. NMT then further boosts translation quality and retrieval quality for both QT and DT for most languages, but still, QT provides generally better retrieval results than DT.

2019

pdf bib
English-Czech Systems in WMT19: Document-Level Transformer
Martin Popel | Dominik Macháček | Michal Auersperger | Ondřej Bojar | Pavel Pecina
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe our NMT systems submitted to the WMT19 shared task in English→Czech news translation. Our systems are based on the Transformer model implemented in either Tensor2Tensor (T2T) or Marian framework. We aimed at improving the adequacy and coherence of translated documents by enlarging the context of the source and target. Instead of translating each sentence independently, we split the document into possibly overlapping multi-sentence segments. In case of the T2T implementation, this “document-level”-trained system achieves a +0.6 BLEU improvement (p < 0.05) relative to the same system applied on isolated sentences. To assess the potential effect document-level models might have on lexical coherence, we performed a semi-automatic analysis, which revealed only a few sentences improved in this aspect. Thus, we cannot draw any conclusions from this week evidence.

2017

pdf bib
Findings of the WMT 2017 Biomedical Translation Shared Task
Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Karin Verspoor | Ondřej Bojar | Arthur Boyer | Cristian Grozea | Barry Haddow | Madeleine Kittner | Yvonne Lichtblau | Pavel Pecina | Roland Roller | Rudolf Rosa | Amy Siu | Philippe Thomas | Saskia Trescher
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

bib
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks
Jindřich Libovický | Jindřich Helcl | Marek Tlustý | Ondřej Bojar | Pavel Pecina
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

pdf bib
Proceedings of the Tenth Workshop on Statistical Machine Translation
Ondřej Bojar | Rajan Chatterjee | Christian Federmann | Barry Haddow | Chris Hokamp | Matthias Huck | Varvara Logacheva | Pavel Pecina
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Feature Extraction for Native Language Identification Using Language Modeling
Vincent Kríž | Martin Holub | Pavel Pecina
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Johannes Leveling | Christof Monz | Pavel Pecina | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia | Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Machine Translation of Medical Texts in the Khresmoi Project
Ondřej Dušek | Jan Hajič | Jaroslava Hlaváčová | Michal Novák | Pavel Pecina | Rudolf Rosa | Aleš Tamchyna | Zdeňka Urešová | Daniel Zeman
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Tolerant BLEU: a Submission to the WMT14 Metrics Task
Jindřich Libovický | Pavel Pecina
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Multilingual Test Sets for Machine Translation of Search Queries for Cross-Lingual Information Retrieval in the Medical Domain
Zdeňka Urešová | Jan Hajič | Pavel Pecina | Ondřej Dušek
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents development and test sets for machine translation of search queries in cross-lingual information retrieval in the medical domain. The data consists of the total of 1,508 real user queries in English translated to Czech, German, and French. We describe the translation and review process involving medical professionals and present a baseline experiment where our data sets are used for tuning and evaluation of a machine translation system.

2013

pdf bib
Increasing the Quality and Quantity of Source Language Data for Unsupervised Cross-Lingual POS Tagging
Long Duong | Paul Cook | Steven Bird | Pavel Pecina
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Determining Compositionality of Word Expressions Using Word Space Models
Lubomír Krčmář | Karel Ježek | Pavel Pecina
Proceedings of the 9th Workshop on Multiword Expressions

pdf bib
Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures
Eduard Bejček | Pavel Straňák | Pavel Pecina
Proceedings of the 9th Workshop on Multiword Expressions

pdf bib
Determining Compositionality of Expresssions Using Various Word Space Models and Methods
Lubomír Krčmář | Karel Ježek | Pavel Pecina
Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality

pdf bib
Simpler unsupervised POS tagging with bilingual projections
Long Duong | Paul Cook | Steven Bird | Pavel Pecina
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual “component” translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

pdf bib
Arabic Word Generation and Modelling for Spell Checking
Khaled Shaalan | Mohammed Attia | Pavel Pecina | Younes Samih | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring. Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.

pdf bib
The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the “Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation” (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

pdf bib
Simple and Effective Parameter Tuning for Domain Adaptation of Statistical Machine Translation
Pavel Pecina | Antonio Toral | Josef van Genabith
Proceedings of COLING 2012

pdf bib
Improved Spelling Error Detection and Correction for Arabic
Mohammed Attia | Pavel Pecina | Younes Samih | Khaled Shaalan | Josef van Genabith
Proceedings of COLING 2012: Posters

pdf bib
Efficiency-based evaluation of aligners for industrial applications
Antonio. Toral | Marc Poch | Pavel Pecina | Gregor Thurmair
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study
Pavel Pecina | Antonio Toral | Vassilis Papavassiliou | Prokopis Prokopidis | Josef van Genabith
Proceedings of the 16th Annual conference of the European Association for Machine Translation

2011

pdf bib
An Open-Source Finite State Morphological Transducer for Modern Standard Arabic
Mohammed Attia | Pavel Pecina | Antonio Toral | Lamia Tounsi | Josef van Genabith
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

pdf bib
Book Reviews: Syntax-Based Collocation Extraction by Violeta Seretan
Pavel Pecina
Computational Linguistics, Volume 37, Issue 3 - September 2011

pdf bib
Towards a User-Friendly Webservice Architecture for Statistical Machine Translation in the PANACEA project
Antonio Toral | Pavel Pecina | Marc Poch | Andy Way
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation
Pavel Pecina | Antonio Toral | Andy Way | Vassilis Papavassiliou | Prokopis Prokopidis | Maria Giagkou
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

pdf bib
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
An Augmented Three-Pass System Combination Framework: DCU Combination System for WMT 2010
Jinhua Du | Pavel Pecina | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Automatic Extraction of Arabic Multiword Expressions
Mohammed Attia | Antonio Toral | Lamia Tounsi | Pavel Pecina | Josef van Genabith
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf bib
Handling Named Entities and Compound Verbs in Phrase-Based Statistical Machine Translation
Santanu Pal | Sudip Kumar Naskar | Pavel Pecina | Sivaji Bandyopadhyay | Andy Way
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf bib
Czech Information Retrieval with Syntax-based Language Models
Jana Straková | Pavel Pecina
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In recent years, considerable attention has been dedicated to language modeling methods in information retrieval. Although these approaches generally allow exploitation of any type of language model, most of the published experiments were conducted with a classical n-gram model, usually limited only to unigrams. A few works exploiting syntax in information retrieval can be cited in this context, but significant contribution of syntax based language modeling for information retrieval is yet to be proved. In this paper, we propose, implement, and evaluate an enrichment of language model employing syntactic dependency information acquired automatically from both documents and queries. Our experiments are conducted on Czech which is a morphologically rich language and has a considerably free word order, therefore a syntactic language model is expected to contribute positively to the unigram and bigram language model based on surface word order. By testing our model on the Czech test collection from Cross Language Evaluation Forum 2007 Ad-Hoc track, we show positive contribution of using dependency syntax in this context.

pdf bib
Building a Web Corpus of Czech
Drahomíra „johanka“ Spoustová | Miroslav Spousta | Pavel Pecina
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build a largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to keep the whole process language independent and thus applicable also for building web corpora of other languages. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with a current corpora (in particular, SYN2005, part of the Czech National Corpora). We analyse part-of-speech tag distribution, OOV word ratio, average sentence length and Spearman rank correlation coefficient of the distance of ranks of 500 most frequent words. Our results show that our prototype corpus is now quite homogenous. The challenging task is to find a way to decrease the homogeneity of the text while keeping the high quality of the data.

2009

pdf bib
A Simple Automatic MT Evaluation Metric
Petr Homola | Vladislav Kuboň | Pavel Pecina
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf bib
Validating the Quality of Full Morphological Annotation
Drahomíra „johanka“ Spoustová | Pavel Pecina | Jan Hajič | Miroslav Spousta
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In our paper we present a methodology used for low-cost validation of quality of Part-of-Speech annotation of the Prague Dependency Treebank based on multiple re-annotation of data samples carefully selected with the help of several different Part-of-Speech taggers.

2006

pdf bib
Leveraging Recurrent Phrase Structure in Large-scale Ontology Translation
G. Craig Murray | Bonnie J. Dorr | Jimmy Lin | Jan Hajič | Pavel Pecina
Proceedings of the 11th Annual conference of the European Association for Machine Translation

pdf bib
Leveraging Reusability: Cost-Effective Lexical Acquisition for Large-Scale Ontology Translation
G. Craig Murray | Bonnie J. Dorr | Jimmy Lin | Jan Hajič | Pavel Pecina
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Combining Association Measures for Collocation Extraction
Pavel Pecina | Pavel Schlesinger
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Semi-automatic Building of Swedish Collocation Lexicon
Silvie Cinková | Pavel Pecina | Petr Podveský | Pavel Schlesinger
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This work focuses on semi-automatic extraction of verb-noun collocations from a corpus, performed to provide lexical evidence for the manual lexicographical processing of Support Verb Constructions (SVCs) in the Swedish-Czech Combinatorial Valency Lexicon of Predicate Nouns. Efficiency of pure manual extractionprocedure is significantly improved by utilization of automatic statistical methods based lexical association measures.

2005

pdf bib
An Extensive Empirical Study of Collocation Extraction Methods
Pavel Pecina
Proceedings of the ACL Student Research Workshop

Search
Co-authors
Venues