Sergio Penkale


2014

pdf bib
Bilingual Termbank Creation via Log-Likelihood Comparison and Phrase-Based Statistical Machine Translation
Rejwanul Haque | Sergio Penkale | Andy Way
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

2013

pdf bib
Tailor-made quality-controlled translation
Sergio Penkale
Proceedings of Translating and the Computer 35

2012

pdf bib
SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase efficiency of subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss in detail each data pre-processing step using various approaches. Apart from the quantity (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First of all, high quality both in terms of translation and in terms of high-precision alignment of parallel documents and their contents has been achieved. Secondly, the contents are provided in one consistent format and encoding. Finally, additional information such as type of content in terms of genres and domain is available.

pdf bib
SmartMATE: An Online End-To-End MT Post-Editing Framework
Sergio Penkale | Andy Way
Workshop on Post-Editing Technology and Practice

It is a well-known fact that the amount of content which is available to be translated and localized far outnumbers the current amount of translation resources. Automation in general and Machine Translation (MT) in particular are one of the key technologies which can help improve this situation. However, a tool that integrates all of the components needed for the localization process is still missing, and MT is still out of reach for most localisation professionals. In this paper we present an online translation environment which empowers users with MT by enabling engines to be created from their data, without a need for technical knowledge or special hardware requirements and at low cost. Documents in a variety of formats can then be post-edited after being processed with their Translation Memories, MT engines and glossaries. We give an overview of the tool and present a case study of a project for a large games company, showing the applicability of our tool.

pdf bib
From Subtitles to Parallel Corpora
Mark Fishel | Yota Georgakopoulou | Sergio Penkale | Volha Petukhova | Matej Rojc | Martin Volk | Andy Way
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2010

pdf bib
Accuracy-Based Scoring for Phrase-Based Statistical Machine Translation
Sergio Penkale | Yanjun May | Daniel Galron | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring translation quality, the estimation of the features themselves does not have any relation to such quality metrics. In this paper, we introduce a translation quality-based feature to PB-SMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging the edit-distance between phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and phrase pairs occurring in the N-best list. Using our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also report that using our method we can achieve statistically significant improvements over the baseline using many other MT evaluation metrics, and a substantial increase in speed and reduction in memory use (due to a reduction in phrase-table size of 87%) while maintaining significant gains in translation quality.

pdf bib
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf bib
Accuracy-Based Scoring for DOT: Towards Direct Error Minimization for Data-Oriented Translation
Daniel Galron | Sergio Penkale | Andy Way | I. Dan Melamed
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
MATREX: The DCU MT System for WMT 2009
Jinhua Du | Yifan He | Sergio Penkale | Andy Way
Proceedings of the Fourth Workshop on Statistical Machine Translation