Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT

Jesús González-Rubio; Jorge Civera; Alfons Juan; Francisco Casacuberta

Abstract

Considerable effort is currently being devoted to the digitalisation of large historical document collections for preservation purposes. The documents in these collections are usually written in ancient languages, such as Latin or Greek, and this language barrier limits the general public's access to their content. Digital libraries therefore aim not only to store raw images of digitalised documents, but also to annotate them with the corresponding text transcriptions and translations into modern languages. Unfortunately, few electronic resources are available for ancient languages that natural language processing techniques could exploit. This paper describes the compilation of a novel Latin-Catalan parallel corpus as a new task for statistical machine translation (SMT). Preliminary experimental results obtained with a state-of-the-art phrase-based SMT system are also reported. These results reveal the complexity of the task and its challenging but interesting nature for future development.

Anthology ID: L10-1373
Volume: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month: May
Year: 2010
Address: Valletta, Malta
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/541_Paper.pdf
Cite (ACL): Jesús González-Rubio, Jorge Civera, Alfons Juan, and Francisco Casacuberta. 2010. Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal): Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT (González-Rubio et al., LREC 2010)
PDF: http://www.lrec-conf.org/proceedings/lrec2010/pdf/541_Paper.pdf