Alfons Juan

Also published as: Alfons Juan-Císcar


Direct Segmentation Models for Streaming Speech Translation
Javier Iranzo-Sánchez | Adrià Giménez Pastor | Joan Albert Silvestre-Cerdà | Pau Baquero-Arnal | Jorge Civera Saiz | Alfons Juan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The cascade approach to Speech Translation (ST) is based on a pipeline that chains an Automatic Speech Recognition (ASR) system and a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into chunks that are, ideally, semantically self-contained, to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.


The MLLP-UPV Supervised Machine Translation Systems for WMT19 News Translation Task
Javier Iranzo-Sánchez | Gonçal Garcés Díaz-Munío | Jorge Civera | Alfons Juan
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 News Translation Shared Task. In this edition, we have submitted systems for the German ↔ English and German ↔ French language pairs, participating in both directions of each pair. Our submitted systems, based on the Transformer architecture, make ample use of data filtering, synthetic data and domain adaptation through fine-tuning.

The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task
Pau Baquero-Arnal | Javier Iranzo-Sánchez | Jorge Civera | Alfons Juan
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 Similar Language Translation Shared Task. We have submitted systems for the Portuguese ↔ Spanish language pair, in both directions. We have submitted systems based on the Transformer architecture as well as a novel architecture, still in development, which we have called 2D alternating RNN. We have carried out domain adaptation through fine-tuning.


The MLLP-UPV German-English Machine Translation System for WMT18
Javier Iranzo-Sánchez | Pau Baquero-Arnal | Gonçal V. Garcés Díaz-Munío | Adrià Martínez-Villaronga | Jorge Civera | Alfons Juan
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the statistical machine translation system built by the MLLP research group of Universitat Politècnica de València for the German→English news translation shared task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We used an ensemble of Transformer architecture–based neural machine translation systems. To train our system under “constrained” conditions, we filtered the provided parallel data with a scoring technique using character-based language models, and we added parallel data based on synthetic source sentences generated from the provided monolingual corpora.


The MLLP ASR systems for IWSLT 2015
Miguel Ángel Del Agua Teba | Adrià Agusti Martinez Villaronga | Santiago Piqueras Gozalbes | Adrià Giménez Pastor | José Alberto Sanchis Navarro | Jorge Civera Saiz | Alfons Juan-Císcar
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign


Comparison of data selection techniques for the translation of video lectures
Joern Wuebker | Hermann Ney | Adrià Martínez-Villaronga | Adrià Giménez | Alfons Juan | Christophe Servan | Marc Dymetman | Shachar Mirkin
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.


Minimum Bayes-risk System Combination
Jesús González-Rubio | Alfons Juan | Francisco Casacuberta
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies


The RODRIGO Database
Nicolas Serrano | Francisco Castro | Alfons Juan
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Annotation of digitized pages from historical document collections is very important to research on automatic extraction of text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA, which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.

Saturnalia: A Latin-Catalan Parallel Corpus for Statistical MT
Jesús González-Rubio | Jorge Civera | Alfons Juan | Francisco Casacuberta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Currently, a great effort is being made to digitalise large historical document collections for preservation purposes. The documents in these collections are usually written in ancient languages, such as Latin or Greek, which limits the access of the general public to their content due to the language barrier. Therefore, digital libraries aim not only to store raw images of digitalised documents, but also to annotate them with their corresponding text transcriptions and translations into modern languages. Unfortunately, few electronic resources are available for ancient languages to be exploited by natural language processing techniques. This paper describes the compilation process of a novel Latin-Catalan parallel corpus as a new task for statistical machine translation (SMT). Preliminary experimental results are also reported using a state-of-the-art phrase-based SMT system. The results presented in this work reveal the complexity of the task and its challenging but interesting nature for future development.


A Phrase-Based Hidden Semi-Markov Approach to Machine Translation
Jesús Andrés-Ferrer | Alfons Juan
Proceedings of the 13th Annual conference of the European Association for Machine Translation


Bilingual Text Classification using the IBM 1 Translation Model
Jorge Civera | Alfons Juan-Císcar
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Manual categorisation of documents is a time-consuming task that has been significantly alleviated with the deployment of automatic and machine-aided text categorisation systems. However, the proliferation of multilingual documentation has become a common phenomenon in many international organisations, while most of the current systems have focused on the categorisation of monolingual text. It has been recently shown that the inherent redundancy in bilingual documents can be effectively exploited by relatively simple, bilingual naive Bayes (multinomial) models. In this work, we present a refined version of these models in which this redundancy is explicitly captured by a combination of a unigram (multinomial) model and the well-known IBM 1 translation model. The proposed model is evaluated on two bilingual classification tasks and compared to previous work.

A novel alignment model inspired on IBM Model 1
Jesús González-Rubio | Germán Sanchis-Trilles | Alfons Juan | Francisco Casacuberta
Proceedings of the 12th Annual conference of the European Association for Machine Translation


Domain Adaptation in Statistical Machine Translation with Mixture Modelling
Jorge Civera | Alfons Juan
Proceedings of the Second Workshop on Statistical Machine Translation

Estimation of confidence measures for machine translation
Alberto Sanchis | Alfons Juan | Enrique Vidal
Proceedings of Machine Translation Summit XI: Papers


Mixtures of IBM Model 2
Jorge Civera | Alfons Juan
Proceedings of the 11th Annual conference of the European Association for Machine Translation

Bilingual Machine-Aided Indexing
Jorge Civera | Alfons Juan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06)

The proliferation of multilingual documentation in our Information Society has become a common phenomenon. This documentation is usually categorised by hand, which is a time-consuming and arduous burden. This is particularly true in the case of keyword assignment, in which a list of keywords (descriptors) from a controlled vocabulary (thesaurus) is assigned to a document. A possible solution to alleviate this problem is offered by so-called Machine-Aided Indexing (MAI) systems. These systems work in cooperation with professional indexers by providing an initial list of descriptors from which the most appropriate ones are selected. This way of proceeding increases the productivity and eases the task of indexers. In this paper, we propose a statistical text classification framework for bilingual documentation, from which we derive two novel bilingual classifiers based on the naive combination of monolingual classifiers. We report preliminary results on the multilingual corpus Acquis Communautaire (AC) that demonstrate the suitability of the proposed classifiers as the backend of a fully working MAI system.


Adapting finite-state translation to the TransType2 project
Elsa Cubel | Jorge González | Antonio Lagarda | Francisco Casacuberta | Alfons Juan | Enrique Vidal
EAMT Workshop: Improving MT through other language technology tools: resources and tools for building MT