Holger Schwenk


2021

FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task
Yun Tang | Hongyu Gong | Xian Li | Changhan Wang | Juan Pino | Holger Schwenk | Naman Goyal
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training the two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascade-based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk | Vishrav Chaudhary | Shuo Sun | Hongyu Gong | Francisco Guzmán
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but we systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 16720 different language pairs, out of which only 34M are aligned with English. This corpus is freely available. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
Holger Schwenk | Guillaume Wenzek | Sergey Edunov | Edouard Grave | Armand Joulin | Angela Fan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzek et al., 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT’19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single ones by over 4 BLEU points and are on par with the best WMT’19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.

2020

MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis | Barlas Oguz | Ruty Rinott | Sebastian Riedel | Holger Schwenk
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, making building QA systems that work well in other languages challenging. In order to develop such systems, it is crucial to invest in high quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages, English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA has over 12K instances in English and 5K in each other language, with each instance parallel between 4 languages on average. We evaluate state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA. In all cases, transfer results are shown to be significantly behind training-language performance.

2019

Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
Vishrav Chaudhary | Yuqing Tang | Francisco Guzmán | Holger Schwenk | Philipp Koehn
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
Mikel Artetxe | Holger Schwenk
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
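
The margin idea described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the ratio form of the margin, the function names, the neighbourhood size `k` and the threshold are all assumptions chosen for illustration, and the sentence embeddings are taken to be plain lists of floats with non-zero norm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_top_k(query, candidates, k):
    """Average cosine similarity between `query` and its k closest candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def margin_score(x, y, src_pool, tgt_pool, k=2):
    """Cosine of (x, y) relative to the similarity of each side to its
    k nearest neighbours in the other language (ratio form, an assumption)."""
    neighbourhood = (mean_top_k(x, tgt_pool, k) + mean_top_k(y, src_pool, k)) / 2
    return cosine(x, y) / neighbourhood


def mine(src_pool, tgt_pool, k=2, threshold=1.0):
    """For each source sentence, keep its best-scoring target if the margin
    score reaches the (hypothetical) threshold."""
    pairs = []
    for i, x in enumerate(src_pool):
        scored = [(margin_score(x, y, src_pool, tgt_pool, k), j)
                  for j, y in enumerate(tgt_pool)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs
```

Because the score is relative to each sentence's own neighbourhood, a single threshold behaves more consistently across sentences than a hard cut-off on raw cosine similarity, which is the scale inconsistency the abstract refers to.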

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Mikel Artetxe | Holger Schwenk
Transactions of the Association for Computational Linguistics, Volume 7

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER.

2018

A Corpus for Multilingual Document Classification in Eight Languages
Holger Schwenk | Xian Li
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Filtering and Mining Parallel Data in a Joint Multilingual Space
Holger Schwenk
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive baseline on the WMT’14 English to German task by 0.3 BLEU by filtering out 25% of the training data. The same approach is used to mine additional bitexts for the WMT’14 system and to obtain competitive results on the BUCC shared task to identify parallel sentences in comparable corpora. The approach is generic, it can be applied to many language pairs and it is independent of the architecture of the machine translation system.

XNLI: Evaluating Cross-lingual Sentence Representations
Alexis Conneau | Ruty Rinott | Guillaume Lample | Adina Williams | Samuel Bowman | Holger Schwenk | Veselin Stoyanov
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.

2017

Learning Joint Multilingual Sentence Representations with Neural Machine Translation
Holger Schwenk | Matthijs Douze
Proceedings of the 2nd Workshop on Representation Learning for NLP

In this paper, we use the framework of neural machine translation to learn joint sentence representations across six very different languages. Our premise is that a representation that is independent of the language is likely to capture the underlying semantics. We define a new cross-lingual similarity measure, compare up to 1.4M sentence representations and study the characteristics of close sentences. We provide experimental evidence that sentences that are close in embedding space are indeed semantically highly related, but often have quite different structure and syntax. These relations also hold when comparing sentences in different languages.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau | Douwe Kiela | Holger Schwenk | Loïc Barrault | Antoine Bordes
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.

Very Deep Convolutional Networks for Text Classification
Alexis Conneau | Holger Schwenk | Loïc Barrault | Yann Lecun
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The dominant approaches for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which have pushed the state-of-the-art in computer vision. We present a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations. We are able to show that the performance of this model increases with the depth: using up to 29 convolutional layers, we report improvements over the state-of-the-art on several public text classification tasks. To the best of our knowledge, this is the first time that very deep convolutional nets have been applied to text processing.

2015

Incremental Adaptation Strategies for Neural Network Language Models
Alex Ter-Sarkisov | Holger Schwenk | Fethi Bougares | Loïc Barrault
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

Improving continuous space language models auxiliary features
Walid Aransa | Holger Schwenk | Loïc Barrault
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

Continuous Adaptation to User Feedback for Statistical Machine Translation
Frédéric Blain | Fethi Bougares | Amir Hazem | Loïc Barrault | Holger Schwenk
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

The MateCat Tool
Marcello Federico | Nicola Bertoldi | Mauro Cettolo | Matteo Negri | Marco Turchi | Marco Trombetti | Alessandro Cattelan | Antonio Farina | Domenico Lupinetti | Andrea Martines | Alberto Massidda | Holger Schwenk | Loïc Barrault | Frederic Blain | Philipp Koehn | Christian Buck | Ulrich Germann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

LIUM English-to-French spoken language translation system and the Vecsys/LIUM automatic speech recognition system for Italian language for IWSLT 2014
Anthony Rousseau | Loïc Barrault | Paul Deléglise | Yannick Estève | Holger Schwenk | Samir Bennacef | Armando Muscariello | Stephan Vanni
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the Spoken Language Translation system developed by the LIUM for the IWSLT 2014 evaluation campaign. We participated in two of the proposed tasks: (i) the Automatic Speech Recognition task (ASR) in two languages, Italian with the Vecsys company, and English alone, (ii) the English to French Spoken Language Translation task (SLT). We present the approaches and specificities found in our systems, as well as the results from the evaluation campaign.

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Kyunghyun Cho | Bart van Merriënboer | Caglar Gulcehre | Dzmitry Bahdanau | Fethi Bougares | Holger Schwenk | Yoshua Bengio
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction
Haithem Afli | Loïc Barrault | Holger Schwenk
Proceedings of the Sixth International Joint Conference on Natural Language Processing

A Multi-Domain Translation Model Framework for Statistical Machine Translation
Rico Sennrich | Holger Schwenk | Walid Aransa
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Issues in incremental adaptation of statistical MT from human post-edits
Mauro Cettolo | Christophe Servan | Nicola Bertoldi | Marcello Federico | Loïc Barrault | Holger Schwenk
Proceedings of the 2nd Workshop on Post-editing Technology and Practice

2012

Semi-supervised transliteration mining from parallel and comparable corpora
Walid Aransa | Holger Schwenk | Loic Barrault
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

Transliteration is the process of writing a word (mainly a proper noun) from one language in the alphabet of another language. This process requires mapping the pronunciation of the word from the source language to the closest possible pronunciation in the target language. In this paper we introduce a new semi-supervised transliteration mining method for parallel and comparable corpora. The method is mainly based on newly proposed Three Levels of Similarity (TLS) scores to extract the transliteration pairs. The first level calculates the similarity of all vowel and consonant letters. The second level calculates the similarity of long vowels and vowel letters at the beginning and end positions of the words, and of consonant letters. The third level calculates the similarity of consonant letters only. We applied our method to Arabic-English parallel and comparable corpora. We evaluated the extracted transliteration pairs using a statistical transliteration system. This system is built using letters instead of words as tokens. The transliteration system achieves an accuracy of 0.50 and a mean F-score of 0.8958 when trained on transliteration pairs extracted from a parallel corpus. The accuracy is 0.30 and the mean F-score 0.84 when we instead used a comparable corpus to automatically extract the transliteration pairs. This shows that the proposed semi-supervised transliteration mining algorithm is effective and can be applied to other language pairs. We also evaluated two segmentation techniques and reported the impact on the transliteration performance.

Incremental adaptation using translation information and post-editing analysis
Frédéric Blain | Holger Schwenk | Jean Senellart
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

It is well known that statistical machine translation systems perform best when they are adapted to the task. In this paper we propose new methods to quickly perform incremental adaptation without the need to obtain word-by-word alignments from GIZA or similar tools. The main idea is to use an automatic translation as a pivot to infer alignments between the source sentence and the reference translation, or user correction. We compared our approach to the standard method of incremental re-training. We achieve similar BLEU scores using fewer computational resources. Fast retraining is particularly interesting when we want to integrate user feedback almost instantly, for instance in a post-editing context or a machine translation assisted CAT tool. We also explore several methods to combine the translation models.

Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation
Holger Schwenk | Anthony Rousseau | Mohammed Attik
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT

LIUM’s SMT Machine Translation Systems for WMT 2012
Christophe Servan | Patrik Lambert | Anthony Rousseau | Holger Schwenk | Loïc Barrault
Proceedings of the Seventh Workshop on Statistical Machine Translation

Automatic Translation of Scientific Documents in the HAL Archive
Patrik Lambert | Holger Schwenk | Frédéric Blain
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the development of a statistical machine translation system between French and English for scientific papers. This system will be closely integrated into the French HAL open archive, a collection of more than 100,000 scientific papers. We describe the creation of in-domain parallel and monolingual corpora, the development of a domain specific translation system with the created resources, and its adaptation using monolingual resources only. These techniques allowed us to improve a generic system by more than 10 BLEU points.

Continuous Space Translation Models for Phrase-Based Statistical Machine Translation
Holger Schwenk
Proceedings of COLING 2012: Posters

A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT
Kashif Shah | Loïc Barrault | Holger Schwenk
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

The standard procedure to train the translation model of a phrase-based SMT system is to concatenate all available parallel data, to perform word alignment, to extract phrase pairs and to calculate translation probabilities by simple relative frequency. However, parallel data is quite inhomogeneous in many practical applications with respect to several factors like data source, alignment quality, appropriateness to the task, etc. We propose a general framework to take into account these factors during the calculation of the phrase-table, e.g. by better distributing the probability mass of the individual phrase pairs. No additional feature functions are needed. We report results on two well-known tasks: the IWSLT’11 and WMT’11 evaluations, in both conditions translating from English to French. We give detailed results for different functions to weight the bitexts. Our best systems improve a strong baseline by up to one BLEU point without any impact on the computational complexity during training or decoding.
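
As a rough illustration of weighting heterogeneous bitexts inside the relative-frequency estimate, the sketch below scales each extracted phrase pair by a weight attached to its source corpus. This is a minimal sketch, not the framework from the paper: the single scalar weight per corpus, the function name `weighted_phrase_table`, and the toy corpora are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_phrase_table(corpora, weights):
    """Estimate p(target | source) by weighted relative frequency.

    corpora: dict mapping a corpus name to a list of extracted
             (source_phrase, target_phrase) pairs.
    weights: dict mapping the same corpus names to a non-negative weight
             (hypothetical; one scalar per corpus is an assumption).
    """
    pair_mass = defaultdict(float)    # weighted count of each phrase pair
    source_mass = defaultdict(float)  # weighted count of each source phrase
    for name, pairs in corpora.items():
        w = weights[name]
        for src, tgt in pairs:
            pair_mass[(src, tgt)] += w
            source_mass[src] += w
    return {(s, t): mass / source_mass[s] for (s, t), mass in pair_mass.items()}


# Toy example: a small in-domain corpus and a larger out-of-domain one.
corpora = {
    "in_domain": [("maison", "house")],
    "out_domain": [("maison", "home"), ("maison", "home")],
}
uniform = weighted_phrase_table(corpora, {"in_domain": 1.0, "out_domain": 1.0})
adapted = weighted_phrase_table(corpora, {"in_domain": 2.0, "out_domain": 0.5})
```

With uniform weights this reduces to the standard relative-frequency estimate; raising the in-domain weight shifts probability mass towards phrase pairs seen in the in-domain data without adding feature functions to the table.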

Collaborative Machine Translation Service for Scientific texts
Patrik Lambert | Jean Senellart | Laurent Romary | Holger Schwenk | Florian Zipser | Patrice Lopez | Frédéric Blain
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

Traduction automatique à partir de corpus comparables: extraction de phrases parallèles à partir de données comparables multimodales (Automatic Translation from Comparable corpora : extracting parallel sentences from multimodal comparable corpora) [in French]
Haithem Afli | Loïc Barrault | Holger Schwenk
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

2011

Investigations on Translation Model Adaptation Using Monolingual Data
Patrik Lambert | Holger Schwenk | Christophe Servan | Sadaf Abdul-Rauf
Proceedings of the Sixth Workshop on Statistical Machine Translation

LIUM’s SMT Machine Translation Systems for WMT 2011
Holger Schwenk | Patrik Lambert | Loïc Barrault | Christophe Servan | Sadaf Abdul-Rauf | Haithem Afli | Kashif Shah
Proceedings of the Sixth Workshop on Statistical Machine Translation

Parametric Weighting of Parallel Data for Statistical Machine Translation
Kashif Shah | Loïc Barrault | Holger Schwenk
Proceedings of 5th International Joint Conference on Natural Language Processing

Qualitative Analysis of Post-Editing for High Quality Machine Translation
Frédéric Blain | Jean Senellart | Holger Schwenk | Mirko Plitt | Johann Roturier
Proceedings of Machine Translation Summit XIII: Papers

LIUM’s systems for the IWSLT 2011 speech translation tasks
Anthony Rousseau | Fethi Bougares | Paul Deléglise | Holger Schwenk | Yannick Estève
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the three systems developed by the LIUM for the IWSLT 2011 evaluation campaign. We participated in three of the proposed tasks, namely the Automatic Speech Recognition task (ASR), the ASR system combination task (ASR_SC) and the Spoken Language Translation task (SLT), since these tasks are all related to speech translation. We present the approaches and specificities we developed on each task.

2010

LIUM SMT Machine Translation System for WMT 2010
Patrik Lambert | Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Translation Model Adaptation by Resampling
Kashif Shah | Loïc Barrault | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

N-gram-based machine translation enhanced with neural networks
Francisco Zamora-Martinez | Maria Jose Castro-Bleda | Holger Schwenk
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

Adaptation d’un Système de Traduction Automatique Statistique avec des Ressources monolingues
Holger Schwenk
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The performance of a statistical machine translation system depends heavily on the quality and quantity of the available training data. Most freely available parallel texts come from international organizations. The jargon observed in these texts is not well suited to building a translation system for other domains. In this paper, we present a technique to adapt the translation model to a different domain using texts in the source language only. We obtain significant improvements in the BLEU score for translation systems from Arabic into French and into English.

2009

Translation Model Adaptation for an Arabic/French News Translation System by Lightly-Supervised Training
Holger Schwenk | Jean Senellart
Proceedings of Machine Translation Summit XII: Posters

On the Use of Comparable Corpora to Improve SMT performance
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

LIUM’s statistical machine translation system for IWSLT 2009
Holger Schwenk | Loïc Barrault | Yannick Estève | Patrik Lambert
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the systems developed by the LIUM laboratory for the 2009 IWSLT evaluation. We participated in the Arabic and Chinese to English BTEC tasks. We developed three different systems: a statistical phrase-based system using the Moses toolkit, a Statistical Post-Editing system and a hierarchical phrase-based system based on Joshua. A continuous space language model was deployed to improve the modeling of the target language. These systems are combined by a confusion network based approach.

SMT and SPE Machine Translation Systems for WMT’09
Holger Schwenk | Sadaf Abdul-Rauf | Loïc Barrault | Jean Senellart
Proceedings of the Fourth Workshop on Statistical Machine Translation

Exploiting Comparable Corpora with TER and TERp
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2008

First Steps towards a General Purpose French/English Statistical Machine Translation System
Holger Schwenk | Jean-Baptiste Fouet | Jean Senellart
Proceedings of the Third Workshop on Statistical Machine Translation

Large and Diverse Language Models for Statistical Machine Translation
Holger Schwenk | Philipp Koehn
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

The LIUM Arabic/English statistical machine translation system for IWSLT 2008.
Holger Schwenk | Yannick Estève | Sadaf Abdul Rauf
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We only participated in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit and SYSTRAN’s rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.

Investigations on large-scale lightly-supervised training for statistical machine translation.
Holger Schwenk
Proceedings of the 5th International Workshop on Spoken Language Translation: Papers

Sentence-aligned bilingual texts are a crucial resource to build statistical machine translation (SMT) systems. In this paper we propose to apply lightly-supervised training to produce additional parallel data. The idea is to translate large amounts of monolingual data (up to 275M words) with an SMT system, and to use those translations as additional training data. Results are reported for the translation from French into English. We consider two setups: first, the initial SMT system is only trained with a very limited amount of human-produced translations, and then the case where we have more than 100 million words. In both conditions, lightly-supervised training achieves significant improvements of the BLEU score.

2007

Modèles statistiques enrichis par la syntaxe pour la traduction automatique
Holger Schwenk | Daniel Déchelotte | Hélène Bonneau-Maynard | Alexandre Allauzen
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Phrase-based statistical machine translation is a promising approach. In this paper, we present two complementary developments. The first models the target language in a continuous space. The second integrates morpho-syntactic categories into the units handled by the translation model. Both approaches are evaluated on the Tc-Star task. The most interesting results are obtained by combining the two methods.

Combining Morphosyntactic Enriched Representation with n-best Reranking in Statistical Translation
Hélène Bonneau-Maynard | Alexandre Allauzen | Daniel Déchelotte | Holger Schwenk
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

Building a Statistical Machine Translation System for French Using the Europarl Corpus
Holger Schwenk
Proceedings of the Second Workshop on Statistical Machine Translation

Smooth Bilingual N-Gram Translation
Holger Schwenk | Marta R. Costa-jussà | Jose A. R. Fonollosa
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

A state-of-the-art statistical machine translation system based on Moses
Daniel Déchelotte | Holger Schwenk | Hélène Bonneau-Maynard | Alexandre Allauzen | Gilles Adda
Proceedings of Machine Translation Summit XI: Papers

The TALP ngram-based SMT system for IWSLT 2007
Patrik Lambert | Marta R. Costa-jussà | Josep M. Crego | Maxim Khalilov | José B. Mariño | Rafael E. Banchs | José A. R. Fonollosa | Holger Schwenk
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters as a function of translation metric scores and rescoring with a neural network language model. Results on two translation directions are reported, namely from Arabic and Chinese into English, thoroughly explaining all language-related preprocessing and translation schemes.

2006

Continuous Space Language Models for Statistical Machine Translation
Holger Schwenk | Daniel Dechelotte | Jean-Luc Gauvain
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Continuous space language models for the IWSLT 2006 task
Holger Schwenk | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Third International Workshop on Spoken Language Translation: Papers

2005

Training Neural Network Language Models on Very Large Corpora
Holger Schwenk | Jean-Luc Gauvain
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
