Laurent Besacier - ACL Anthology

Laurent Besacier

Also published as: L. Besacier

2025

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet | Laurent Besacier | Jos Rozen
Proceedings of the 31st International Conference on Computational Linguistics

Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models’ context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, we propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario in which the long contexts consist of transcripts obtained by automatic speech recognition, presenting unique challenges for LLMs due to the inherent noisiness and oral nature of such data. Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers, as well as noisy versions of meeting transcripts altered to target different Word Error Rate levels. Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models, and point out their discrepancies in terms of robustness to transcript noise. We also provide a thorough analysis of our GPT-4-based evaluation, including insights from a crowdsourcing study. Our findings indicate that while GPT-4’s scores align with human judges, its ability to distinguish beyond three score levels may be limited.
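The Word Error Rate mentioned in this abstract is conventionally defined as the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. A minimal Python sketch of that standard definition follows; the function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from ELITR-Bench.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
wer = word_error_rate("the meeting starts at noon", "the meeting start at noon")
```

Whitespace tokenization is used here for simplicity; real evaluations usually normalize case and punctuation first.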

NAVER LABS Europe Submission to the Instruction-following Track
Beomseok Lee | Marcely Zanon Boito | Laurent Besacier | Ioan Calapodescu
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

In this paper we describe the NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained settings, developing systems that can simultaneously perform ASR, ST, and SQA tasks from English speech input into the following target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.

Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Beomseok Lee | Marco Gaido | Ioan Calapodescu | Laurent Besacier | Matteo Negri
Proceedings of the 31st International Conference on Computational Linguistics

While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.

2023

Encoding Sentence Position in Context-Aware Neural Machine Translation with Concatenation
Lorenzo Lupo | Marco Dinarelli | Laurent Besacier
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

Context-aware translation can be achieved by processing a concatenation of consecutive sentences with the standard Transformer architecture. This paper investigates the intuitive idea of providing the model with explicit information about the position of the sentences contained in the concatenation window. We compare various methods to encode sentence positions into token representations, including novel methods. Our results show that the Transformer benefits from certain sentence position encoding methods on English to Russian translation, if trained with a context-discounted loss. However, the same benefits are not observed on English to German. Further empirical efforts are necessary to define the conditions under which the proposed approach is beneficial.

2022

Learning From Failure: Data Capture in an Australian Aboriginal Community
Eric Le Ferrand | Steven Bird | Laurent Besacier
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most low resource language technology development is premised on the need to collect data for training statistical models. When we follow the typical process of recording and transcribing text for small Indigenous languages, we hit up against the so-called “transcription bottleneck.” Therefore it is worth exploring new ways of engaging with speakers which generate data while avoiding the transcription bottleneck. We have deployed a prototype app for speakers to use for confirming system guesses in an approach to transcription based on word spotting. However, in the process of testing the app we encountered many new problems for engagement with speakers. This paper presents a close-up study of the process of deploying data capture technology on the ground in an Australian Aboriginal community. We reflect on our interactions with participants and draw lessons that apply to anyone seeking to develop methods for language data collection in an Indigenous community.

What Do Compressed Multilingual Machine Translation Models Forget?
Alireza Mohammadshahi | Vassilina Nikoulina | Alexandre Berard | Caroline Brun | James Henderson | Laurent Besacier
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, very large pre-trained models have achieved state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques make it possible to drastically reduce the size of the models, and therefore their inference time, with negligible impact on top-tier metrics. However, the general performance averaged across multiple tasks and/or languages may hide a drastic performance drop on under-represented features, which could result in the amplification of biases encoded by the models. In this work, we assess the impact of compression methods on Multilingual Neural Machine Translation models (MNMT) for various language groups, gender, and semantic biases by extensive analysis of compressed models on different machine translation benchmarks, i.e. FLORES-101, MT-Gender, and DiBiMT. We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. Interestingly, the removal of noisy memorization with compression leads to a significant improvement for some medium-resource languages. Finally, we demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings
Marcely Zanon Boito | Bolaji Yusuf | Lucas Ondel | Aline Villavicencio | Laurent Besacier
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

Documenting languages helps to prevent the extinction of endangered dialects, many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, unsupervised word segmentation (UWS) from speech is a useful, yet challenging, task. It consists of producing timestamps for slicing utterances into smaller segments corresponding to words, and is performed from phonetic transcriptions or, in their absence, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units that can be applied for downstream (text-based) tasks. In this paper we compare five of these models, three Bayesian and two neural approaches, with regard to the exploitability of the produced units for UWS. For the UWS task, we experiment with two models, using as our target language Mboshi (Bantu C25), an unwritten language from Congo-Brazzaville. Additionally, we report results for Finnish, Hungarian, Romanian and Russian in equally low-resource settings, using only 4 hours of speech. Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using Bayesian models that produce high quality, yet compressed, discrete representations of the input speech signal.

SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages
Alireza Mohammadshahi | Vassilina Nikoulina | Alexandre Berard | Caroline Brun | James Henderson | Laurent Besacier
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the “curse of multilinguality”, these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19 and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.

Fashioning Local Designs from Generic Speech Technologies in an Australian Aboriginal Community
Éric Le Ferrand | Steven Bird | Laurent Besacier
Proceedings of the 29th International Conference on Computational Linguistics

An increasing number of papers have been addressing issues related to low-resource languages and the transcription bottleneck paradigm. After several years spent in Northern Australia, where some of the strongest Aboriginal languages are spoken, we observed a gap between the motivations depicted in research contributions in this space and the Northern Australian context. In this paper, we address this gap in research by exploring the potential of speech recognition in an Aboriginal community. We describe our work from training a spoken term detection system to its implementation in an activity with Aboriginal participants. We report, on the one hand, how speech recognition technologies can find their place in an Aboriginal context and, on the other, the methodological paths that allowed us to reach better comprehension and engagement from Aboriginal participants.

Divide and Rule: Effective Pre-Training for Context-Aware Multi-Encoder Translation Models
Lorenzo Lupo | Marco Dinarelli | Laurent Besacier
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-encoder models are a broad family of context-aware neural machine translation systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, trained on document-level data. In this work, we discuss the difficulty of training these parameters effectively, due to the sparsity of the words in need of context (i.e., the training signal), and their relevant context. We propose to pre-train the contextual parameters over split sentence pairs, which makes efficient use of the available data for two reasons. Firstly, it increases the contextual training signal by breaking intra-sentential syntactic relations, thus pushing the model to search the context for disambiguating clues more frequently. Secondly, it eases the retrieval of relevant context, since context segments become shorter. We propose four different splitting methods, and evaluate our approach with BLEU and contrastive test sets. Results show that it consistently improves learning of contextual parameters, both in low and high resource settings.

Weakly Supervised Word Segmentation for Computational Language Documentation
Shu Okabe | Laurent Besacier | François Yvon
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word and morpheme segmentation are fundamental steps of language documentation, as they make it possible to discover lexical units in a language for which the lexicon is unknown. However, in most language documentation scenarios, linguists do not start from a blank page: they may already have a pre-existing dictionary or have initiated manual segmentation of a small part of their data. This paper studies how such weak supervision can be exploited in Bayesian non-parametric models of segmentation. Our experiments on two very low resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality. In addition, we investigate an incremental learning scenario where manual segmentations are provided in a sequential manner. This work opens the way for interactive annotation tools for documentary linguists.

Using ASR-Generated Text for Spoken Language Modeling
Nicolas Hervé | Valentin Pelloin | Benoit Favre | Franck Dary | Antoine Laurent | Sylvain Meignier | Laurent Besacier
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

This paper aims at improving spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training an LM from scratch. The new models (FlauBERT-Oral) will be shared with the community and are evaluated not only in terms of word prediction accuracy but also for two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version, demonstrating that, despite its inherent noisy nature, ASR-generated text can be useful to improve spoken language modeling.

Focused Concatenation for Context-Aware Neural Machine Translation
Lorenzo Lupo | Marco Dinarelli | Laurent Besacier
Proceedings of the Seventh Conference on Machine Translation (WMT)

A straightforward approach to context-aware neural machine translation consists in feeding the standard encoder-decoder architecture with a window of consecutive sentences, formed by the current sentence and a number of sentences from its context concatenated to it. In this work, we propose an improved concatenation approach that encourages the model to focus on the translation of the current sentence, discounting the loss generated by target context. We also propose an additional improvement that strengthens the notion of sentence boundaries and of relative sentence distance, facilitating model compliance to the context-discounted objective. We evaluate our approach with both average-translation quality metrics and contrastive test sets for the translation of inter-sentential discourse phenomena, proving its superiority to the vanilla concatenation approach and other sophisticated context-aware systems.
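One plausible reading of "discounting the loss generated by target context" is to scale the per-token loss of context tokens by a factor below one while leaving current-sentence tokens at full weight. The Python sketch below illustrates that reading only; the function name, the `discount` parameter and the toy loss values are assumptions, not the authors' implementation.

```python
def context_discounted_loss(token_losses, is_current, discount=0.5):
    """Sum per-token losses, scaling context-token losses by `discount`.

    token_losses: per-token loss values for the whole concatenated target window.
    is_current:   booleans, True where the token belongs to the current sentence.
    discount:     weight in [0, 1] applied to context tokens (assumed name).
    """
    if len(token_losses) != len(is_current):
        raise ValueError("token_losses and is_current must have equal length")
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    total = 0.0
    for loss, current in zip(token_losses, is_current):
        total += loss if current else discount * loss
    return total


# Toy window: two context tokens followed by two current-sentence tokens.
losses = [2.0, 4.0, 1.0, 3.0]
mask = [False, False, True, True]
loss = context_discounted_loss(losses, mask, discount=0.5)
```

Under this sketch, `discount=1.0` recovers the plain concatenation loss and `discount=0.0` ignores the target context entirely.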

2021

User-friendly Automatic Transcription of Low-resource Languages: Plugging ESPnet into Elpis
Oliver Adams | Benjamin Galliot | Guillaume Wisniewski | Nicholas Lambourne | Ben Foley | Rahasya Sanders-Dwyer | Janet Wiles | Alexis Michaud | Séverine Guillaume | Laurent Besacier | Christopher Cox | Katya Aplonova | Guillaume Jacques | Nathan Hill
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation
Siddique Latif | Inyoung Kim | Ioan Calapodescu | Laurent Besacier
Proceedings of the 25th Conference on Computational Natural Language Learning

While End-2-End Text-to-Speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text, in order to code information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a specific dataset for this purpose and show that it makes it possible to train a TTS system where this fine-grained prosodic feature can be correctly conveyed using control tokens. Our evaluation compares synthetic and natural utterances and shows that prosodic patterns of contrastive focus (variations of F0, intensity and duration) can be learnt accurately. Such a milestone is important to allow, for example, smart speakers to be programmatically controlled in terms of output prosody.

Multilingual Unsupervised Neural Machine Translation with Denoising Adapters
Ahmet Üstün | Alexandre Berard | Laurent Besacier | Matthias Gallé
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data by using auxiliary parallel language pairs. For this problem, the standard procedure so far to leverage the monolingual data is _back-translation_, which is computationally costly and hard to tune. In this paper we propose instead to use _denoising adapters_, adapter layers with a denoising objective, on top of pre-trained mBART-50. In addition to the modularity and flexibility of such an approach, we show that the resulting translations are on par with back-translation as measured by BLEU, and furthermore that it allows adding unseen languages incrementally.

Visualizing Cross‐Lingual Discourse Relations in Multilingual TED Corpora
Zae Myung Kim | Vassilina Nikoulina | Dongyeop Kang | Didier Schwab | Laurent Besacier
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

This paper presents an interactive data dashboard that provides users with an overview of the preservation of discourse relations among 28 language pairs. We display a graph network depicting the cross-lingual discourse relations between a pair of languages for multilingual TED talks and provide a search function to look for sentences with specific keywords or relation types, facilitating ease of analysis on the cross-lingual discourse relations.

Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads?
Zae Myung Kim | Laurent Besacier | Vassilina Nikoulina | Didier Schwab
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Phone Based Keyword Spotting for Transcribing Very Low Resource Languages
Eric Le Ferrand | Steven Bird | Laurent Besacier
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust speech recognition system. This work is grounded in a very low-resource language documentation scenario where only a few minutes of recording have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection through searches in phone confusion networks with a lexicon expressed as a finite state automaton. Experimental results show that a phone recognition based approach provides better overall performance than Dynamic Time Warping when working with clean data, and highlight the benefits of each method for two types of speech corpus.

Investigating the Impact of Gender Representation in ASR Training Data: a Case Study on Librispeech
Mahault Garnerin | Solange Rossato | Laurent Besacier
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

In this paper we question the impact of gender representation in training data on the performance of an end-to-end ASR system. We create an experiment based on the Librispeech corpus and build 3 different training corpora varying only the proportion of data produced by each gender category. We observe that although our system is overall robust to the gender balance or imbalance in training data, it is nonetheless dependent on the adequacy between the individuals present in the training and testing sets.

Findings of the WMT Shared Task on Machine Translation Using Terminologies
Md Mahfuz Ibn Alam | Ivana Kvapilíková | Antonios Anastasopoulos | Laurent Besacier | Georgiana Dinu | Marcello Federico | Matthias Gallé | Kweonwoo Jung | Philipp Koehn | Vassilina Nikoulina
Proceedings of the Sixth Conference on Machine Translation

Language domains that require very careful use of terminology are abundant and reflect a significant part of the translation industry. In this work we introduce a benchmark for evaluating the quality and consistency of terminology translation, focusing on the medical (and COVID-19 specifically) domain for five language pairs: English to French, Chinese, Russian, and Korean, as well as Czech to German. We report the descriptions and results of the participating systems, commenting on the need for further research efforts towards both more adequate handling of terminologies as well as towards a proper formulation and evaluation of the task.

Contribution d’informations syntaxiques aux capacités de généralisation compositionelle des modèles seq2seq convolutifs (Assessing the Contribution of Syntactic Information for Compositional Generalization of seq2seq Convolutional Networks)
Diana Nicoleta Popa | William N. Havard | Maximin Coavoux | Eric Gaussier | Laurent Besacier
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Seq2seq neural models show surprising predictive abilities when trained on sufficiently large data. However, they fail to generalize satisfactorily when the task involves learning and reusing systematic composition rules rather than simply learning by imitating training examples. The SCAN dataset, which consists of natural language commands paired with action sequences, was specifically designed to evaluate the ability of neural networks to learn this kind of compositional generalization. In this paper, we study the contribution of syntactic information to the compositional generalization abilities of convolutional seq2seq neural networks.

Lightweight Adapter Tuning for Multilingual Speech Translation
Hang Le | Juan Pino | Changhan Wang | Jiatao Gu | Didier Schwab | Laurent Besacier
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Adapter modules were recently introduced as an efficient alternative to fine-tuning in NLP. Adapter tuning consists in freezing pre-trained parameters of a model and injecting lightweight modules between layers, resulting in the addition of only a small number of task-specific trainable parameters. While adapter tuning was investigated for multilingual neural machine translation, this paper proposes a comprehensive analysis of adapters for multilingual speech translation (ST). Starting from different pre-trained models (a multilingual ST trained on parallel data or a multilingual BART (mBART) trained on non-parallel multilingual data), we show that adapters can be used to: (a) efficiently specialize ST to specific language pairs with a low extra cost in terms of parameters, and (b) transfer from an automatic speech recognition (ASR) task and an mBART pre-trained model to a multilingual ST task. Experiments show that adapter tuning offers competitive results to full fine-tuning, while being much more parameter-efficient.
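To make the "small number of task-specific trainable parameters" concrete, the sketch below counts the parameters of a common bottleneck adapter design: a down-projection, a nonlinearity, an up-projection and a residual connection. The design (no layer normalization inside the adapter), the function names and the sizes are assumptions chosen for illustration, not figures from the paper.

```python
def adapter_param_count(d_model: int, bottleneck: int, use_bias: bool = True) -> int:
    """Trainable parameters of one bottleneck adapter (assumed design)."""
    # Down-projection: d_model -> bottleneck
    down = d_model * bottleneck + (bottleneck if use_bias else 0)
    # Up-projection: bottleneck -> d_model
    up = bottleneck * d_model + (d_model if use_bias else 0)
    return down + up


def trainable_fraction(frozen_params: int, d_model: int,
                       bottleneck: int, n_adapters: int) -> float:
    """Share of all parameters that are trainable when only adapters are tuned."""
    added = n_adapters * adapter_param_count(d_model, bottleneck)
    return added / (frozen_params + added)


# Hypothetical sizes: hidden size 512, bottleneck 64.
per_adapter = adapter_param_count(512, 64)
```

With these hypothetical sizes each adapter adds 66,112 parameters, so a handful of adapters remains small next to a frozen model with tens or hundreds of millions of parameters.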

2020

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
Marcely Zanon Boito | William Havard | Mahault Garnerin | Éric Le Ferrand | Laurent Besacier
Proceedings of the Twelfth Language Resources and Evaluation Conference

The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages is not exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for typologically different language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le | Juan Pino | Changhan Wang | Jiatao Gu | Didier Schwab | Laurent Besacier
Proceedings of the 28th International Conference on Computational Linguistics

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and also outperform bilingual one-to-one results. Furthermore, our parallel models demonstrate no trade-off between ASR and ST compared to the vanilla multi-task architecture. Our code and pre-trained models are available at https://github.com/formiel/speech-translation.

ON-TRAC Consortium for End-to-End and Simultaneous Speech Translation Challenge Tasks at IWSLT 2020
Maha Elbayad | Ha Nguyen | Fethi Bougares | Natalia Tomashenko | Antoine Caubrière | Benjamin Lecouteux | Yannick Estève | Laurent Besacier
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2020, offline speech translation and simultaneous speech translation. ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). Attention-based encoder-decoder models, trained end-to-end, were used for our submissions to the offline speech translation track. Our contributions focused on data augmentation and ensembling of multiple models. In the simultaneous speech translation track, we build on Transformer-based wait-k models for the text-to-text subtask. For speech-to-text simultaneous translation, we attach a wait-k MT system to a hybrid ASR system. We propose an algorithm to control the latency of the ASR+MT cascade and achieve a good latency-quality trade-off on both subtasks.
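The wait-k policy referred to in this abstract is usually described as follows: the decoder first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. A minimal Python sketch of that read schedule is given below; the function name and the example lengths are illustrative assumptions, not part of the ON-TRAC systems.

```python
def wait_k_schedule(source_len: int, target_len: int, k: int) -> list:
    """Number of source tokens read before emitting each target token.

    For the t-th target token (1-indexed), a wait-k decoder has read
    min(k + t - 1, source_len) source tokens.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if source_len < 0 or target_len < 0:
        raise ValueError("lengths must be non-negative")
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


# Hypothetical example: 5 source tokens, 6 target tokens, k = 3.
schedule = wait_k_schedule(source_len=5, target_len=6, k=3)
```

A larger k trades latency for context: with k at least as large as the source length, every target token is produced after the full source has been read, which corresponds to offline translation.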

Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
Marcely Zanon Boito | Aline Villavicencio | Laurent Besacier
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

For endangered languages, data collection campaigns have to accommodate the challenge that many of them come from an oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate them into a widely spoken language to ensure interpretability of the recordings. In this paper we investigate how the choice of translation language affects the posterior documentation work and potential automatic approaches which will work on top of the produced bilingual corpus. For answering this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) for creating 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of language for translation influences the word segmentation performance, and that different lexicons are learned by using different aligned translations. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models’ input representation increases their translation and alignment quality, especially for challenging language pairs.

Représentation du genre dans des données open source de parole (Gender representation in open source speech resources)
Mahault Garnerin | Solange Rossato | Laurent Besacier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics and transparency in AI systems has become a central concern within the research community. In this paper, we propose a study of gender representation in the speech resources available on the Open Speech and Language Resource platform. A very first result is the difficulty of accessing information about speakers’ gender. We then show that the balance between gender categories depends on various corpus characteristics (elicited or non-elicited speech, task addressed). Building on previous work, we restate a few principles concerning metadata, with a view to ensuring better transparency of the speech systems built using these corpora.

Gender Representation in Open Source Speech Resources
Mahault Garnerin | Solange Rossato | Laurent Besacier
Proceedings of the Twelfth Language Resources and Evaluation Conference

With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics, transparency and fairness of AI systems has become a central concern within the research community. We address transparency and fairness in spoken language systems by proposing a study about gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non-elicited speech, low/high resource language, speech task targeted). The paper ends with recommendations about metadata and gender information for researchers in order to ensure better transparency of the speech systems built using such corpora.

Monolingual Adapters for Zero-Shot Neural Machine Translation
Jerin Philip | Alexandre Berard | Matthias Gallé | Laurent Besacier
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a novel adapter layer formalism for adapting multilingual models. The proposed layers are more parameter-efficient than existing adapter layers while obtaining as good or better performance. The layers are specific to one language (as opposed to bilingual adapters), which allows them to be composed and to generalize to unseen language pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language multilingual Transformer baseline trained on TED talks.

FlauBERT: Unsupervised Language Model Pre-training for French
Hang Le | Loïc Vial | Jibril Frej | Vincent Segonne | Maximin Coavoux | Benjamin Lecouteux | Alexandre Allauzen | Benoit Crabbé | Laurent Besacier | Didier Schwab
Proceedings of the Twelfth Language Resources and Evaluation Conference

Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared with the research community for further reproducible experiments in French NLP.

Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Dorothee Beermann | Laurent Besacier | Sakriani Sakti | Claudia Soria
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
Maha Elbayad | Michael Ustaszewski | Emmanuelle Esperança-Rodier | Francis Brunet-Manquat | Jakob Verbeek | Laurent Besacier
Proceedings of the 28th International Conference on Computational Linguistics

We conduct in this work an evaluation study comparing offline and online neural machine translation architectures. Two sequence-to-sequence models: convolutional Pervasive Attention (Elbayad et al. 2018) and attention-based Transformer (Vaswani et al. 2017) are considered. We investigate, for both architectures, the impact of online decoding constraints on the translation quality through a carefully designed human evaluation on English-German and German-English language pairs, the latter being particularly sensitive to latency constraints. The evaluation results allow us to identify the strengths and shortcomings of each model when we shift to the online setup.

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech
William Havard | Laurent Besacier | Jean-Pierre Chevrot
Proceedings of the 24th Conference on Computational Natural Language Learning

The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words from them, but rather adopt a top-down approach and start by segmenting word-like units and then break them down into smaller units. This suggests that the ideal way of learning a language is by starting from full semantic units. In this paper, we investigate if this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluated how well such a network is able to learn a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient. We also explore at which level of the network’s architecture such information should be introduced so as to maximise its performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, by which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.

Pratiques d’évaluation en ASR et biais de performance (Evaluation methodology in ASR and performance bias)
Mahault Garnerin | Solange Rossato | Laurent Besacier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

We reflect on evaluation practices for automatic speech recognition (ASR) systems. After defining the notion of discrimination from a legal standpoint and the notion of fairness in artificial intelligence systems, we examine current practices in major evaluation campaigns. We observe that speech variability, and more specifically speaker-level variability, is not taken into account in current evaluation protocols, which makes it impossible to study potential biases in these systems.

Enabling Interactive Transcription in an Indigenous Community
Eric Le Ferrand | Steven Bird | Laurent Besacier
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel transcription workflow which combines spoken term detection and human-in-the-loop, together with a pilot experiment. This work is grounded in an almost zero-resource scenario where only a few terms have so far been identified, involving two endangered languages. We show that in the early stages of transcription, when the available data is insufficient to train a robust ASR system, it is possible to take advantage of the transcription of a small number of isolated words in order to bootstrap the transcription of a speech collection.

FlauBERT : des modèles de langue contextualisés pré-entraînés pour le français (FlauBERT : Unsupervised Language Model Pre-training for French)
Hang Le | Loïc Vial | Jibril Frej | Vincent Segonne | Maximin Coavoux | Benjamin Lecouteux | Alexandre Allauzen | Benoît Crabbé | Laurent Besacier | Didier Schwab
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Pre-trained language models are now indispensable for obtaining state-of-the-art results in many NLP tasks. Taking advantage of the huge amount of raw text available, they make it possible to extract continuous word representations that are contextualized at the sentence level. The effectiveness of these representations for solving several NLP tasks has recently been demonstrated for English. In this paper, we present and share FlauBERT, a set of models learned on a large and heterogeneous French corpus. Models of different complexity are trained using the new CNRS Jean Zay supercomputer. We evaluate our language models on various French tasks (text classification, paraphrasing, natural language inference, parsing, automatic disambiguation) and show that they often outperform other approaches on the FLUE evaluation benchmark, which is also presented here.

2019

The LIG system for the English-Czech Text Translation Task of IWSLT 2019
Loïc Vial | Benjamin Lecouteux | Didier Schwab | Hang Le | Laurent Besacier
Proceedings of the 16th International Conference on Spoken Language Translation

In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on little data. Therefore, we implemented a Transformer-based encoder-decoder neural system which is able to use the output of a pre-trained language model as input embeddings, and we compared its performance under three configurations: 1) without any pre-trained language model (constrained), 2) using a language model trained on the monolingual parts of the allowed English-Czech data (constrained), and 3) using a language model trained on a large quantity of external monolingual data (unconstrained). We used BERT as the external pre-trained language model (configuration 3), and the BERT architecture for training our own language model (configuration 2). Regarding the training data, we trained our MT system on a small quantity of parallel text: one set consists only of the provided MuST-C corpus, and the other set consists of the MuST-C corpus and the News Commentary corpus from WMT. We observed that using the external pre-trained BERT improves the scores of our system by +0.8 to +1.5 BLEU on our development set, and +0.97 to +1.94 BLEU on the test set. However, using our own language model trained only on the allowed parallel data seems to improve machine translation performance only when the system is trained on the smallest dataset.

Naver Labs Europe’s Systems for the Document-Level Generation and Translation Task at WNGT 2019
Fahimeh Saleh | Alexandre Berard | Ioan Calapodescu | Laurent Besacier
Proceedings of the 3rd Workshop on Neural Generation and Translation

Recently, neural models led to significant improvements in both machine translation (MT) and natural language generation tasks (NLG). However, generation of long descriptive summaries conditioned on structured data remains an open challenge. Likewise, MT that goes beyond sentence-level context is still an open issue (e.g., document-level MT or MT with metadata). To address these challenges, we propose to leverage data from both tasks and do transfer learning between MT, NLG, and MT with source-side metadata (MT+NLG). First, we train document-based MT systems with large amounts of parallel data. Then, we adapt these models to pure NLG and MT+NLG tasks by fine-tuning with smaller amounts of domain-specific data. This end-to-end NLG approach, without data selection and planning, outperforms the previous state of the art on the Rotowire NLG task. We participated in the “Document Generation and Translation” task at WNGT 2019, and ranked first in all tracks.

Controlling Utterance Length in NMT-based Word Segmentation with Attention
Pierre Godard | Laurent Besacier | François Yvon
Proceedings of the 16th International Conference on Spoken Language Translation

One of the basic tasks of computational language documentation (CLD) is to identify word boundaries in an unsegmented phonemic stream. While several unsupervised monolingual word segmentation algorithms exist in the literature, they are challenged in real-world CLD settings by the small amount of available data. A possible remedy is to take advantage of glosses or translation in a foreign, well-resourced language, which often exist for such data. In this paper, we explore and compare ways to exploit neural machine translation models to perform unsupervised boundary detection with bilingual information, notably introducing a new loss function for jointly learning alignment and segmentation. We experiment with an actual under-resourced language, Mboshi, and show that these techniques can effectively control the output segmentation length.

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
William N. Havard | Jean-Pierre Chevrot | Laurent Besacier
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description in a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned by neural networks (the gating paradigm) and show that the correct representation of a word is only activated if the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it. Finally, we suggest that word representation could be activated through a process of lexical competition.

Motivations, challenges, and perspectives for the development of an Automatic Speech Recognition System for the under-resourced Ngiemboon Language
Patrice Yemmene | Laurent Besacier
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers

ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task
Ha Nguyen | Natalia Tomashenko | Marcely Zanon Boito | Antoine Caubrière | Fethi Bougares | Mickael Rouvier | Laurent Besacier | Yannick Estève
Proceedings of the 16th International Conference on Spoken Language Translation

This paper describes the ON-TRAC Consortium translation systems developed for the end-to-end model task of IWSLT Evaluation 2019 for the English→Portuguese language pair. ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). A single end-to-end model built as a neural encoder-decoder architecture with attention mechanism was used for two primary submissions corresponding to the two EN-PT evaluation sets: (1) TED (MuST-C) and (2) How2. In this paper, we notably investigate the impact of pooling heterogeneous corpora for training, the impact of target tokenization (characters or BPEs), and the impact of speech input segmentation, and we also compare our best end-to-end model (BLEU of 26.91 on MuST-C and 43.82 on How2 validation sets) to a pipeline (ASR+MT) approach.

2018

Analyzing Learned Representations of a Deep ASR Performance Prediction Model
Zied Elloumi | Laurent Besacier | Olivier Galibert | Benjamin Lecouteux
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and its relation with different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these 3 types of information at training time through multi-task learning. Our experiments show that this allows us to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.

A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Pierre Godard | Gilles Adda | Martine Adda-Decker | Juan Benjumea | Laurent Besacier | Jamison Cooper-Leavitt | Guy-Noel Kouarata | Lori Lamel | Hélène Maynard | Markus Mueller | Annie Rialland | Sebastian Stueker | François Yvon | Marcely Zanon-Boito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation
Xuanli He | Quan Tran | William Havard | Laurent Besacier | Ingrid Zukerman | Gholamreza Haffari
Proceedings of the Australasian Language Technology Association Workshop 2018

In spite of the recent success of Dialogue Act (DA) classification, the majority of prior works focus on text-based classification with oracle transcriptions, i.e. human transcriptions, instead of Automatic Speech Recognition (ASR)’s transcriptions. In spoken dialog systems, however, the agent would only have access to noisy ASR transcriptions, which may further suffer performance degradation due to domain shift. In this paper, we explore the effectiveness of using both acoustic and textual signals, either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves to be superior to the unimodal models, particularly when the oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.

Token-level and sequence-level loss smoothing for RNN language models
Maha Elbayad | Laurent Besacier | Jakob Verbeek
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the effectiveness of recurrent neural network language models, their maximum likelihood estimation suffers from two limitations. First, it treats all sentences that do not match the ground truth as equally poor, ignoring the structure of the output space. Second, it suffers from ’exposure bias’: during training tokens are predicted given ground-truth sequences, while at test time prediction is conditioned on generated output sequences. To overcome these limitations we build upon the recent reward augmented maximum likelihood approach that encourages the model to predict sentences that are close to the ground truth according to a given performance metric. We extend this approach to token-level loss smoothing, and propose improvements to the sequence-level smoothing approach. Our experiments on two different tasks, image captioning and machine translation, show that token-level and sequence-level loss smoothing are complementary, and significantly improve results.

Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Maha Elbayad | Laurent Besacier | Jakob Verbeek
Proceedings of the 22nd Conference on Computational Natural Language Learning

Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.

Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
Ali Can Kocabiyikoglu | Laurent Besacier | Olivier Kraif
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Prédiction de performance des systèmes de reconnaissance automatique de la parole à l’aide de réseaux de neurones convolutifs [Performance prediction of automatic speech recognition systems using convolutional neural networks]
Zied Elloumi | Benjamin Lecouteux | Olivier Galibert | Laurent Besacier
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard | Laurent Besacier | François Yvon | Martine Adda-Decker | Gilles Adda | Hélène Maynard | Annie Rialland
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second consists in exploring the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to great improvements for the word discovery task, ultimately achieving a leap of about 30% token F-score from the results of a strong baseline.

Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville)
Annie Rialland | Martine Adda-Decker | Guy-Noël Kouarata | Gilles Adda | Laurent Besacier | Lori Lamel | Elodie Gauthier | Pierre Godard | Jamison Cooper-Leavitt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Disentangling ASR and MT Errors in Speech Translation
Ngoc-Tien Le | Benjamin Lecouteux | Laurent Besacier
Proceedings of Machine Translation Summit XVI: Research Track

CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity by a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in unsupervised and supervised ways. Our best run ranked 1st on track 4a with a correlation of 83.02% with human annotations.

Using Word Embedding for Cross-Language Plagiarism Detection
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

Amharic-English Speech Translation in Tourism Domain
Michael Melese | Laurent Besacier | Million Meshesha
Proceedings of the Workshop on Speech-Centric Natural Language Processing

This paper describes speech translation from Amharic to English, particularly Automatic Speech Recognition (ASR) with a post-editing feature and Amharic-English Statistical Machine Translation (SMT). The ASR experiment is conducted using a morpheme language model (LM) and a phoneme acoustic model (AM). Likewise, SMT is conducted using words and morphemes as units. Morpheme-based translation shows a 6.29 BLEU score at 76.4% recognition accuracy, while word-based translation shows a 12.83 BLEU score at 77.4% word recognition accuracy. Further, after post-editing the Amharic ASR output using a corpus-based n-gram, the word recognition accuracy increased by 1.42%. Since the post-editing approach reduces error propagation, the word-based translation accuracy improved by 0.25 (1.95%) BLEU score. We are now working towards further addressing propagated errors through different algorithms at each unit of the speech translation cascading component.

Traitement des Mots Hors Vocabulaire pour la Traduction Automatique de Document OCRisés en Arabe (Out-of-Vocabulary Word Processing for Machine Translation of OCRed Arabic Documents)
Kamel Bouzidi | Zied Elloumi | Laurent Besacier | Benjamin Lecouteux | Mohamed-Faouzi Benzeghiba
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

This article presents an original system for translating scanned Arabic documents. Two modules are cascaded: an Arabic optical character recognition (OCR) system and an Arabic-French machine translation (MT) system. OCR-MT coupling has received little attention in the literature, and the originality of this study lies in proposing a tight coupling between OCR and MT, together with a specific treatment of the out-of-vocabulary (OOV) words caused by OCR errors. Lattice-based OCR-MT coupling, along with our handling of OOV words by replacement according to a composite measure that takes into account both the surface form and the context of the word, yields a significant improvement in translation performance. The experiments are carried out on a corpus of scanned Arabic newspapers and yield BLEU score improvements of 3.73 and 5.5 on the development and test sets, respectively.

Deep Investigation of Cross-Language Plagiarism Detection Methods
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper is a deep investigation of cross-language plagiarism detection methods on a recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages.

LIG-CRIStAL Submission for the WMT 2017 Automatic Post-Editing Task
Alexandre Bérard | Laurent Besacier | Olivier Pietquin
Proceedings of the Second Conference on Machine Translation

2016

Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof
Elodie Gauthier | Laurent Besacier | Sylvie Voisin | Michael Melese | Uriel Pascal Elingui
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This article presents the data collected and ASR systems developed for 4 Sub-Saharan African languages (Swahili, Hausa, Amharic and Wolof). To illustrate our methodology, the focus is on Wolof (a very under-resourced language), for which we designed the first ASR system ever built in this language. All data and scripts are available online on our GitHub repository.

Word2Vec vs DBnary: Augmenting METEOR using Vector Representations or Lexical Resources?
Christophe Servan | Alexandre Bérard | Zied Elloumi | Hervé Blanchon | Laurent Besacier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper presents an approach combining lexico-semantic resources and distributed representations of words applied to the evaluation in machine translation (MT). This study is made through the enrichment of a well-known MT evaluation metric: METEOR. METEOR enables an approximate match (synonymy or morphological similarity) between an automatic and a reference translation. Our experiments are made in the framework of the Metrics task of WMT 2014. We show that distributed representations are a good alternative to lexico-semantic resources for MT evaluation and they can even bring interesting additional information. The augmented versions of METEOR, using vector representations, are made available on our GitHub page.

Word2Vec vs DBnary ou comment (ré)concilier représentations distribuées et réseaux lexico-sémantiques ? Le cas de l’évaluation en traduction automatique (Word2Vec vs DBnary or how to bring back together vector representations and lexical resources ? A case study for machine translation evaluation)
Christophe Servan | Zied Elloumi | Hervé Blanchon | Laurent Besacier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

This paper presents an approach combining lexico-semantic networks and distributed word representations, applied to machine translation evaluation. The study is carried out by enriching a well-known metric for evaluating machine translation (MT): METEOR. METEOR allows an approximate match (morphological similarity or synonymy) between an automatic system output and a reference translation. Our experiments rely on the Metrics task of the WMT 2014 evaluation campaign and show that distributed representations remain less effective than lexico-semantic resources for MT evaluation, but can nevertheless provide interesting additional information to complement them.

Joint ASR and MT Features for Quality Estimation in Spoken Language Translation
Ngoc-Tien Le | Benjamin Lecouteux | Laurent Besacier
Proceedings of the 13th International Conference on Spoken Language Translation

This paper addresses automatic quality assessment for spoken language translation (SLT). More precisely, we propose several effective estimators based on our estimation of transcription (ASR) quality, translation (MT) quality, or both (combined and joint features using ASR and MT information). Our experiments provide an important opportunity to advance the understanding of the prediction quality of words in an SLT output, as revealed by MT and ASR features. These results could be applied to interactive speech translation or computer-assisted translation of speeches and lectures. For reproducible experiments, the code for calling our WCE-LIG application and the corpora used are made available to the research community.

Projection Interlingue d’Étiquettes pour l’Annotation Sémantique Non Supervisée (Cross-lingual Annotation Projection for Unsupervised Semantic Tagging)
Othman Zennaki | Nasredine Semmar | Laurent Besacier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Our work concerns the rapid construction of linguistic analysis tools for under-resourced languages. In a previous contribution, we proposed a method for automatically building a part-of-speech tagger via cross-lingual projection of linguistic annotations from parallel corpora (a method based on recurrent neural networks). In this article, we present an improvement of our neural model that takes external linguistic information into account to build a more complex tagger. In particular, we propose integrating part-of-speech annotations into our neural architecture for the unsupervised learning of coarse-grained multilingual semantic taggers (SuperSense annotation). We demonstrate the validity and genericity of our method on Italian and French, and also study the impact of the quality of the parallel corpus (generated by manual or automatic translation) on our approach. Our experiments concern the projection of annotations from English to French and Italian.

MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP
Alexandre Bérard | Christophe Servan | Olivier Pietquin | Laurent Besacier
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present MultiVec, a new toolkit for computing continuous representations for text at different granularity levels (word-level or sequences of words). MultiVec includes word2vec’s features, paragraph vector (batch and online) and bivec for bilingual distributed representations. MultiVec also includes different distance measures between words and sequences of words. The toolkit is written in C++ and is aimed at being fast (in the same order of magnitude as word2vec), easy to use, and easy to extend. It has been evaluated on several NLP tasks: the analogical reasoning task, sentiment analysis, and crosslingual document classification.

A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection
Jérémy Ferrero | Frédéric Agnès | Laurent Besacier | Didier Schwab
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present preexisting corpora and their limits, and we explain the various resources gathered to overcome these limits and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment for different granularities (from chunk to document), is based on both parallel and comparable corpora and contains human and machine translated texts. Moreover, it includes texts written by multiple types of authors (from average writers to professionals). With the obtained dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods. The evaluation results are reviewed and discussed. Finally, dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.

In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.

Inducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks
Othman Zennaki | Nasredine Semmar | Laurent Besacier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This work focuses on the development of linguistic analysis tools for resource-poor languages. We use a parallel corpus to produce a multilingual word representation based only on sentence level alignment. This representation is combined with the annotated source side (resource-rich language) of the parallel corpus to train text analysis tools for resource-poor languages. Our approach is based on Recurrent Neural Networks (RNN) and has the following advantages: (a) it does not use word alignment information, (b) it does not assume any knowledge about foreign languages, which makes it applicable to a wide range of resource-poor languages, (c) it provides truly multilingual taggers. In a previous study, we proposed a method based on Simple RNN to automatically induce a Part-Of-Speech (POS) tagger. In this paper, we propose an improvement of our neural model. We investigate the Bidirectional RNN and the inclusion of external information (for instance low level information from Part-Of-Speech tags) in the RNN to train a more complex tagger (for instance, a multilingual super sense tagger). We demonstrate the validity and genericity of our method by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual POS and super sense taggers.

2015

Utilisation des réseaux de neurones récurrents pour la projection interlingue d’étiquettes morpho-syntaxiques à partir d’un corpus parallèle
Othman Zennaki | Nasredine Semmar | Laurent Besacier
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

The construction of linguistic analysis tools for under-resourced languages is limited, among other things, by the lack of annotated corpora. In this article, we propose a method for automatically building analysis tools via cross-lingual projection of linguistic annotations using parallel corpora. Our approach uses no other sources of information, which makes it applicable to a wide range of under-resourced languages. We propose using recurrent neural networks to project annotations from one language to another (without using word alignment information). As a first step, we explore the part-of-speech tagging task. Our method, combined with a basic annotation projection method (using word-to-word alignment), gives results comparable to the state of the art on a similar task.

Utilisation de mesures de confiance pour améliorer le décodage en traduction de parole
Laurent Besacier | Benjamin Lecouteux | Luong Ngoc Quang
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Word-level confidence measures (Word Confidence Estimation, WCE) for machine translation (MT) or automatic speech recognition (ASR) assign a confidence score to each word in a transcription or translation hypothesis. In the past, the estimation of these measures has most often been treated separately in ASR or MT contexts. Here we propose a joint estimation of the confidence associated with a word in a spoken language translation (SLT) hypothesis. This estimation draws on features from both the speech transcription (ASR) systems and the machine translation (MT) systems. In addition to building these robust confidence estimators for SLT, we use the confidence information to re-decode our translation hypothesis graphs. The experiments show that using these confidence measures during a second decoding pass yields a significant improvement in translation performance (evaluated with the BLEU metric: gains of two points over our baseline speech translation system). These experiments are carried out on a French-English SLT task for which a corpus was specially designed (this corpus, made available to the NLP community, is also described in detail in the article).

Unsupervised and Lightly Supervised Part-of-Speech Tagging Using Recurrent Neural Networks
Othman Zennaki | Nasredine Semmar | Laurent Besacier
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Automated Translation of a Literary Work: A Pilot Study
Laurent Besacier | Lane Schwartz
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

METEOR for multiple target languages using DBnary
Zied Elloumi | Hervé Blanchon | Gilles Serasset | Laurent Besacier
Proceedings of Machine Translation Summit XV: Papers

An open-source toolkit for word-level confidence estimation in machine translation
Christophe Servan | Ngoc Tien Le | Ngoc Quang Luong | Benjamin Lecouteux | Laurent Besacier
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

2014

An efficient two-pass decoder for SMT using word confidence estimation
Ngoc-Quang Luong | Laurent Besacier | Benjamin Lecouteux
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

LIG System for Word Level QE task at WMT14
Ngoc-Quang Luong | Laurent Besacier | Benjamin Lecouteux
Proceedings of the Ninth Workshop on Statistical Machine Translation

Word confidence estimation for speech translation
L. Besacier | B. Lecouteux | N. Q. Luong | K. Hour | M. Hadjsalah
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

Word Confidence Estimation (WCE) for machine translation (MT) or automatic speech recognition (ASR) consists of judging each word in the (MT or ASR) hypothesis as correct or incorrect by tagging it with an appropriate label. In the past, this task has been treated separately in ASR or MT contexts, and we propose here a joint estimation of word confidence for a spoken language translation (SLT) task involving both ASR and MT. This research work is possible because we built a specific corpus, which is presented first. This corpus contains 2643 speech utterances, for each of which a quintuplet is made available: ASR output (src-asr), verbatim transcript (src-ref), text translation output (tgt-mt), speech translation output (tgt-slt) and post-edition of translation (tgt-pe). The rest of the paper illustrates how such a corpus (made available to the research community) can be used for evaluating word confidence estimators in ASR, MT or SLT scenarios. WCE for SLT could help rescore SLT output graphs or improve translators' productivity (for translation of lectures or movie subtitling), or it could be useful in interactive speech-to-speech translation scenarios.
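To make the word-level labeling concrete, here is a minimal sketch of how correct/incorrect tags could be derived by aligning a translation hypothesis with its post-edition. It is an illustration only: the abstract does not say how labels are obtained, so the use of Python's `difflib.SequenceMatcher` for the alignment, the function name `label_words`, and the `G`/`B` tag names are all assumptions.

```python
from difflib import SequenceMatcher


def label_words(hypothesis, post_edition):
    """Tag each hypothesis word 'G' (kept in the post-edition) or 'B' (changed).

    Assumption: a word counts as correct when it falls inside a matching
    block of a word-level alignment between hypothesis and post-edition.
    """
    hyp = hypothesis.split()
    ref = post_edition.split()
    labels = ["B"] * len(hyp)
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block covers hyp[block.a : block.a + block.size].
        for i in range(block.a, block.a + block.size):
            labels[i] = "G"
    return list(zip(hyp, labels))


tagged = label_words("the cat sit on mat", "the cat sits on the mat")
for word, label in tagged:
    print(word, label)
```

Words missing from the hypothesis (here the second "the") receive no label in this sketch, since only hypothesis words are tagged.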

Data selection for compact adapted SMT models
Shachar Mirkin | Laurent Besacier
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Data selection is a common technique for adapting statistical translation models to a specific domain, which has been shown both to improve translation quality and to reduce model size. Selection relies on some in-domain data, from the same domain as the texts expected to be translated. Selecting the sentence pairs that are most similar to the in-domain data from a pool of parallel texts has been shown to be effective; yet this approach risks limited coverage when necessary n-grams that do appear in the pool are less similar to the in-domain data that is available in advance. Some methods select additional data based on the actual text that needs to be translated. While useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques over Arabic to French datasets, and propose methods to address both similarity and coverage considerations while maintaining a limited model size.

Machine translation for litterature: a pilot study (Traduction automatisée d’une oeuvre littéraire: une étude pilote) [in French]
Laurent Besacier
Proceedings of TALN 2014 (Volume 2: Short Papers)

Word Confidence Estimation for SMT N-best List Re-ranking
Ngoc-Quang Luong | Laurent Besacier | Benjamin Lecouteux
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

Préface [Foreword]
Laurent Besacier | Wolfgang Minker
Traitement Automatique des Langues, Volume 55, Numéro 2 : Traitement automatique du langage parlé [Spoken language processing]

2013

LIG System for WMT13 QE Task: Investigating the Usefulness of Features in Word Confidence Estimation for MT
Ngoc-Quang Luong | Benjamin Lecouteux | Laurent Besacier
Proceedings of the Eighth Workshop on Statistical Machine Translation

How hard is it to automatically translate phrasal verbs from English to French?
Carlos Ramish | Laurent Besacier | Alexander Kobzar
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technologies

Discriminative statistical approaches for multilingual speech understanding (Approches statistiques discriminantes pour l’interprétation sémantique multilingue de la parole) [in French]
Bassam Jabaian | Fabrice Lefèvre | Laurent Besacier
Proceedings of TALN 2013 (Volume 1: Long Papers)

Driven Decoding for machine translation (Vers un décodage guidé pour la traduction automatique) [in French]
Benjamin Lecouteux | Laurent Besacier
Proceedings of TALN 2013 (Volume 2: Short Papers)

Urdu Hindi Machine Transliteration using SMT
M. G. Abbas Malik | Christian Boitet | Laurent Besacier | Pushpak Bhattacharyya
Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing

Fast Bootstrapping of Grapheme to Phoneme System for Under-resourced Languages - Application to the Iban Language
Sarah Samson Juan | Laurent Besacier
Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing

2012

The LIG English to French machine translation system for IWSLT 2012
Laurent Besacier | Benjamin Lecouteux | Marwen Azouzi | Ngoc Quang Luong
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper presents the LIG participation in the E-F MT task of IWSLT 2012. The primary system proposed made a large improvement (more than 3 BLEU points on the tst2010 set) compared to our participation last year. Part of this improvement was due to the use of an extraction from the Gigaword corpus. We also propose a preliminary adaptation of the driven decoding concept for machine translation. This method allows an efficient combination of machine translation systems by rescoring the log-linear model at the N-best list level according to auxiliary systems: the basic technique is essentially guiding the search using one or more previous system outputs. The results show that the approach allows a significant improvement in BLEU score using Google Translate to guide our own SMT system. We also tried to use a confidence measure as an additional log-linear feature, but we could not get any improvement with this technique.

Towards a better understanding of statistical post-editing
Marion Potet | Laurent Besacier | Hervé Blanchon | Marwen Azouzi
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

We describe several experiments to better understand the usefulness of statistical post-edition (SPE) to improve the raw outputs of phrase-based statistical MT (PBMT) systems. Whatever the size of the training corpus, we show that SPE systems trained on general domain data offer no breakthrough over our baseline general domain PBMT system. However, using manually post-edited system outputs to train the SPE led to a slight improvement in translation quality compared with the use of professional reference translations. We also show that SPE is far more effective for domain adaptation, mainly because it recovers a lot of specific terms unknown to our general PBMT system. Finally, we compare two domain adaptation techniques, post-editing a general domain PBMT system vs building a new domain-adapted PBMT system with two different techniques, and show that the latter outperforms the first one. Yet, when the PBMT is a “black box”, SPE trained with post-edited system outputs remains an interesting option for domain adaptation.

Développement de ressources en swahili pour un sytème de reconnaisance automatique de la parole (Developments of Swahili resources for an automatic speech recognition system) [in French]
Hadrien Gelas | Laurent Besacier | François Pellegrino
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations
Laurent Besacier | Hervé Blanchon | Gilles Sérasset
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations

Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 4: Invited Conferences
Laurent Besacier | Hervé Blanchon | Marie-Paule Jacques | Nathalie Vallée | Gilles Sérasset
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 4: Invited Conferences

Robustesse et portabilités multilingue et multi-domaines des systèmes de compréhension de la parole : les corpus du projet PortMedia (Robustness and portability of spoken language understanding systems among languages and domains : the PORTMEDIA project) [in French]
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina Rojas-Barahona
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Collection of a Large Database of French-English SMT Output Corrections
Marion Potet | Emmanuelle Esperança-Rodier | Laurent Besacier | Hervé Blanchon
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus-based approaches to machine translation (MT) rely on the availability of parallel corpora. To produce user-acceptable translation outputs, such systems need high quality data to be efficiently trained, optimized and evaluated. However, building a high quality dataset is a relatively expensive task. In this paper, we describe the data collection and analysis of a large database of 10,881 manually corrected SMT translation output hypotheses. These post-editions were collected using Amazon's Mechanical Turk, following some ethical guidelines. A complete analysis of the collected data pointed out a high quality of the corrections, with more than 87 % of the collected post-editions improving the hypotheses and more than 94 % of the crowdsourced post-editions being at least of professional quality. We also post-edited 1,500 gold-standard reference translations (of bilingual parallel corpora generated by professionals) and noticed that 72 % of these translations needed to be corrected during post-edition. We computed a proximity measure between the different kinds of translations and pointed out that reference translations are as far from the hypotheses as from the corrected hypotheses (i.e. the post-editions). In light of these last findings, we discuss the adequacy of text-based generated reference translations to train sentence-to-sentence based SMT systems.

Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]
Hadrien Gelas | Solomon Teferra Abate | Laurent Besacier | François Pellegrino
JEP-TALN-RECITAL 2012, Workshop TALAf 2012: Traitement Automatique des Langues Africaines (TALAf 2012: African Language Processing)

Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP
Laurent Besacier | Benjamin Lecouteux | Gilles Sérasset
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Leveraging study of robustness and portability of spoken language understanding systems across languages and domains: the PORTMEDIA corpora
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina M. Rojas-Barahona
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The PORTMEDIA project is intended to develop new corpora for the evaluation of spoken language understanding systems. The newly collected data are in the field of human-machine dialogue systems for tourist information in French in line with the MEDIA corpus. Transcriptions and semantic annotations, obtained by low-cost procedures, are provided to allow a thorough evaluation of the systems' capabilities in terms of robustness and portability across languages and domains. A new test set with some adaptation data is prepared for each case: in Italian as an example of a new language, for ticket reservation as an example of a new domain. Finally the work is complemented by the proposition of a new high level semantic annotation scheme well-suited to dialogue data.

2011

LIG English-French spoken language translation system for IWSLT 2011
Benjamin Lecouteux | Laurent Besacier | Hervé Blanchon
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the system developed by the LIG laboratory for the 2011 IWSLT evaluation. We participated in the English-French MT and SLT tasks. The development of a reference translation system (MT task), as well as an ASR output translation system (SLT task), is presented. We focus this year on the SLT task and on the use of multiple 1-best ASR outputs to improve overall translation quality. The main experiment presented here compares the performance of an SLT system where multiple ASR 1-best outputs are combined before translation (source combination) with an SLT system where multiple ASR 1-best outputs are translated, the system combination being conducted afterwards on the target side (target combination). The experimental results show that the second approach (target combination) outperforms the first one when performance is measured with BLEU.

The LIGA (LIG/LIA) Machine Translation System for WMT 2011
Marion Potet | Raphaël Rubino | Benjamin Lecouteux | Stéphane Huet | Laurent Besacier | Hervé Blanchon | Fabrice Lefèvre
Proceedings of the Sixth Workshop on Statistical Machine Translation

Oracle-based Training for Phrase-based Statistical Machine Translation
Marion Potet | Emmanuelle Esperança-Rodier | Hervé Blanchon | Laurent Besacier
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

Comparaison et combinaison d’approches pour la portabilité vers une nouvelle langue d’un système de compréhension de l’oral (Comparison and combination of approaches for the portability to a new language of an oral comprehension system)
Bassam Jabaian | Laurent Besacier | Fabrice Lefèvre
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we propose several approaches for porting the spoken language understanding (SLU) module of a dialogue system from one language to another. We show that using statistical machine translation (SMT) helps reduce the time and cost of porting such a system from a source language to a target language. For the semantic tagging task, we propose using either conditional random fields (CRF) or the phrase-based approach (PH-SMT). Experimental results show the effectiveness of the proposed methods for rapidly porting the SLU module to a new language. We also propose two methods for increasing the robustness of the SLU module to translation errors. Finally, we show that combining these approaches reduces the system's errors. This work is motivated by the availability of the French MEDIA corpus and of the manual translation into Italian of a subset of this corpus.

2010

Weak Translation Problems – a case study of Scriptural Translation
Muhammad Ghulam Abbas Malik | Christian Boitet | Pushpak Bhattacharyya | Laurent Besacier
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

General purpose, high quality and fully automatic MT is believed to be impossible. We are interested in scriptural translation problems, which are weak sub-problems of the general problem of translation. We introduce the characteristics of the weak problems of translation and of the scriptural translation problems, describe different computational approaches (finite-state, statistical and hybrid) to solve these problems, and report our results on several combinations of Indo-Pak languages and writing systems.

Improved Vietnamese-French parallel corpus mining using English language
Thi Ngoc Diep Do | Laurent Besacier | Eric Castelli
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

Boosting N-gram Coverage for Unsegmented Languages Using Multiple Text Segmentation Approach
Solomon Teferra Abate | Laurent Besacier | Sopheap Seng
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing

A fully unsupervised approach for mining parallel data from comparable corpora
Thi Ngoc Diep Do | Laurent Besacier | Eric Castelli
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

The LIG Machine Translation System for WMT 2010
Marion Potet | Laurent Besacier | Hervé Blanchon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Automatic Identification of Arabic Dialects
Mohamed Belgacem | Georges Antoniadis | Laurent Besacier
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this work, automatic recognition of Arabic dialects is proposed. An acoustic survey of the proportion of vocalic intervals and the standard deviation of consonantal intervals in nine dialects (Tunisia, Morocco, Algeria, Egypt, Syria, Lebanon, Yemen, Gulf countries and Iraq) is performed using the ALIZE platform and Gaussian Mixture Models (GMM). The results show the complexity of the automatic identification of Arabic dialects, since no clear border can be found between the dialects, only a gradual transition between them. They can even vary slightly from one city to another. The existence of this gradual change is easy to understand: it corresponds to a human and social reality, to the contacts, friendships and affinities formed in the individual's more or less immediate environment. This document also raises questions about the classes or macro-classes of Arabic dialects observed in the confusion matrix and in the design of the hierarchical tree obtained.
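The two rhythm measures named above can be illustrated with a short sketch. It assumes the speech has already been segmented into labeled vocalic ("V") and consonantal ("C") intervals with durations in seconds. The function name, the input format, and the choice of the population standard deviation are assumptions for illustration, not details given in the abstract.

```python
from statistics import pstdev


def rhythm_metrics(intervals):
    """Return (percent_v, delta_c) for a list of (label, duration) pairs.

    percent_v: share of total duration taken up by vocalic intervals, in %.
    delta_c:   standard deviation of consonantal interval durations
               (population standard deviation, an assumption).
    """
    vocalic = [dur for label, dur in intervals if label == "V"]
    consonantal = [dur for label, dur in intervals if label == "C"]
    total = sum(vocalic) + sum(consonantal)
    percent_v = 100.0 * sum(vocalic) / total
    delta_c = pstdev(consonantal)
    return percent_v, delta_c


# Hypothetical utterance: alternating consonantal and vocalic intervals.
example = [("C", 0.08), ("V", 0.12), ("C", 0.10),
           ("V", 0.10), ("C", 0.12), ("V", 0.08)]
percent_v, delta_c = rhythm_metrics(example)
print(f"percent_v={percent_v:.1f} delta_c={delta_c:.4f}")
```

Per-utterance values like these could then serve as the features over which a statistical model such as a GMM is fitted for each dialect.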

Apprentissage non supervisé pour la traduction automatique : application à un couple de langues peu doté
Thi Ngoc Diep | Laurent Besacier | Eric Castelli
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents an unsupervised method for extracting parallel sentence pairs from a comparable corpus. A machine translation system is used to exploit the comparable corpus and detect parallel sentence pairs. An iterative process is run not only to increase the number of extracted parallel sentence pairs but also to improve the overall quality of the translation system. A comparison with a semi-supervised method is also presented. Experiments show that the unsupervised method can indeed be applied when parallel data are lacking. Although the preliminary experiments are conducted on French-English translation, this unsupervised method is also successfully applied to an under-resourced language pair: Vietnamese-French.

LIG statistical machine translation systems for IWSLT 2010
Laurent Besacier | Haitem Afli | Thi Ngoc Diep Do | Hervé Blanchon | Marion Potet
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

2009

Segmentation multiple d’un flux de données textuelles pour la modélisation statistique du langage
Sopheap Seng | Laurent Besacier | Brigitte Bigi | Eric Castelli
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we address the problem of statistical language modeling for under-resourced languages that have no segmentation between words. While the lack of text data affects model performance, the errors introduced by automatic segmentation can make these data even less usable. To make the best use of the text data, we propose a method that performs multiple segmentations of the training corpus instead of a single segmentation. This method, based on finite-state automata, recovers n-grams missed by the single segmentation and generates new n-grams for language model training. Applied to training language models for automatic speech recognition systems in Khmer and Vietnamese, this approach outperformed the rule-based single-segmentation method.
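The idea of collecting n-grams from several segmentations rather than one can be sketched as follows. This is a toy illustration under stated assumptions: it enumerates segmentations with plain recursion over a small hypothetical lexicon instead of the finite-state automata used in the paper, and the function names are invented for the example.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Extend each segmentation of the remainder with this word.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


def ngrams(words, n):
    """Return the set of n-grams (as tuples) in a word sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


# Hypothetical lexicon and unsegmented input.
lexicon = {"a", "b", "ab", "c"}
segmentations = all_segmentations("abc", lexicon)

# Bigrams pooled over all segmentations vs. a single segmentation.
pooled = set().union(*(ngrams(seg, 2) for seg in segmentations))
single = ngrams(segmentations[-1], 2)
print(segmentations)
print(sorted(pooled - single))
```

Exhaustive enumeration grows quickly with sentence length, which is one reason a compact automaton or lattice representation is preferable in practice.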

LIG approach for IWSLT09
Fethi Bougares | Laurent Besacier | Hervé Blanchon
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the LIG experiments in the context of the IWSLT09 evaluation (Arabic to English Statistical Machine Translation task). Arabic is a morphologically rich language, and recent experiments in our laboratory have shown that the performance of Arabic to English SMT systems varies greatly according to the Arabic morphological segmenters applied. Based on this observation, we propose to use multiple segmentations simultaneously for machine translation of Arabic. The core idea is to keep the ambiguity of the Arabic segmentation in the system input (using confusion networks or lattices). Then, we hope that the best segmentation will be chosen during MT decoding. The mathematics of this multiple segmentation approach are given. Practical implementations in the case of verbatim text translation as well as speech translation (outside of the scope of IWSLT09 this year) are proposed. Experiments conducted in the framework of the IWSLT evaluation campaign show the potential of the multiple segmentation approach. The last part of this paper explains in detail the different systems submitted by LIG at IWSLT09 and the results obtained.

A Hybrid Model for Urdu Hindi Transliteration
Abbas Malik | Laurent Besacier | Christian Boitet | Pushpak Bhattacharyya
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

Exploitation d’un corpus bilingue pour la création d’un système de traduction probabiliste Vietnamien - Français
Thi-Ngoc-Diep Do | Viet-Bac Le | Brigitte Bigi | Laurent Besacier | Eric Castelli
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents our first work toward building a statistical machine translation system for the Vietnamese-French language pair. Since Vietnamese is considered an under-resourced language, one of the difficulties lies in building the parallel corpora that are essential for training the models. We focus on building a large Vietnamese-French parallel corpus. We apply a method for automatically identifying pairs of parallel documents based on publication date, special words, and sentence alignment scores. This article also presents the construction of a first Vietnamese-French and French-Vietnamese statistical machine translation system from this corpus and discusses whether to use lexical or sub-lexical units for Vietnamese (syllables, words, or their combinations). The system's performance is encouraging and compares favorably with that of the Google system.

Mining a Comparable Text Corpus for a Vietnamese-French Statistical Machine Translation System
Thi-Ngoc-Diep Do | Viet-Bac Le | Brigitte Bigi | Laurent Besacier | Eric Castelli
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

First Broadcast News Transcription System for Khmer Language
Sopheap Seng | Sethserey Sam | Laurent Besacier | Brigitte Bigi | Eric Castelli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present an overview of the development of a large vocabulary continuous speech recognition (LVCSR) system for Khmer, the official language of Cambodia, spoken by more than 15 million people. Because Khmer is an under-resourced language, developing an LVCSR system for it is a challenging task. We describe our methodologies for quick language data collection and processing for language modeling and acoustic modeling. For language modeling, we investigate the use of words and sub-words as the basic modeling unit in order to assess the potential of sub-word units for an unsegmented language like Khmer. Grapheme-based acoustic modeling is used to quickly build our Khmer acoustic model. Furthermore, the approaches and tools used for the development of our system are documented and made publicly available on the web. We hope this will help accelerate the development of LVCSR systems for new languages, especially for under-resourced languages of developing countries where resources and expertise are limited.

The LIG Arabic/English speech translation system at IWSLT08.
L. Besacier | A. Ben-Youssef | H. Blanchon
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper is a description of the system presented by the LIG laboratory to the IWSLT08 speech translation evaluation. The LIG participated, for the second time this year, in the Arabic to English speech translation task. For translation, we used a conventional statistical phrase-based system developed using the Moses open-source decoder. We describe chronologically the improvements made since last year, starting from the IWSLT 2007 system and continuing with the improvements made for our 2008 submission. Then, we discuss in section 5 some very recent post-evaluation experiments, as well as some ongoing work on Arabic/English speech-to-text translation. This year, the systems were ranked according to the (BLEU+METEOR)/2 score of the primary ASR output run submissions. The LIG was ranked 5th/10 based on this rule.

2007

The LIG Arabic/English speech translation system at IWSLT07
Laurent Besacier | Amar Mahdhaoui | Viet-Bac Le
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper is a description of the system presented by the LIG laboratory to the IWSLT07 speech translation evaluation. The LIG participated, for the first time this year, in the Arabic to English speech translation task. For translation, we used a conventional statistical phrase-based system developed using the Moses open-source decoder. Our baseline MT system is described, and we discuss in particular the use of an additional bilingual dictionary, which seems useful when little training data is available. The main contribution of this paper is a lattice decomposition algorithm that transforms a word lattice into a sub-word lattice compatible with our MT model, which uses word segmentation on the Arabic side. The lattice is then transformed into a confusion network which can be directly decoded by Moses. The results show that this method outperforms the conventional 1-best translation, which consists of translating only the most probable ASR hypothesis. The best BLEU score from ASR output obtained on IWSLT06 evaluation data is 0.2253. The results confirm the interest of full CN decoding for speech translation, compared to the traditional ASR 1-best approach. Our primary system was ranked 7/14 for the IWSLT07 AE ASR task with a BLEU score of 0.3804.

2006

IBM MASTOR SYSTEM: Multilingual Automatic Speech-to-Speech Translator
Yuqing Gao | Bowen Zhou | Ruhi Sarikaya | Mohamed Afify | Hong-Kwang Kuo | Wei-zhong Zhu | Yonggang Deng | Charles Prosser | Wei Zhang | Laurent Besacier
Proceedings of the First International Workshop on Medical Speech Translation

A French Non-Native Corpus for Automatic Speech Recognition
Tien-Ping Tan | Laurent Besacier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Automatic speech recognition (ASR) technology has reached a level of maturity at which it is already practical for novice users. However, most non-native speakers are still not comfortable with services that include ASR systems, because of their accuracy on non-native speech. This paper describes our approach to constructing a non-native corpus, specifically in French, for testing automatic speech recognition and adapting it to non-native speakers. Finally, we also propose a method for detecting pronunciation variants and possible pronunciation mistakes by non-native speakers.

2004

Traduction de dialogue: résultats du projet NESPOLE! et pistes pour le domaine
Hervé Blanchon | Laurent Besacier
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this article, we detail the results of the second evaluation of the European project NESPOLE!, in which we took part for French. In this project, as in those that preceded it, subjective evaluation techniques, carried out by human evaluators, were used. We also present the new objective (automatic) techniques proposed for written-text translation and used in the C-STAR III project. We conclude by proposing some ideas and perspectives for the field.

Modèle de langage sémantique pour la reconnaissance automatique de parole dans un contexte de traduction
Quang Vu-minh | Laurent Besacier | Hervé Blanchon | Brigitte Bigi
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

The work presented in this article was carried out as part of a broader automatic speech translation project. The translation approach is based on a pivot language, or Interchange Format (IF), which represents the meaning of a sentence independently of the language. We propose a method that integrates semantic information into the statistical language model of the automatic speech recognition system. The principle is to use certain classes defined in the IF as semantic classes in the language model. This allows the speech recognition system to partially analyze speech turns into IF. The experiments show that with this approach, the recognition system can directly analyze into IF part of the dialogue data of our application without calling the translation system (35% of the words; 58% of the speech turns), while maintaining the same level of performance for the overall system.

Spoken and Written Language Resources for Vietnamese
Viet-Bac Le | Do-Dat Tran | Eric Castelli | Laurent Besacier | Jean-François Serignat
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Spoken dialogue translation systems evaluation: results, new trends, problems and proposals
Herve Blanchon | Christian Boitet | Laurent Besacier
Proceedings of the First International Workshop on Spoken Language Translation: Papers

2000

A New Methodology for Speech Corpora Definition from Internet Documents
D. Vaufreydaz | C. Bergamini | J.F. Serignat | L. Besacier | M. Akbar
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Co-authors

Ngoc Quang Luong 6

Marcely Zanon Boito 6

Brigitte Bigi 5

Thi-Ngoc-Diep Do 5

Mahault Garnerin 5

Éric Le Ferrand 5

Fabrice Lefèvre 5

Vassilina Nikoulina 5

Frédéric Agnès 4

Christian Boitet 4

Ioan Calapodescu 4

Yannick Estève 4

Jérémy Ferrero 4

Pierre Godard 4

Bassam Jabaian 4

Solange Rossato 4

Nasredine Semmar 4

Christophe Servan 4

Gilles Sérasset 4

François Yvon 4

Othman Zennaki 4

Martine Adda-Decker 3

Pushpak Bhattacharyya 3

Fethi Bougares 3

Maximin Coavoux 3

Marco Dinarelli 3

Emmanuelle Esperança-Rodier 3

Matthias Gallé 3

William Havard 3

Annie Rialland 3

Jakob Verbeek 3

Solomon Teferra Abate 2

Alexandre Allauzen 2

Marwen Azouzi 2

Hélène Bonneau-Maynard 2

Caroline Brun 2

Nathalie Camelin 2

Antoine Caubrière 2

Jean-Pierre Chevrot 2

Jamison Cooper-Leavitt 2

Benoit Crabbé 2

Olivier Galibert 2

Elodie Gauthier 2

Hadrien Gelas 2

William N. Havard 2

James Henderson 2

Zae Myung Kim 2

Guy-Noel Kouarata 2

Michael Melese 2

Alireza Mohammadshahi 2

Djamel Mostefa 2

François Pellegrino 2

Olivier Pietquin 2

Matthieu Quignard 2

Lina M. Rojas Barahona 2

Vincent Segonne 2

Jean-François Serignat 2

Natalia Tomashenko 2

Aline Villavicencio 2

Changhan Wang 2

Mohamed Afify 1

Mohammad Akbar 1

Md Mahfuz Ibn Alam 1

Antonios Anastasopoulos 1

Georges Antoniadis 1

Katya Aplonova 1

Claude Barras 1

Dorothee Beermann 1

Mohamed Belgacem 1

A. Ben-Youssef 1

Juan Benjumea 1

Mohamed-Faouzi Benzeghiba 1

Carole Bergamini 1

Kamel Bouzidi 1

Hervé Bredin 1

Pierrick Bruneau 1

Francis Brunet-Manquat 1

Mateusz Budnik 1

Christopher Cox 1

Yonggang Deng 1

Georgiana Dinu 1

Uriel Pascal Elingui 1

Marcello Federico 1

Gil Francopoulo 1

Benjamin Galliot 1

Eric Gaussier 1

Muhammad Ghulam Abbas Malik 1

Séverine Guillaume 1

Gholamreza Haffari 1

Javier Hernando 1

Nicolas Hervé 1

Stéphane Huet 1

Guillaume Jacques 1

Marie-Paule Jacques 1

Kweonwoo Jung 1

Dongyeop Kang 1

Alexander Kobzar 1

Ali Can Kocabiyikoglu 1

Philipp Koehn 1

Olivier Kraif 1

Hong-Kwang Kuo 1

Ivana Kvapilíková 1

Nicholas Lambourne 1

Siddique Latif 1

Antoine Laurent 1

Amar Mahdhaoui 1

M. G. Abbas Malik 1

Joseph Mariani 1

Sylvain Meignier 1

Million Meshesha 1

Alexis Michaud 1

Wolfgang Minker 1

Shachar Mirkin 1

Markus Mueller 1

Thi Ngoc Diep 1

Valentin Pelloin 1

Johann Poignant 1

Diana Nicoleta Popa 1

Charles Prosser 1

Luong Ngoc Quang 1

Georges Quénot 1

Carlos Ramish 1

Sophie Rosset 1

Mickael Rouvier 1

Raphael Rubino 1

Sakriani Sakti 1

Fahimeh Saleh 1

Sethserey Sam 1

Sarah Samson Juan 1

Rahasya Sanders-Dwyer 1

Ruhi Sarikaya 1

Lane Schwartz 1

Claudia Soria 1

Mickael Stefas 1

Sebastian Stüker 1

Thomas Tamisier 1

Tien-Ping Tan 1

Thibaut Thonet 1

Quan Hung Tran 1

Michael Ustaszewski 1

Nathalie Vallée 1

Dominique Vaufreydaz 1

Sylvie Voisin 1

Quang Vu-minh 1

Guillaume Wisniewski 1

Patrice Yemmene 1

Wei-zhong Zhu 1

Ingrid Zukerman 1

Ahmet Üstün 1

Venues