Fethi Bougares


2023

ELYADATA at WojoodNER Shared Task: Data and Model-centric Approaches for Arabic Flat and Nested NER
Imen Laouirine | Haroun Elleuch | Fethi Bougares
Proceedings of ArabicNLP 2023

This paper describes our submissions to the WojoodNER shared task organized during the first ArabicNLP conference. We participated in the two proposed sub-tasks of flat and nested Named Entity Recognition (NER). Our systems were ranked first out of eight and third out of eleven in the Nested NER and Flat NER sub-tasks, respectively. All our primary submissions are based on DiffusionNER models (Shen et al., 2023), where the NER task is formulated as a boundary-denoising diffusion process. Our experiments on nested WojoodNER achieve the best results, with a micro F1-score of 93.73%. For the flat sub-task, our primary system was the third-best system, with a micro F1-score of 91.92%.

ON-TRAC Consortium Systems for the IWSLT 2023 Dialectal and Low-resource Speech Translation Tasks
Antoine Laurent | Souhir Gahbiche | Ha Nguyen | Haroun Elleuch | Fethi Bougares | Antoine Thiol | Hugo Riguidel | Salima Mdhaffar | Gaëlle Laperrière | Lucas Maison | Sameer Khurana | Yannick Estève
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper describes the ON-TRAC consortium speech translation systems developed for the IWSLT 2023 evaluation campaign. Overall, we participated in three speech translation tracks featured in the low-resource and dialect speech translation shared tasks, namely: (i) spoken Tamasheq to written French, (ii) spoken Pashto to written French, and (iii) spoken Tunisian to written English. All our primary submissions are based on an end-to-end speech-to-text neural architecture using a pretrained SAMU-XLSR model as the speech encoder and an mBART model as the decoder. The SAMU-XLSR model is built from XLS-R 128 in order to generate language-agnostic sentence-level embeddings. Its training is driven by the LaBSE model, which is trained on a multilingual text dataset. This architecture allows us to improve the input speech representations and achieve significant improvements compared to conventional end-to-end speech translation systems.

2022

Proceedings of the Seventh Conference on Machine Translation (WMT)
Philipp Koehn | Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Tom Kocmi | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Marco Turchi | Marcos Zampieri
Proceedings of the Seventh Conference on Machine Translation (WMT)

Speech Resources in the Tamasheq Language
Marcely Zanon Boito | Fethi Bougares | Florentin Barbier | Souhir Gahbiche | Loïc Barrault | Mickael Rouvier | Yannick Estève
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller 17-hour parallel corpus of audio recordings in Tamasheq, with utterance-level translations in French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.

ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks
Marcely Zanon Boito | John Ortega | Hugo Riguidel | Antoine Laurent | Loïc Barrault | Fethi Bougares | Firas Chaabani | Ha Nguyen | Florentin Barbier | Souhir Gahbiche | Yannick Estève
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR. Our results show that in our settings pipeline approaches are still very competitive, and that with the use of transfer learning, they can outperform end-to-end models for speech translation (ST). For the Tamasheq-French dataset (low-resource track), our primary submission leverages intermediate representations from a wav2vec 2.0 model trained on 234 hours of Tamasheq audio, while our contrastive model uses a French phonetic transcription of the Tamasheq audio as input to a Conformer speech translation architecture jointly trained on automatic speech recognition, ST and machine translation losses. Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models. Results also illustrate that even approximate phonetic transcriptions can improve ST scores.

Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Houda Bouamor | Hend Al-Khalifa | Kareem Darwish | Owen Rambow | Fethi Bougares | Ahmed Abdelali | Nadi Tomeh | Salam Khalifa | Wajdi Zaghouani
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

End-to-End Speech Translation of Arabic to English Broadcast News
Fethi Bougares | Salim Jouili
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. The ST task has long been addressed using a pipeline approach with two modules: first an Automatic Speech Recognition (ASR) system in the source language, followed by a text-to-text Machine Translation (MT) system. In the past few years, we have seen a paradigm shift towards end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data were used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios, including transfer learning and data augmentation techniques.

2021

Proceedings of the Sixth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Tom Kocmi | André Martins | Makoto Morishita | Christof Monz
Proceedings of the Sixth Conference on Machine Translation

Proceedings of the Sixth Arabic Natural Language Processing Workshop
Nizar Habash | Houda Bouamor | Hazem Hajj | Walid Magdy | Wajdi Zaghouani | Fethi Bougares | Nadi Tomeh | Ibrahim Abu Farha | Samia Touileb
Proceedings of the Sixth Arabic Natural Language Processing Workshop

2020

ON-TRAC Consortium for End-to-End and Simultaneous Speech Translation Challenge Tasks at IWSLT 2020
Maha Elbayad | Ha Nguyen | Fethi Bougares | Natalia Tomashenko | Antoine Caubrière | Benjamin Lecouteux | Yannick Estève | Laurent Besacier
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2020, offline speech translation and simultaneous speech translation. ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). Attention-based encoder-decoder models, trained end-to-end, were used for our submissions to the offline speech translation track. Our contributions focused on data augmentation and ensembling of multiple models. In the simultaneous speech translation track, we build on Transformer-based wait-k models for the text-to-text subtask. For speech-to-text simultaneous translation, we attach a wait-k MT system to a hybrid ASR system. We propose an algorithm to control the latency of the ASR+MT cascade and achieve a good latency-quality trade-off on both subtasks.

Text and Speech-based Tunisian Arabic Sub-Dialects Identification
Najla Ben Abdallah | Saméh Kchaou | Fethi Bougares
Proceedings of the Twelfth Language Resources and Evaluation Conference

Dialect IDentification (DID) is a challenging task, and it becomes more complicated when the dialects to be identified belong to the same country. Indeed, dialects of the same country are closely related and exhibit significant overlap at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term 'sub-dialect' to refer to dialects belonging to the same country. Our experiments aim to discriminate between Tunisian sub-dialects from four different cities, namely Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances was collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much more distinguishable using the speech modality. Indeed, we were able to reach an F-1 score of 93.75% using our best speech-based identification system, while the F-1 score is limited to 54.16% using text-based DID on the same test set.

Proceedings of the Fifth Arabic Natural Language Processing Workshop
Imed Zitouni | Muhammad Abdul-Mageed | Houda Bouamor | Fethi Bougares | Mahmoud El-Haj | Nadi Tomeh | Wajdi Zaghouani
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Proceedings of the Fifth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Yvette Graham | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri
Proceedings of the Fifth Conference on Machine Translation

Findings of the First Shared Task on Lifelong Learning Machine Translation
Loïc Barrault | Magdalena Biesialska | Marta R. Costa-jussà | Fethi Bougares | Olivier Galibert
Proceedings of the Fifth Conference on Machine Translation

A lifelong learning system can adapt to new data without forgetting previously acquired knowledge. In this paper, we introduce the first benchmark for lifelong learning machine translation. For this purpose, we provide training, lifelong and test data sets for two language pairs: English-German and English-French. Additionally, we report the results of our baseline systems, which we make available to the public. The goal of this shared task is to encourage research on the emerging topic of lifelong learning machine translation.

2019

Étude de l’apprentissage par transfert de systèmes de traduction automatique neuronaux (Study on transfer learning in neural machine translation)
Adrien Bardet | Fethi Bougares | Loïc Barrault
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Transfer learning is one solution to the problem of training neural machine translation systems for low-resource language pairs. In this article, we present an analysis of this method. We aim to evaluate the impact of the amount of data and of the proximity of the languages involved on obtaining the best possible transfer. We take these two parameters into account not only for a "classic" translation task but also when data corpora are lacking. Finally, we propose an approach in which data volume and language proximity are combined, so that one no longer has to choose between these two elements.

Proceedings of the Fourth Arabic Natural Language Processing Workshop
Wassim El-Hajj | Lamia Hadrich Belguith | Fethi Bougares | Walid Magdy | Imed Zitouni | Nadi Tomeh | Mahmoud El-Haj | Wajdi Zaghouani
Proceedings of the Fourth Arabic Natural Language Processing Workshop

LIUM-MIRACL Participation in the MADAR Arabic Dialect Identification Shared Task
Saméh Kchaou | Fethi Bougares | Lamia Hadrich-Belguith
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper describes the joint participation of the LIUM and MIRACL Laboratories in the Arabic dialect identification challenge of the MADAR Shared Task (Bouamor et al., 2019), conducted during the Fourth Arabic Natural Language Processing Workshop (WANLP 2019). We participated in the Travel Domain Dialect Identification subtask. We built several systems and explored different techniques, including conventional machine learning methods and deep learning algorithms. Deep learning approaches did not perform well on this task. We experimented with several classification systems and were able to identify the dialect of an input sentence with an F1-score of 65.41% on the official test set, using only the training data supplied by the shared task organizers.

LIUM’s Contributions to the WMT2019 News Translation Task: Data and Systems for German-French Language Pairs
Fethi Bougares | Jane Wottawa | Anne Baillot | Loïc Barrault | Adrien Bardet
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the neural machine translation (NMT) systems of the LIUM Laboratory developed for the French↔German news translation task of the Fourth Conference on Machine Translation (WMT 2019). The chosen language pair is included for the first time in the WMT news translation task. We describe how the training and evaluation data were created. We also present our participation in the French↔German translation directions using self-attentional Transformer networks with small and big architectures.

2018

Findings of the Third Shared Task on Multimodal Machine Translation
Loïc Barrault | Fethi Bougares | Lucia Specia | Chiraag Lala | Desmond Elliott | Stella Frank
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the results from the third shared task on multimodal machine translation. In this task a source sentence in English is supplemented by an image and participating systems are required to generate a translation for such a sentence into German, French or Czech. The image can be used in addition to (or instead of) the source sentence. This year the task was extended with a third target language (Czech) and a new test set. In addition, a variant of this task was introduced with its own test set where the source sentence is given in multiple languages: English, French and German, and participating systems are required to generate a translation in Czech. Seven teams submitted 45 different systems to the two variants of the task. Compared to last year, the performance of the multimodal submissions improved, but text-only systems remain competitive.

LIUM-CVC Submissions for WMT18 Multimodal Translation Task
Ozan Caglayan | Adrien Bardet | Fethi Bougares | Loïc Barrault | Kai Wang | Marc Masana | Luis Herranz | Joost van de Weijer
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the multimodal Neural Machine Translation systems developed by LIUM and CVC for the WMT18 Shared Task on Multimodal Translation. This year we propose several modifications to our previous multimodal attention architecture in order to better integrate convolutional features and refine them using encoder-side information. Our final constrained submissions ranked first for English→French and second for English→German among the constrained submissions, according to the automatic evaluation metric METEOR.

2017

Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments
Salima Medhaffar | Fethi Bougares | Yannick Estève | Lamia Hadrich-Belguith
Proceedings of the Third Arabic Natural Language Processing Workshop

Dialectal Arabic (DA) is significantly different from the Arabic language taught in schools and used in written communication and formal speech (broadcast news, religion, politics, etc.). There is much existing research in the field of Arabic language Sentiment Analysis (SA); however, it is generally restricted to Modern Standard Arabic (MSA) or to some dialects of economic or political interest. In this paper we are interested in the SA of the Tunisian Dialect. We use Machine Learning techniques to determine the polarity of comments written in the Tunisian Dialect. First, we evaluate the performance of SA systems with models trained using freely available MSA and multi-dialectal data sets. We then collect and annotate a Tunisian Dialect corpus of 17,000 comments from Facebook. This corpus yields a significant accuracy improvement compared to the best model trained on other Arabic dialects or MSA data. We believe that this first freely available corpus will be valuable to researchers working in the field of Tunisian Sentiment Analysis and similar areas.

Word Representations in Factored Neural Machine Translation
Franck Burlot | Mercedes García-Martínez | Loïc Barrault | Fethi Bougares | François Yvon
Proceedings of the Second Conference on Machine Translation

Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description
Desmond Elliott | Stella Frank | Loïc Barrault | Fethi Bougares | Lucia Specia
Proceedings of the Second Conference on Machine Translation

LIUM Machine Translation Systems for WMT17 News Translation Task
Mercedes García-Martínez | Ozan Caglayan | Walid Aransa | Adrien Bardet | Fethi Bougares | Loïc Barrault
Proceedings of the Second Conference on Machine Translation

LIUM-CVC Submissions for WMT17 Multimodal Translation Task
Ozan Caglayan | Walid Aransa | Adrien Bardet | Mercedes García-Martínez | Fethi Bougares | Loïc Barrault | Marc Masana | Luis Herranz | Joost van de Weijer
Proceedings of the Second Conference on Machine Translation

2016

Does Multimodality Help Human and Machine for Translation and Image Captioning?
Ozan Caglayan | Walid Aransa | Yaxing Wang | Marc Masana | Mercedes García-Martínez | Fethi Bougares | Loïc Barrault | Joost van de Weijer
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

SHEF-LIUM-NN: Sentence level Quality Estimation with Neural Network Features
Kashif Shah | Fethi Bougares | Loïc Barrault | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Factored Neural Machine Translation Architectures
Mercedes García-Martínez | Loïc Barrault | Fethi Bougares
Proceedings of the 13th International Conference on Spoken Language Translation

In this paper we investigate the potential of neural machine translation (NMT) when the linguistic aspects of the target language are taken into consideration. From this standpoint, the NMT approach with attention mechanism [1] is extended in order to produce several linguistically derived outputs. We train our model to simultaneously output the lemma and its corresponding factors (e.g. part-of-speech, gender, number). The word-level translation is built with a mapping function using a priori linguistic information. Compared to the standard NMT system, the factored architecture significantly increases vocabulary coverage while decreasing the number of unknown words. With its richer architecture, the Factored NMT approach allows us to implement several training setups, which are discussed in detail throughout this paper. On the IWSLT’15 English-to-French task, the FNMT model outperforms the NMT model in terms of BLEU score. A qualitative analysis of the output on a set of test sentences shows the effectiveness of the FNMT model.

2015

Investigating Continuous Space Language Models for Machine Translation Quality Estimation
Kashif Shah | Raymond W. M. Ng | Fethi Bougares | Lucia Specia
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Continuous Adaptation to User Feedback for Statistical Machine Translation
Frédéric Blain | Fethi Bougares | Amir Hazem | Loïc Barrault | Holger Schwenk
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SHEF-NN: Translation Quality Estimation with Neural Networks
Kashif Shah | Varvara Logacheva | Gustavo Paetzold | Frederic Blain | Daniel Beck | Fethi Bougares | Lucia Specia
Proceedings of the Tenth Workshop on Statistical Machine Translation

UMMU@QALB-2015 Shared Task: Character and Word level SMT pipeline for Automatic Error Correction of Arabic Text
Fethi Bougares | Houda Bouamor
Proceedings of the Second Workshop on Arabic Natural Language Processing

Incremental Adaptation Strategies for Neural Network Language Models
Alex Ter-Sarkisov | Holger Schwenk | Fethi Bougares | Loïc Barrault
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

2014

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Kyunghyun Cho | Bart van Merriënboer | Caglar Gulcehre | Dzmitry Bahdanau | Fethi Bougares | Holger Schwenk | Yoshua Bengio
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2012

Avancées dans le domaine de la transcription automatique par décodage guidé (Improvements on driven decoding system combination) [in French]
Fethi Bougares | Yannick Estève | Paul Deléglise | Mickael Rouvier | Georges Linarès
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

2011

LIUM’s systems for the IWSLT 2011 speech translation tasks
Anthony Rousseau | Fethi Bougares | Paul Deléglise | Holger Schwenk | Yannick Estève
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the three systems developed by the LIUM for the IWSLT 2011 evaluation campaign. We participated in three of the proposed tasks, namely the Automatic Speech Recognition task (ASR), the ASR system combination task (ASR_SC) and the Spoken Language Translation task (SLT), since these tasks are all related to speech translation. We present the approaches and specificities we developed on each task.

2009

LIG approach for IWSLT09
Fethi Bougares | Laurent Besacier | Hervé Blanchon
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the LIG experiments in the context of the IWSLT09 evaluation (Arabic to English Statistical Machine Translation task). Arabic is a morphologically rich language, and recent experiments in our laboratory have shown that the performance of Arabic to English SMT systems varies greatly according to the Arabic morphological segmenters applied. Based on this observation, we propose to use multiple segmentations simultaneously for machine translation of Arabic. The core idea is to keep the ambiguity of the Arabic segmentation in the system input (using confusion networks or lattices). Then, we hope that the best segmentation will be chosen during MT decoding. The mathematics of this multiple segmentation approach are given. Practical implementations in the case of verbatim text translation as well as speech translation (outside the scope of IWSLT09 this year) are proposed. Experiments conducted in the framework of the IWSLT evaluation campaign show the potential of the multiple segmentation approach. The last part of this paper explains in detail the different systems submitted by LIG at IWSLT09 and the results obtained.