Kamel Smaili - ACL Anthology

Kamel Smaili

Also published as: Kamel Smaïli

2025

Modeling North African Dialects from Standard Languages
Yassine Toughrai | Kamel Smaïli | David Langlois
Proceedings of The Third Arabic Natural Language Processing Conference

Processing North African Arabic dialects presents significant challenges due to high lexical variability, frequent code-switching with French, and the use of both Arabic and Latin scripts. We address this with a phoneme-based normalization strategy that maps Arabic and French text into a simplified representation (Arabic rendered in Latin script), reflecting native reading patterns. Using this method, we pretrain BERT-based models on normalized Modern Standard Arabic and French only and evaluate them on Named Entity Recognition (NER) and text classification. Experiments show that normalized standard-language corpora yield competitive performance on North African dialect tasks; in zero-shot NER, Ar_20k surpasses dialect-pretrained baselines. Normalization improves vocabulary alignment, indicating that normalized standard corpora can suffice for developing dialect-supportive

ABDUL: A New Approach to Build Language Models for Dialects Using Formal Language Corpora Only
Yassine Toughrai | Kamel Smaïli | David Langlois
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)

Arabic dialects present major challenges for natural language processing (NLP) due to their diglossic nature, phonetic variability, and the scarcity of resources. To address this, we introduce a phoneme-like transcription approach that enables the training of robust language models for North African Dialects (NADs) using only formal language data, without the need for dialect-specific corpora. Our key insight is that Arabic dialects are highly phonetic, with NADs particularly influenced by European languages. This motivated us to develop a novel approach in which we convert Arabic script into a Latin-based representation, allowing our language model, ABDUL, to benefit from existing Latin-script corpora. Our method demonstrates strong performance in multi-label emotion classification and named entity recognition (NER) across various Arabic dialects. ABDUL achieves results comparable to or better than specialized and multilingual models such as DarijaBERT, DziriBERT, and mBERT. Notably, in the NER task, ABDUL outperforms mBERT by 5% in F1-score for Modern Standard Arabic (MSA), Moroccan, and Algerian Arabic, despite using a vocabulary four times smaller than mBERT.

2023

How can machine translation help generate Arab melodic improvisation?
Fadi Al-Ghawanmeh | Alexander Refsum Jensenius | Kamel Smaili
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This article presents a system to generate Arab music improvisation using machine translation (MT). To reach this goal, we developed an MT model to translate a vocal improvisation into an automatic instrumental oud (Arab lute) response. Given the melodic and non-metric musical form, it was necessary to develop efficient textual representations in order for classical MT models to be as successful as in common NLP applications. We experimented with statistical and neural MT trained on our parallel corpus (Vocal → Instrument) of 6991 sentences. The best model was then used to generate improvisation by iteratively translating the translations of the most common patterns of each maqam (n-grams), producing elaborated variations conditioned on listener feedback. We constructed a dataset of 717 instrumental improvisations to extract their n-grams. Objective evaluation of MT was conducted at two levels: a sentence-level evaluation using the BLEU metric, and a higher-level evaluation using musically informed metrics. Objective measures were consistent with one another. Subjective evaluations by experts from the maqam music tradition were promising and served as a useful reference for understanding the objective results.

2022

Language rehabilitation of people with BROCA aphasia using deep neural machine translation
Kamel Smaili | David Langlois | Peter Pribil
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

More than 13 million people suffer a stroke each year. Aphasia is a language disorder usually caused by a stroke that damages a specific area of the brain that controls the expression and understanding of language. Aphasia is characterized by a disturbance of the linguistic code affecting encoding and/or decoding of the language. Our project aims to propose a method that helps a person suffering from aphasia to communicate better with those around them. For this, we propose a machine translation system capable of correcting aphasic errors and helping the patient to communicate more easily. To build such a system, we need a parallel corpus; to our knowledge, no such corpus exists, especially for French. Therefore, the main challenge and objective of this task is to build a parallel corpus composed of sentences with aphasic errors and their corresponding corrections. We show how we create a pseudo-aphasia corpus from real data, and then demonstrate the feasibility of translating from aphasic data to natural language. The preliminary results show that the deep learning methods we used achieve correct translations corresponding to a BLEU of 38.6.

The Only Chance to Understand: Machine Translation of the Severely Endangered Low-resource Languages of Eurasia
Anna Mosolova | Kamel Smaili
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

Numerous machine translation systems have been proposed since the task first appeared. Today, new algorithms based on large language models sometimes surpass human performance on high-resource languages. This is not yet the case for low-resource languages, on which these algorithms have not shown equally impressive results. In this work, we compare 3 generations of machine translation models on 7 low-resource languages and go a step further by proposing a new method of automatic parallel data augmentation using a state-of-the-art generative model.

2020

Analyse de sentiments des vidéos en dialecte algérien (Sentiment analysis of videos in Algerian dialect)
Mohamed Amine Menacer | Karima Abidi | Nouha Othman | Kamel Smaïli
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Most existing work on sentiment analysis deals with Modern Standard Arabic and does not take into account the specificities of dialectal Arabic. This article presents a sentiment analysis system for texts extracted from videos in the Algerian dialect. In this work, we face two challenges: automatic speech recognition for the Algerian dialect and sentiment analysis of the recognized text. The speech recognition system is developed from a small spoken corpus. To compensate for the lack of data, we propose to exploit data from languages that influence the Algerian dialect, namely Standard Arabic and French. Sentiment analysis is based on automatically detecting the polarity of words according to their semantic proximity to other words of predetermined polarity.

Projet AMIS : résumé et traduction automatique de vidéos (AMIS project : automatic summarization and translation of videos)
Mohamed Amine Menacer | Dominique Fohr | Denis Jouvet | Karima Abidi | David Langlois | Kamel Smaïli
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 4 : Démonstrations et résumés d'articles internationaux

The video summarization and machine translation demonstration results from our work in the AMIS project. The project's goal was to help a traveler understand the news in a foreign country. To that end, the project proposes to automatically summarize and translate a video in a foreign language (here, Arabic). Another goal of the project was to compare the opinions and sentiments expressed in several comparable videos. The demonstration covers summarization, transcription, and translation. The examples shown make it possible to understand and qualitatively assess the project's results.

2019

The SMarT Classifier for Arabic Fine-Grained Dialect Identification
Karima Meftouh | Karima Abidi | Salima Harrat | Kamel Smaili
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper describes the approach adopted by the SMarT research group to build a dialect identification system in the framework of the MADAR shared task on Arabic fine-grained dialect identification. We experimented with several approaches and finally chose a Multinomial Naive Bayes classifier based on word and character n-grams in addition to language model probabilities. We achieved a macro accuracy of 67.73% and a macro-averaged F1-score of 67.31%.
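
The classifier family named in this abstract can be illustrated with a minimal sketch. It is not the SMarT system: the n-gram range, the add-one smoothing, and the toy data are assumptions made for illustration, and the language model probabilities mentioned in the abstract are left out.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (range is an assumption)."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def features(text):
    # Word unigrams and character n-grams, tagged so the two kinds never collide.
    words = [("w", tok) for tok in text.split()]
    chars = [("c", gram) for gram in char_ngrams(text)]
    return words + chars

class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(features(text))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in features(text):
                if f in self.vocab:  # features unseen in training are ignored
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy data with placeholder labels "A" and "B" (not real dialect samples).
clf = MultinomialNB().fit(
    ["aaa bab", "aba aab", "ccd dcd", "dcc cdc"],
    ["A", "A", "B", "B"],
)
```

Calling clf.predict on a new string returns the label whose word and character statistics explain it best; a real system would train on labeled dialect sentences instead of the toy strings above.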

2018

An Automatic Learning of an Algerian Dialect Lexicon by using Multilingual Word Embeddings
Karima Abidi | Kamel Smaïli
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

An enhanced automatic speech recognition system for Arabic
Mohamed Amine Menacer | Odile Mella | Dominique Fohr | Denis Jouvet | David Langlois | Kamel Smaili
Proceedings of the Third Arabic Natural Language Processing Workshop

Automatic speech recognition for Arabic is a very challenging task. Although the classical techniques for Automatic Speech Recognition (ASR) can be applied efficiently to Arabic, it is essential to take the specificities of the language into account to improve system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to the Arabic language, namely its complex morphology and the absence of short vowels in written text, which leads to several potential, often conflicting, vowelizations for each grapheme. We develop an ASR system for MSA using the Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system and a 12.2% relative improvement by rescoring the lattice and by rewriting the output with the correct hamza above or below the alif.
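
For readers unfamiliar with the two figures reported here, the sketch below shows how a word error rate and a relative improvement are commonly computed. It is a generic illustration, not the Kaldi scoring pipeline, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

def relative_improvement(baseline, improved):
    """Fraction by which an error rate drops relative to the baseline."""
    return (baseline - improved) / baseline
```

For example, a reference of four words recognized with one substitution and one deletion gives a WER of 0.5, and a drop from an error rate of 20.0 to 15.0 is a relative improvement of 0.25.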

2015

Statistical Machine Translation Improvement based on Phrase Selection
Cyrine Nasri | Chiraz Latiri | Kamel Smaili
Proceedings of the International Conference Recent Advances in Natural Language Processing

Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus
Karima Meftouh | Salima Harrat | Salma Jamoussi | Mourad Abbas | Kamel Smaili
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2014

Phrase-based language modelling for statistical machine translation
Achraf Ben Romdhane | Salma Jamoussi | Abdelmajid Ben Hamadou | Kamel Smaïli
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we present the MT system we submitted to the IWSLT2014 Evaluation Campaign. We participated in the English-French translation task. We focus on one of the most important components of SMT: the language model. The idea is to use a phrase-based language model. To that end, sequences from the source and target language models are retrieved and used to compute a phrase n-gram language model. These phrases are used to rewrite the parallel corpus, which is then used to compute a new translation model.

Building and Modelling Multilingual Subjective Corpora
Motaz Saad | David Langlois | Kamel Smaïli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such corpora are rare. In this work, we treat opinions as subjective or objective. We introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that labels sentences as subjective or objective, using English training data from the “movie reviews” domain. The annotation can be transferred to another language by classifying the English sentences of a parallel corpus and assigning the same labels to the aligned sentences in the other language. We also shed light on the link between opinion mining and statistical language modelling, and on how such corpora are useful for domain-specific language modelling. We show that the distinction between subjective and objective sentences tends to be stable across domains and languages. Our experiments show that language models trained on an objective (respectively subjective) corpus lead to better perplexities on an objective (respectively subjective) test set.

2013

LORIA System for the WMT13 Quality Estimation Shared Task
David Langlois | Kamel Smaïli
Proceedings of the Eighth Workshop on Statistical Machine Translation

Comparing Multilingual Comparable Articles Based On Opinions
Motaz Saad | David Langlois | Kamel Smaïli
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

LORIA System for the WMT12 Quality Estimation Shared Task
David Langlois | Sylvain Raybaud | Kamel Smaïli
Proceedings of the Seventh Workshop on Statistical Machine Translation

2011

Broadcast news speech-to-text translation experiments
Sylvain Raybaud | David Langlois | Kamel Smaïli
Proceedings of Machine Translation Summit XIII: System Presentations

2009

Word- and Sentence-Level Confidence Measures for Machine Translation
Sylvain Raybaud | Caroline Lavecchia | David Langlois | Kamel Smaïli
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2008

Une alternative aux modèles de traduction statistique d’IBM: Les triggers inter-langues (An alternative to IBM's statistical translation models: inter-lingual triggers)
Caroline Lavecchia | Kamel Smaïli | David Langlois
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we present a new approach to machine translation based on inter-lingual triggers. We first explain the concept of inter-lingual triggers and how they are determined. We then present the experiments conducted with these triggers in order to integrate them as well as possible into a complete machine translation process. To this end, we build translation tables from the inter-lingual triggers using several methods. We then compare our trigger-based translation system with a state-of-the-art system based on IBM Model 3 (Brown & al., 1993). The tests show that the translations generated by our system improve the BLEU score (Papineni & al., 2001) by 2.4% compared with those produced by the state-of-the-art system.

Phrase-Based Machine Translation based on Simulated Annealing
Caroline Lavecchia | David Langlois | Kamel Smaïli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we propose a new phrase-based translation model based on inter-lingual triggers. The originality of our method is twofold. First, we identify common source phrases. Then we use inter-lingual triggers to retrieve their translations. Furthermore, we treat the extraction of phrase translations as an optimization problem. For that, we use a simulated annealing algorithm to find the best phrase translations among all those determined by inter-lingual triggers. The best phrases are those that improve translation quality in terms of BLEU score. Tests are carried out on movie subtitle corpora. They show that our phrase-based machine translation (PBMT) system outperforms a state-of-the-art PBMT system by almost 7 points.

2007

Building a bilingual dictionary from movie subtitles based on inter-lingual triggers
Caroline Lavecchia | Kamel Smaili | David Langlois
Proceedings of Translating and the Computer 29

2006

Exploration et utilisation d’informations distantes dans les modèles de langage statistiques (Exploring and using distant information in statistical language models)
Armelle Brun | David Langlois | Kamel Smaïli
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In the context of statistical language modeling, we show that an n-gram model can be used with a history that is not necessarily the one with which it was trained. For example, an adverb in the history may be irrelevant to the prediction and should therefore be ignored by shifting the history used for prediction. Our study covers classical n-gram models and distant n-gram models and is applied to the bigram case. We present four use cases for two bigram models: distant and non-distant. We show that a history-dependent linear combination of these four cases improves the perplexity of the classical bigram model by 14%. We also examine several combinations that highlight the histories for which the proposed models perform well.
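
The idea of combining a classical bigram with a distant bigram can be sketched as a linear interpolation evaluated by perplexity. This is a simplified illustration only: it uses a single fixed weight rather than the history-dependent weights described in the abstract, it does not reproduce the four use cases, and the add-one smoothing and function names are assumptions.

```python
import math
from collections import Counter

def train_bigram(tokens, distance=1):
    """Count (history, word) pairs where the history lies `distance` words back.

    distance=1 gives a classical bigram, distance=2 a distant bigram.
    """
    pairs = Counter(zip(tokens, tokens[distance:]))
    histories = Counter(tokens[:-distance])
    return pairs, histories

def prob(model, history, word, vocab_size):
    """Add-one smoothed estimate of P(word | history)."""
    pairs, histories = model
    return (pairs[(history, word)] + 1) / (histories[history] + vocab_size)

def perplexity(tokens, classic, distant, vocab_size, lam=0.5):
    """Perplexity of lam * P(w_i | w_{i-1}) + (1 - lam) * P(w_i | w_{i-2})."""
    log_sum, n = 0.0, 0
    for i in range(2, len(tokens)):
        p = (lam * prob(classic, tokens[i - 1], tokens[i], vocab_size)
             + (1 - lam) * prob(distant, tokens[i - 2], tokens[i], vocab_size))
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n)

# Toy corpus; a real experiment would use separate training and test data.
corpus = "a b c a b c a b c".split()
classic_model = train_bigram(corpus, distance=1)
distant_model = train_bigram(corpus, distance=2)
pp = perplexity(corpus, classic_model, distant_model, vocab_size=3)
```

Setting lam to 1.0 recovers the classical bigram alone, so comparing perplexities across values of lam shows whether the distant history helps on a given corpus.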

Linguistic features modeling based on Partial New Cache
Kamel Smaïli | Caroline Lavecchia | Jean-Paul Haton
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Agreement in gender and number is a critical problem in statistical language modeling. One of the main problems in French speech recognition is the presence of misrecognized words due to incorrect agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and number of the word to predict. It is a dynamic, variable-length cache whose size is determined by syntagm delimiters. This model does not need any syntactic parsing; it is used like any other statistical language model. Several models were evaluated, and the best one achieves an improvement of more than 8 points in terms of perplexity.

2004

Fiabilité de la référence humaine dans la détection de thème (Reliability of the human reference in topic detection)
Armelle Brun | Kamel Smaïli
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we address topic detection in the context of automatic speech recognition. Combining several detection methods shows its limits, with a performance of 93.1%. This result leads us to question the reference topics of the paragraphs in our corpus. We therefore studied the reliability of these references, using in particular the Kappa measure and the Bayes error. We were thus able to show that the topic labels of the paragraphs in the test corpus probably contain errors; the topic detection performance obtained should therefore be used with caution.

A Complete Understanding Speech System Based on Semantic Concepts
Salma Jamoussi | Kamel Smaïli | Dominique Fohr | Jean-Paul Haton
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Language Modeling Using Dynamic Bayesian Networks
Murat Deviren | Khalid Daoudi | Kamel Smaïli
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Nouvelle approche de la sélection de vocabulaire pour la détection de thème (A new approach to vocabulary selection for topic detection)
Armelle Brun | Kamel Smaïli | Jean-Paul Haton
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In speech recognition, one way to improve system performance is to adapt the language models. A crucial step in this process is to detect the topic of the document being processed and then adapt the language model accordingly. In this article, we propose a new approach to building the vocabularies used for topic detection. It is based on developing vocabularies that are specific to and characteristic of the different topics. We show that this approach not only improves the performance of the methods but also uses smaller vocabularies. Moreover, it very significantly improves the performance of detection methods when they are combined.

Vers la compréhension automatique de la parole : extraction de concepts par réseaux bayésiens (Towards automatic speech understanding: concept extraction with Bayesian networks)
Salma Jamoussi | Kamel Smaïli | Jean-Paul Haton
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Automatic speech understanding can be seen as a problem of association between two different languages. The input is the query expressed in natural language; the output, just before the interpretation step, is the same query expressed in terms of concepts. A concept represents a well-defined meaning. It is defined by a set of words sharing the same semantic properties. In this article, we propose a Bayesian-network-based method for automatic concept extraction, together with three different approaches to the vector representation of words. These representations help a Bayesian network group words, thereby building the appropriate list of concepts from a training corpus. We conclude with a description of a post-processing step in which we label our queries and generate the appropriate SQL commands, thereby validating our understanding approach.

2002

Identification thématique hiérarchique : Application aux forums de discussions (Hierarchical topic identification: application to discussion forums)
Brigitte Bigi | Kamel Smaïli
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Statistical language models aim to give a statistical representation of language but suffer from many imperfections. Recent work has shown that these models can be improved if they can benefit from knowledge of the topic at hand and adapt to it. The topic of the document is then obtained by a topic identification mechanism, but the topics handled in this way often differ in granularity, which is why it seems appropriate to organize them in a hierarchy. This structuring of topics requires specific topic identification techniques. This article proposes a unigram-based statistical model to automatically identify the topic of a document within a predefined tree of possible topics. We also present a criterion that lets the model attach a degree of reliability to its decision. All experiments were carried out on data extracted from the 'fr' group of discussion forums.

WSIM : une méthode de détection de thème fondée sur la similarité entre mots (WSIM: a topic detection method based on word similarity)
Armelle Brun | Kamel Smaïli | Jean-Paul Haton
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Adapting language models in speech recognition systems has been one of the important challenges of recent years. It allows recognition to continue with the appropriate language model: the one corresponding to the identified topic. In this article we propose an original topic detection method based on topic-characteristic vocabularies and on the similarity between words and topics. This method outperforms the classical method (TFIDF) by 14%, a substantial gain in identification. We also show the value of choosing an appropriate vocabulary. Our vocabulary selection method achieves performance 3 times higher than that obtained with vocabularies built on word frequency.
