Fatiha Sadat

2024

pdf bib abs
Advancing Language Diversity and Inclusion: Towards a Neural Network-based Spell Checker and Correction for Wolof
Thierno Ibrahima Cissé | Fatiha Sadat
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

This paper introduces a novel approach to spell checking and correction for low-resource and under-represented languages, with a specific focus on an African language, Wolof. By leveraging the capabilities of transformer models and neural networks, we propose an efficient and practical system capable of correcting typos and improving text quality. Our proposed technique involves training a transformer model on a parallel corpus consisting of misspelled sentences and their correctly spelled counterparts, generated using a semi-automatic method. As we fine tune the model to transform misspelled text into accurate sentences, we demonstrate the immense potential of this approach to overcome the challenges faced by resource-scarce and under-represented languages in the realm of spell checking and correction. Our experimental results and evaluations exhibit promising outcomes, offering valuable insights that contribute to the ongoing endeavors aimed at enriching linguistic diversity and inclusion and thus improving digital communication accessibility for languages grappling with scarcity of resources and under-representation in the digital landscape.

2023

pdf bib abs
Beyond Information: Is ChatGPT Empathetic Enough?
Ahmed Belkhir | Fatiha Sadat
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper aims to explore and enhance ChatGPT’s abilities to generate more human-like conversations by taking into account the emotional state of the user. To achieve this goal, a prompt-driven Emotional Intelligence is used through the empathetic dialogue dataset in order to propose a more empathetic conversational language model. We propose two altered versions of ChatGPT as follows: (1) an emotion-infused version which takes the user’s emotion as input before generating responses using an emotion classifier based on ELECTRA ; and (2) the emotion adapting version that tries to accommodate for how the user feels without any external component. By analyzing responses of the two proposed altered versions and comparing them to the standard version of ChatGPT, we find that using the external emotion classifier leads to more frequent and pronounced use of positive emotions compared to the standard version. On the other hand, using simple prompt engineering to take the user emotion into consideration, does the opposite. Finally, comparisons with state-of-the-art models highlight the potential of prompt engineering to enhance the emotional abilities of chatbots based on large language models.

pdf bib abs
Towards the First Named Entity Recognition of Inuktitut for an Improved Machine Translation
Ngoc Tan Le | Soumia Kasdi | Fatiha Sadat
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

Named Entity Recognition is a crucial step to ensure good quality performance of several Natural Language Processing applications and tools, including machine translation and information retrieval. Moreover, it is considered as a fundamental module of many Natural Language Understanding tasks such as question-answering systems. This paper presents a first study on NER for an under-represented Indigenous Inuit language of Canada, Inuktitut, which lacks linguistic resources and large labeled data. Our proposed NER model for Inuktitut is built by transferring linguistic characteristics from English to Inuktitut, based on either rules or bilingual word embeddings. We provide an empirical study based on a comparison with the state of the art models and as well as intrinsic and extrinsic evaluations. In terms of Recall, Precision and F-score, the obtained results show the effectiveness of the proposed NER methods. Furthermore, it improved the performance of Inuktitut-English Neural Machine Translation.

pdf bib
Challenges and Issue of Gender Bias in Under-Represented Languages: An Empirical Study on Inuktitut-English NMT
Ngoc Tan Le | Oussama Hansal | Fatiha Sadat
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf bib abs
Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof
Thierno Ibrahima Cissé | Fatiha Sadat
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

This paper presents a spell checker and correction tool specifically designed for Wolof, an under-represented spoken language in Africa. The proposed spell checker leverages a combination of a trie data structure, dynamic programming, and the weighted Levenshtein distance to generate suggestions for misspelled words. We created novel linguistic resources for Wolof, such as a lexicon and a corpus of misspelled words, using a semi-automatic approach that combines manual and automatic annotation methods. Despite the limited data available for the Wolof language, the spell checker’s performance showed a predictive accuracy of 98.31% and a suggestion accuracy of 93.33%.Our primary focus remains the revitalization and preservation of Wolof as an Indigenous and spoken language in Africa, providing our efforts to develop novel linguistic resources. This work represents a valuable contribution to the growth of computational tools and resources for the Wolof language and provides a strong foundation for future studies in the automatic spell checking and correction field.

2022

pdf bib abs
Indigenous Language Revitalization and the Dilemma of Gender Bias
Oussama Hansal | Ngoc Tan Le | Fatiha Sadat
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Natural Language Processing (NLP), through its several applications, has been considered as one of the most valuable field in interdisciplinary researches, as well as in computer science. However, it is not without its flaws. One of the most common flaws is bias. This paper examines the main linguistic challenges of Inuktitut, an indigenous language of Canada, and focuses on gender bias identification and mitigation. We explore the unique characteristics of this language to help us understand the right techniques that can be used to identify and mitigate implicit biases. We use some methods to quantify the gender bias existing in Inuktitut word embeddings; then we proceed to mitigate the bias and evaluate the performance of the debiased embeddings. Next, we explain how approaches for detecting and reducing bias in English embeddings may be transferred to Inuktitut embeddings by properly taking into account the language’s particular characteristics. Next, we compare the effect of the debiasing techniques on Inuktitut and English. Finally, we highlight some future research directions which will further help to push the boundaries.

pdf bib abs
Challenges and Perspectives for Innu-Aimun within Indigenous Language Technologies
Antoine Cadotte | Tan Le Ngoc | Mathieu Boivin | Fatiha Sadat
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Innu-Aimun is an Algonquian language spoken in Eastern Canada. It is the language of the Innu, an Indigenous people that now lives for the most part in a dozen communities across Quebec and Labrador. Although it is alive, Innu-Aimun sees important preservation and revitalization challenges and issues. The state of its technology is still nascent, with very few existing applications. This paper proposes a first survey of the available linguistic resources and existing technology for Innu-Aimun. Considering the existing linguistic and textual resources, we argue that developing language technology is feasible and propose first steps towards NLP applications like machine translation. The goal of developing such technologies is first and foremost to help efforts in improving language transmission and cultural safety and preservation for Innu-Aimun speakers, as those are considered urgent and vital issues. Finally, we discuss the importance of close collaboration and consultation with the Innu community in order to ensure that language technologies are developed respectfully and in accordance with that goal.

pdf bib abs
Deep Learning-Based Morphological Segmentation for Indigenous Languages: A Study Case on Innu-Aimun
Ngoc Tan Le | Antoine Cadotte | Mathieu Boivin | Fatiha Sadat | Jimena Terraza
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Recent advances in the field of deep learning have led to a growing interest in the development of NLP approaches for low-resource and endangered languages. Nevertheless, relatively little research, related to NLP, has been conducted on indigenous languages. These languages are considered to be filled with complexities and challenges that make their study incredibly difficult in the NLP and AI fields. This paper focuses on the morphological segmentation of indigenous languages, an extremely challenging task because of polysynthesis, dialectal variations with rich morpho-phonemics, misspellings and resource-limited scenario issues. The proposed approach, towards a morphological segmentation of Innu-Aimun, an extremely low-resource indigenous language of Canada, is based on deep learning. Experiments and evaluations have shown promising results, compared to state-of-the-art rule-based and unsupervised approaches.

2021

pdf bib abs
On the Hidden Negative Transfer in Sequential Transfer Learning for Domain Adaptation from News to Tweets
Sara Meftah | Nasredine Semmar | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Proceedings of the Second Workshop on Domain Adaptation for NLP

Transfer Learning has been shown to be a powerful tool for Natural Language Processing (NLP) and has outperformed the standard supervised learning paradigm, as it takes benefit from the pre-learned knowledge. Nevertheless, when transfer is performed between less related domains, it brings a negative transfer, i.e. hurts the transfer performance. In this research, we shed light on the hidden negative transfer occurring when transferring from the News domain to the Tweets domain, through quantitative and qualitative analysis. Our experiments on three NLP taks: Part-Of-Speech tagging, Chunking and Named Entity recognition, reveal interesting insights.

pdf bib abs
Revitalisation des langues autochtones via le prétraitement et la traduction automatique neuronale: le cas de l’inuktitut (Revitalization and Preservation of Indigenous Languages through Natural Language Processing )
Tan Le Ngoc | Fatiha Sadat
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous présentons des résumés en français et en anglais de l’article (Tan Le & Sadat, 2020) présenté à la 28ème conférence internationale sur les linguistiques computationnelles (the 28th International Conference on Computational Linguistics) en 2020.

pdf bib abs
Towards a First Automatic Unsupervised Morphological Segmentation for Inuinnaqtun
Ngoc Tan Le | Fatiha Sadat
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Low-resource polysynthetic languages pose many challenges in NLP tasks, such as morphological analysis and Machine Translation, due to available resources and tools, and the morphologically complex languages. This research focuses on the morphological segmentation while adapting an unsupervised approach based on Adaptor Grammars in low-resource setting. Experiments and evaluations on Inuinnaqtun, one of Inuit language family in Northern Canada, considered a language that will be extinct in less than two generations, have shown promising results.

pdf bib
Towards a Low-Resource Neural Machine Translation for Indigenous Languages in Canada
Ngoc Tan Le | Fatiha Sadat
Traitement Automatique des Langues, Volume 62, Numéro 3 : Diversité Linguistique [Linguistic Diversity in Natural Language Processing]

2020

pdf bib abs
Multi-Task Supervised Pretraining for Neural Domain Adaptation
Sara Meftah | Nasredine Semmar | Mohamed-Ayoub Tahiri | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Two prevalent transfer learning approaches are used in recent works to improve neural networks performance for domains with small amounts of annotated data: Multi-task learning which involves training the task of interest with related auxiliary tasks to exploit their underlying similarities, and Mono-task fine-tuning, where the weights of the model are initialized with the pretrained weights of a large-scale labeled source domain and then fine-tuned with labeled data of the target domain (domain of interest). In this paper, we propose a new approach which takes advantage from both approaches by learning a hierarchical model trained across multiple tasks from a source domain, and is then fine-tuned on multiple tasks of the target domain. Our experiments on four tasks applied to the social media domain show that our proposed approach leads to significant improvements on all tasks compared to both approaches.

pdf bib
Low-Resource NMT: an Empirical Study on the Effect of Rich Morphological Word Segmentation on Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib abs
Towards a Multi-Dataset for Complex Emotions Learning Based on Deep Neural Networks
Billal Belainine | Fatiha Sadat | Mounir Boukadoum | Hakim Lounis
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources

In sentiment analysis, several researchers have used emoji and hashtags as specific forms of training and supervision. Some emotions, such as fear and disgust, are underrepresented in the text of social media. Others, such as anticipation, are absent. This research paper proposes a new dataset for complex emotion detection using a combination of several existing corpora in order to represent and interpret complex emotions based on the Plutchik’s theory. Our experiments and evaluations confirm that using Transfer Learning (TL) with a rich emotional corpus, facilitates the detection of complex emotions in a four-dimensional space. In addition, the incorporation of the rule on the reverse emotions in the model’s architecture brings a significant improvement in terms of precision, recall, and F-score.

pdf bib abs
Multilingual Neural Machine Translation involving Indian Languages
Pulkit Madaan | Fatiha Sadat
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Neural Machine Translations (NMT) models are capable of translating a single bilingual pair and require a new model for each new language pair. Multilingual Neural Machine Translation models are capable of translating multiple language pairs, even pairs which it hasn’t seen before in training. Availability of parallel sentences is a known problem in machine translation. Multilingual NMT model leverages information from all the languages to improve itself and performs better. We propose a data augmentation technique that further improves this model profoundly. The technique helps achieve a jump of more than 15 points in BLEU score from the multilingual NMT model. A BLEU score of 36.2 was achieved for Sindhi–English translation, which is higher than any score on the leaderboard of the LoResMT SharedTask at MT Summit 2019, which provided the data for the experiments.

pdf bib abs
Revitalization of Indigenous Languages through Pre-processing and Neural Machine Translation: The case of Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 28th International Conference on Computational Linguistics

Indigenous languages have been very challenging when dealing with NLP tasks and applications because of multiple reasons. These languages, in linguistic typology, are polysynthetic and highly inflected with rich morphophonemics and variable dialectal-dependent spellings; which affected studies on any NLP task in the recent years. Moreover, Indigenous languages have been considered as low-resource and/or endangered; which poses a great challenge for research related to Artificial Intelligence and its fields, such as NLP and machine learning. In this paper, we propose a study on the Inuktitut language through pre-processing and neural machine translation, in order to revitalize the language which belongs to the Inuit family, a type of polysynthetic languages spoken in Northern Canada. Our focus is concentrated on: (1) the preprocessing phase, and (2) applications on specific NLP tasks such as morphological analysis and neural machine translation, both for Indigenous languages of Canada. Our evaluations in the context of lowresource Inuktitut-English Neural Machine Translation, showed significant improvements of the proposed approach compared to the state-of-the-art.

2019

bib abs
Augmenting Named Entity Recognition with Commonsense Knowledge
Gaith Dekhili | Tan Ngoc Le | Fatiha Sadat
Proceedings of the 2019 Workshop on Widening NLP

Commonsense can be vital in some applications like Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. In spite of the remarkable success of neural network approaches on a variety of Natural Language Processing tasks, many of them struggle to react effectively in cases that require commonsense knowledge. In the present research, we take advantage of the availability of the open multilingual knowledge graph ConceptNet, by using it as an additional external resource in Named Entity Recognition (NER). Our proposed architecture involves BiLSTM layers combined with a CRF layer that was augmented with some features such as pre-trained word embedding layers and dropout layers. Moreover, apart from using word representations, we used also character-based representation to capture the morphological and the orthographic information. Our experiments and evaluations showed an improvement in the overall performance with +2.86 in the F1-measure. Commonsense reasonnig has been employed in other studies and NLP tasks but to the best of our knowledge, there is no study relating the integration of a commonsense knowledge base in NER.

pdf bib abs
Exploration de l’apprentissage par transfert pour l’analyse de textes des réseaux sociaux (Exploring neural transfer learning for social media text analysis )
Sara Meftah | Nasredine Semmar | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

L’apprentissage par transfert représente la capacité qu’un modèle neuronal entraîné sur une tâche à généraliser suffisamment et correctement pour produire des résultats pertinents sur une autre tâche proche mais différente. Nous présentons dans cet article une approche fondée sur l’apprentissage par transfert pour construire automatiquement des outils d’analyse de textes des réseaux sociaux en exploitant les similarités entre les textes d’une langue bien dotée (forme standard d’une langue) et les textes d’une langue peu dotée (langue utilisée en réseaux sociaux). Nous avons expérimenté notre approche sur plusieurs langues ainsi que sur trois tâches d’annotation linguistique (étiquetage morpho-syntaxique, annotation en parties du discours et reconnaissance d’entités nommées). Les résultats obtenus sont très satisfaisants et montrent l’intérêt de l’apprentissage par transfert pour tirer profit des modèles neuronaux profonds sans la contrainte d’avoir à disposition une quantité de données importante nécessaire pour avoir une performance acceptable.

pdf bib abs
Joint Learning of Pre-Trained and Random Units for Domain Adaptation in Part-of-Speech Tagging
Sara Meftah | Youssef Tamaazousti | Nasredine Semmar | Hassane Essafi | Fatiha Sadat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Fine-tuning neural networks is widely used to transfer valuable knowledge from high-resource to low-resource domains. In a standard fine-tuning scheme, source and target problems are trained using the same architecture. Although capable of adapting to new domains, pre-trained units struggle with learning uncommon target-specific patterns. In this paper, we propose to augment the target-network with normalised, weighted and randomly initialised units that beget a better adaptation while maintaining the valuable source knowledge. Our experiments on POS tagging of social media texts (Tweets domain) demonstrate that our method achieves state-of-the-art performances on 3 commonly used datasets.

2018

pdf bib
Retrieving Information from the French Lexical Network in RDF/OWL Format
Alexsandro Fonseca | Fatiha Sadat | François Lareau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Low-Resource Machine Transliteration Using Recurrent Neural Networks of Asian Languages
Ngoc Tan Le | Fatiha Sadat
Proceedings of the Seventh Named Entities Workshop

Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. With low-resource language pairs that do not have available and well-developed pronunciation lexicons, grapheme-to-phoneme models are particularly useful. These models are based on initial alignments between grapheme source and phoneme target sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences and pre-trained source and target embeddings to overcome the transliteration problem for a low-resource languages pair. We participated in the NEWS 2018 shared task for the English-Vietnamese transliteration task.

pdf bib abs
Using Neural Transfer Learning for Morpho-syntactic Tagging of South-Slavic Languages Tweets
Sara Meftah | Nasredine Semmar | Fatiha Sadat | Stephan Raaijmakers
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper, we describe a morpho-syntactic tagger of tweets, an important component of the CEA List DeepLIMA tool which is a multilingual text analysis platform based on deep learning. This tagger is built for the Morpho-syntactic Tagging of Tweets (MTT) Shared task of the 2018 VarDial Evaluation Campaign. The MTT task focuses on morpho-syntactic annotation of non-canonical Twitter varieties of three South-Slavic languages: Slovene, Croatian and Serbian. We propose to use a neural network model trained in an end-to-end manner for the three languages without any need for task or domain specific features engineering. The proposed approach combines both character and word level representations. Considering the lack of annotated data in the social media domain for South-Slavic languages, we have also implemented a cross-domain Transfer Learning (TL) approach to exploit any available related out-of-domain annotated data.

pdf bib
Improving the neural network-based machine transliteration for low-resourced language pair
Ngoc Tan Le | Fatiha Sadat
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

pdf bib
A Neural Network Transliteration Model in Low Resource Settings
Tan Le | Fatiha Sadat
Proceedings of Machine Translation Summit XVI: Research Track

pdf bib abs
Translittération automatique pour une paire de langues peu dotée ()
Ngoc Tan Le | Fatiha Sadat | Lucie Ménard
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

La translittération convertit phonétiquement les mots dans une langue source (i.e. français) en mots équivalents dans une langue cible (i.e. vietnamien). Cette conversion nécessite un nombre considérable de règles définies par les experts linguistes pour déterminer comment les phonèmes sont alignés ainsi que prendre en compte le système de phonologie de la langue cible. La problématique pour les paires de langues peu dotées lie à la pénurie des ressources linguistiques. Dans ce travail de recherche, nous présentons une démonstration de conversion de graphème en phonème pour pallier au problème de translittération pour une paire de langues peu dotée, avec une application sur français-vietnamien. Notre système nécessite un petit corpus d’apprentissage phonétique bilingue. Nous avons obtenu des résultats prometteurs, avec un gain de +4,40% de score BLEU, par rapport au système de base utilisant l’approche de traduction automatique statistique.

2016

pdf bib abs
Named Entity Recognition and Hashtag Decomposition to Improve the Classification of Tweets
Billal Belainine | Alexsandro Fonseca | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

In social networks services like Twitter, users are overwhelmed with huge amount of social data, most of which are short, unstructured and highly noisy. Identifying accurate information from this huge amount of data is indeed a hard task. Classification of tweets into organized form will help the user to easily access these required information. Our first contribution relates to filtering parts of speech and preprocessing this kind of highly noisy and short data. Our second contribution concerns the named entity recognition (NER) in tweets. Thus, the adaptation of existing language tools for natural languages, noisy and not accurate language tweets, is necessary. Our third contribution involves segmentation of hashtags and a semantic enrichment using a combination of relations from WordNet, which helps the performance of our classification system, including disambiguation of named entities, abbreviations and acronyms. Graph theory is used to cluster the words extracted from WordNet and tweets, based on the idea of connected components. We test our automatic classification system with four categories: politics, economy, sports and the medical field. We evaluate and compare several automatic classification systems using part or all of the items described in our contributions and found that filtering by part of speech and named entity recognition dramatically increase the classification precision to 77.3 %. Moreover, a classification system incorporating segmentation of hashtags and semantic enrichment by two relations from WordNet, synonymy and hyperonymy, increase classification precision up to 83.4 %.

pdf bib abs
UQAM-NTL: Named entity recognition in Twitter messages
Ngoc Tan Le | Fatma Mallek | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

This paper describes our system used in the 2nd Workshop on Noisy User-generated Text (WNUT) shared task for Named Entity Recognition (NER) in Twitter, in conjunction with Coling 2016. Our system is based on supervised machine learning by applying Conditional Random Fields (CRF) to train two classifiers for two evaluations. The first evaluation aims at predicting the 10 fine-grained types of named entities; while the second evaluation aims at predicting no type of named entities. The experimental results show that our method has significantly improved Twitter NER performance.

pdf bib abs
Lexfom: a lexical functions ontology model
Alexsandro Fonseca | Fatiha Sadat | François Lareau
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

A lexical function represents a type of relation that exists between lexical units (words or expressions) in any language. For example, the antonymy is a type of relation that is represented by the lexical function Anti: Anti(big) = small. Those relations include both paradigmatic relations, i.e. vertical relations, such as synonymy, antonymy and meronymy and syntagmatic relations, i.e. horizontal relations, such as objective qualification (legitimate demand), subjective qualification (fruitful analysis), positive evaluation (good review) and support verbs (pay a visit, subject to an interrogation). In this paper, we present the Lexical Functions Ontology Model (lexfom) to represent lexical functions and the relation among lexical units. Lexfom is divided in four modules: lexical function representation (lfrep), lexical function family (lffam), lexical function semantic perspective (lfsem) and lexical function relations (lfrel). Moreover, we show how it combines to Lexical Model for Ontologies (lemon), for the transformation of lexical networks into the semantic web formats. So far, we have implemented 100 simple and 500 complex lexical functions, and encoded about 8,000 syntagmatic and 46,000 paradigmatic relations, for the French language.

2015

pdf bib abs
Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection
Ngoc Tan Le | Fatiha Sadat
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

The creation of high-quality named entity annotated resources is time-consuming and an expensive process. Most of the gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. In Asian languages, this task is remained problematic. This paper focuses on an automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced pair of languages. We incrementally apply different cross-projection methods using parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on Vietnamese –French pair of languages show a good accuracy (F-score of 94.90%) when identifying named entities pairs and building a named entity annotated parallel corpus.

pdf bib
Multi-Dialect Machine Translation (MuDMat)
Fatiha Sadat
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

We present simple and effective methods for extracting comparable corpora and bilingual lexicons from Wikipedia. We shall exploit the large scale and the structure of Wikipedia articles to extract two resources that will be very useful for natural language applications. We build a comparable corpus from Wikipedia using categories as topic restrictions and we extract bilingual lexicons from inter-language links aligned with statistical method or a combined statistical and linguistic method.

pdf bib
Extraction de lexiques bilingues à partir de Wikipédia (Bilingual lexicon extraction from Wikipedia) [in French]
Rahma Sellami | Fatiha Sadat | Lamia Hadrich Belguith
JEP-TALN-RECITAL 2012, Workshop TALAf 2012: Traitement Automatique des Langues Africaines (TALAf 2012: African Language Processing)

2010

pdf bib abs
Exploitation de Wikipédia pour l’Enrichissement et la Construction des Ressources Linguistiques
Fatiha Sadat | Alexandre Terrasa
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Cet article présente une approche et des résultats utilisant l’encyclopédie en ligne Wikipédia comme ressource semi-structurée de connaissances linguistiques et en particulier comme un corpus comparable pour l’extraction de terminologie bilingue. Cette approche tend à extraire d’abord des paires de terme et traduction à partir de types des informations, liens et textes de Wikipédia. L’étape suivante consiste à l’utilisation de l’information linguistique afin de ré-ordonner les termes et leurs traductions pertinentes et ainsi éliminer les termes cibles inutiles. Les évaluations préliminaires utilisant les paires de langues français-anglais, japonais-français et japonais-anglais ont montré une bonne qualité des paires de termes extraits. Cette étude est très favorable pour la construction et l’enrichissement des ressources linguistiques tels que les dictionnaires et ontologies multilingues. Aussi, elle est très utile pour un système de recherche d’information translinguistique (RIT).

pdf bib
Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction
Fatiha Sadat
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

2006

pdf bib
Arabic Preprocessing Schemes for Statistical Machine Translation
Nizar Habash | Fatiha Sadat
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

pdf bib
Combination of Arabic Preprocessing Schemes for Statistical Machine Translation
Fatiha Sadat | Nizar Habash
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib abs
Système de traduction automatique statistique combinant différentes ressources
Fatiha Sadat | George Foster | Roland Kuhn
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Cet article décrit une approche combinant différents modèles statistiques pour la traduction automatique basée sur les segments. Pour ce faire, différentes ressources sont utilisées, dont deux corpus parallèles aux caractéristiques différentes et un dictionnaire de terminologie bilingue et ce, afin d’améliorer la performance quantitative et qualitative du système de traduction. Nous évaluons notre approche sur la paire de langues français-anglais et montrons comment la combinaison des ressources proposées améliore de façon significative les résultats.

pdf bib abs
Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani | Fred Popowich | Fatiha Sadat
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT

After providing a brief introduction to the transliteration problem, and highlighting some issues specific to Arabic to English translation, a three phase algorithm is introduced as a computational solution to the problem. The algorithm is based on a Hidden Markov Model approach, but also leverages information available in on-line databases. The algorithm is then evaluated, and shown to achieve accuracy approaching .80%