Fatiha Sadat


2021

Revitalisation des langues autochtones via le prétraitement et la traduction automatique neuronale: le cas de l’inuktitut (Revitalization and Preservation of Indigenous Languages through Natural Language Processing)
Tan Le Ngoc | Fatiha Sadat
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

We present French and English summaries of the article (Tan Le & Sadat, 2020), presented at the 28th International Conference on Computational Linguistics in 2020.

On the Hidden Negative Transfer in Sequential Transfer Learning for Domain Adaptation from News to Tweets
Sara Meftah | Nasredine Semmar | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Proceedings of the Second Workshop on Domain Adaptation for NLP

Transfer Learning has been shown to be a powerful tool for Natural Language Processing (NLP) and has outperformed the standard supervised learning paradigm, as it benefits from pre-learned knowledge. Nevertheless, when transfer is performed between less related domains, it can cause negative transfer, i.e. it hurts performance on the target domain. In this research, we shed light on the hidden negative transfer that occurs when transferring from the News domain to the Tweets domain, through quantitative and qualitative analysis. Our experiments on three NLP tasks (Part-Of-Speech tagging, Chunking and Named Entity Recognition) reveal interesting insights.

Towards a First Automatic Unsupervised Morphological Segmentation for Inuinnaqtun
Ngoc Tan Le | Fatiha Sadat
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Low-resource polysynthetic languages pose many challenges for NLP tasks such as morphological analysis and Machine Translation, owing both to the limited resources and tools available and to their morphological complexity. This research focuses on morphological segmentation, adapting an unsupervised approach based on Adaptor Grammars to a low-resource setting. Experiments and evaluations on Inuinnaqtun, a member of the Inuit language family spoken in Northern Canada and considered likely to become extinct in less than two generations, have shown promising results.

2020

Revitalization of Indigenous Languages through Pre-processing and Neural Machine Translation: The case of Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 28th International Conference on Computational Linguistics

Indigenous languages are very challenging for NLP tasks and applications, for several reasons. In terms of linguistic typology, these languages are polysynthetic and highly inflected, with rich morphophonemics and variable, dialect-dependent spellings, which has affected studies on any NLP task in recent years. Moreover, Indigenous languages are considered low-resource and/or endangered, which poses a great challenge for research in Artificial Intelligence and its subfields, such as NLP and machine learning. In this paper, we propose a study of the Inuktitut language through pre-processing and neural machine translation, with the aim of revitalizing this language, which belongs to the Inuit family, a group of polysynthetic languages spoken in Northern Canada. We focus on (1) the preprocessing phase and (2) applications to specific NLP tasks such as morphological analysis and neural machine translation, both for Indigenous languages of Canada. Our evaluations in the context of low-resource Inuktitut-English neural machine translation show significant improvements of the proposed approach over the state of the art.

Multilingual Neural Machine Translation involving Indian Languages
Pulkit Madaan | Fatiha Sadat
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation

Neural Machine Translation (NMT) models are capable of translating a single bilingual pair and require a new model for each new language pair. Multilingual Neural Machine Translation models are capable of translating multiple language pairs, even pairs they have not seen during training. The limited availability of parallel sentences is a known problem in machine translation. A multilingual NMT model leverages information from all the languages to improve itself and performs better. We propose a data augmentation technique that further improves this model substantially. The technique yields a gain of more than 15 BLEU points over the multilingual NMT model. A BLEU score of 36.2 was achieved for Sindhi–English translation, which is higher than any score on the leaderboard of the LoResMT Shared Task at MT Summit 2019, which provided the data for the experiments.

Towards a Multi-Dataset for Complex Emotions Learning Based on Deep Neural Networks
Billal Belainine | Fatiha Sadat | Mounir Boukadoum | Hakim Lounis
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources

In sentiment analysis, several researchers have used emoji and hashtags as specific forms of training and supervision. Some emotions, such as fear and disgust, are underrepresented in social media text. Others, such as anticipation, are absent. This research paper proposes a new dataset for complex emotion detection that combines several existing corpora in order to represent and interpret complex emotions based on Plutchik’s theory. Our experiments and evaluations confirm that using Transfer Learning (TL) with a rich emotional corpus facilitates the detection of complex emotions in a four-dimensional space. In addition, incorporating the rule on reverse emotions into the model’s architecture brings a significant improvement in terms of precision, recall, and F-score.

Low-Resource NMT: an Empirical Study on the Effect of Rich Morphological Word Segmentation on Inuktitut
Tan Ngoc Le | Fatiha Sadat
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Multi-Task Supervised Pretraining for Neural Domain Adaptation
Sara Meftah | Nasredine Semmar | Mohamed-Ayoub Tahiri | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Two prevalent transfer learning approaches are used in recent work to improve neural network performance for domains with small amounts of annotated data: multi-task learning, which trains the task of interest together with related auxiliary tasks to exploit their underlying similarities, and mono-task fine-tuning, where the weights of the model are initialized with weights pretrained on a large-scale labeled source domain and then fine-tuned on labeled data from the target domain (the domain of interest). In this paper, we propose a new approach that takes advantage of both: a hierarchical model is trained across multiple tasks from a source domain and is then fine-tuned on multiple tasks of the target domain. Our experiments on four tasks applied to the social media domain show that the proposed approach leads to significant improvements on all tasks compared to both approaches.

2019

Joint Learning of Pre-Trained and Random Units for Domain Adaptation in Part-of-Speech Tagging
Sara Meftah | Youssef Tamaazousti | Nasredine Semmar | Hassane Essafi | Fatiha Sadat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Fine-tuning neural networks is widely used to transfer valuable knowledge from high-resource to low-resource domains. In a standard fine-tuning scheme, source and target problems are trained using the same architecture. Although capable of adapting to new domains, pre-trained units struggle to learn uncommon target-specific patterns. In this paper, we propose to augment the target network with normalised, weighted and randomly initialised units that enable better adaptation while preserving the valuable source knowledge. Our experiments on POS tagging of social media texts (Tweets domain) demonstrate that our method achieves state-of-the-art performance on 3 commonly used datasets.

Augmenting Named Entity Recognition with Commonsense Knowledge
Gaith Dekhili | Tan Ngoc Le | Fatiha Sadat
Proceedings of the 2019 Workshop on Widening NLP

Commonsense can be vital in applications such as Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. Despite the remarkable success of neural network approaches on a variety of Natural Language Processing tasks, many of them struggle in cases that require commonsense knowledge. In the present research, we take advantage of the open multilingual knowledge graph ConceptNet, using it as an additional external resource in Named Entity Recognition (NER). Our proposed architecture involves BiLSTM layers combined with a CRF layer, augmented with features such as pre-trained word embedding layers and dropout layers. Moreover, in addition to word representations, we also used character-based representations to capture morphological and orthographic information. Our experiments and evaluations showed an improvement in overall performance of +2.86 in F1-measure. Commonsense reasoning has been employed in other studies and NLP tasks but, to the best of our knowledge, no study has addressed the integration of a commonsense knowledge base into NER.

Exploration de l’apprentissage par transfert pour l’analyse de textes des réseaux sociaux (Exploring neural transfer learning for social media text analysis)
Sara Meftah | Nasredine Semmar | Youssef Tamaazousti | Hassane Essafi | Fatiha Sadat
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Transfer learning is the ability of a neural model trained on one task to generalize sufficiently and correctly to produce relevant results on another task that is related but different. In this paper, we present a transfer learning approach for automatically building social media text analysis tools by exploiting the similarities between texts in a well-resourced language (the standard form of a language) and texts in a low-resourced language (the language used on social media). We evaluated our approach on several languages and on three linguistic annotation tasks (morpho-syntactic tagging, part-of-speech annotation and named entity recognition). The results are very satisfactory and show the value of transfer learning for taking advantage of deep neural models without requiring the large amount of data normally needed to reach acceptable performance.

2018

Improving the neural network-based machine transliteration for low-resourced language pair
Ngoc Tan Le | Fatiha Sadat
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Retrieving Information from the French Lexical Network in RDF/OWL Format
Alexsandro Fonseca | Fatiha Sadat | François Lareau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Low-Resource Machine Transliteration Using Recurrent Neural Networks of Asian Languages
Ngoc Tan Le | Fatiha Sadat
Proceedings of the Seventh Named Entities Workshop

Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. They are particularly useful for low-resource language pairs that lack well-developed pronunciation lexicons. These models are based on initial alignments between source grapheme sequences and target phoneme sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences, together with pre-trained source and target embeddings, to address the transliteration problem for a low-resource language pair. We participated in the NEWS 2018 shared task for the English-Vietnamese transliteration task.

Using Neural Transfer Learning for Morpho-syntactic Tagging of South-Slavic Languages Tweets
Sara Meftah | Nasredine Semmar | Fatiha Sadat | Stephan Raaijmakers
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper, we describe a morpho-syntactic tagger of tweets, an important component of the CEA List DeepLIMA tool, which is a multilingual text analysis platform based on deep learning. This tagger was built for the Morpho-syntactic Tagging of Tweets (MTT) shared task of the 2018 VarDial Evaluation Campaign. The MTT task focuses on morpho-syntactic annotation of non-canonical Twitter varieties of three South-Slavic languages: Slovene, Croatian and Serbian. We propose to use a neural network model trained in an end-to-end manner for the three languages, without any need for task-specific or domain-specific feature engineering. The proposed approach combines character-level and word-level representations. Considering the lack of annotated data in the social media domain for South-Slavic languages, we have also implemented a cross-domain Transfer Learning (TL) approach to exploit any available related out-of-domain annotated data.

2017

Translittération automatique pour une paire de langues peu dotée
Ngoc Tan Le | Fatiha Sadat | Lucie Ménard
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

Transliteration phonetically converts words in a source language (e.g. French) into equivalent words in a target language (e.g. Vietnamese). This conversion requires a considerable number of rules defined by expert linguists to determine how phonemes are aligned and to take the phonological system of the target language into account. For low-resource language pairs, the difficulty stems from the scarcity of linguistic resources. In this research, we present a demonstration of grapheme-to-phoneme conversion that addresses the transliteration problem for a low-resource language pair, applied to French-Vietnamese. Our system requires only a small bilingual phonetic training corpus. We obtained promising results, with a gain of +4.40% in BLEU score over the baseline system based on statistical machine translation.

2016

Named Entity Recognition and Hashtag Decomposition to Improve the Classification of Tweets
Billal Belainine | Alexsandro Fonseca | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

In social network services like Twitter, users are overwhelmed by huge amounts of social data, most of which is short, unstructured and highly noisy. Identifying accurate information in this mass of data is a hard task. Classifying tweets into an organized form helps users access the information they need more easily. Our first contribution relates to part-of-speech filtering and preprocessing of this kind of short, highly noisy data. Our second contribution concerns named entity recognition (NER) in tweets, which requires adapting existing natural language tools to the noisy and irregular language of tweets. Our third contribution involves segmentation of hashtags and semantic enrichment using a combination of relations from WordNet, which improves the performance of our classification system, including the disambiguation of named entities, abbreviations and acronyms. Graph theory is used to cluster the words extracted from WordNet and tweets, based on the idea of connected components. We test our automatic classification system with four categories: politics, economy, sports and the medical field. We evaluate and compare several automatic classification systems using some or all of the components described in our contributions, and find that filtering by part of speech and named entity recognition increases classification precision dramatically, to 77.3%. Moreover, a classification system incorporating segmentation of hashtags and semantic enrichment with two WordNet relations, synonymy and hyperonymy, increases classification precision to as much as 83.4%.
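The connected-components idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `connected_components` and the example word pairs are hypothetical, and the edges stand in for whatever synonymy or hyperonymy links would be extracted from WordNet and the tweets.

```python
from collections import defaultdict


def connected_components(edges):
    """Group words into clusters: two words share a cluster when a chain
    of relation edges links them (hypothetical helper, for illustration)."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    seen = set()
    clusters = []
    # Visit start nodes in sorted order so the output is deterministic.
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Made-up relation pairs standing in for WordNet links.
example_edges = [
    ("election", "vote"),
    ("vote", "ballot"),
    ("goal", "match"),
    ("match", "game"),
]
clusters = connected_components(example_edges)
```

Each resulting cluster could then serve as a bag of related words for one candidate category, which is one plausible way such clusters might feed a classifier.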

UQAM-NTL: Named entity recognition in Twitter messages
Ngoc Tan Le | Fatma Mallek | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

This paper describes our system for the shared task on Named Entity Recognition (NER) in Twitter at the 2nd Workshop on Noisy User-generated Text (WNUT), held in conjunction with Coling 2016. Our system is based on supervised machine learning, applying Conditional Random Fields (CRF) to train two classifiers for two evaluations. The first evaluation aims at predicting the 10 fine-grained types of named entities, while the second aims at identifying named entities without assigning a type. The experimental results show that our method significantly improves Twitter NER performance.

Lexfom: a lexical functions ontology model
Alexsandro Fonseca | Fatiha Sadat | François Lareau
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

A lexical function represents a type of relation that exists between lexical units (words or expressions) in any language. For example, antonymy is a type of relation represented by the lexical function Anti: Anti(big) = small. These relations include both paradigmatic relations, i.e. vertical relations, such as synonymy, antonymy and meronymy, and syntagmatic relations, i.e. horizontal relations, such as objective qualification (legitimate demand), subjective qualification (fruitful analysis), positive evaluation (good review) and support verbs (pay a visit, subject to an interrogation). In this paper, we present the Lexical Functions Ontology Model (lexfom) to represent lexical functions and the relations among lexical units. Lexfom is divided into four modules: lexical function representation (lfrep), lexical function family (lffam), lexical function semantic perspective (lfsem) and lexical function relations (lfrel). Moreover, we show how it can be combined with the Lexical Model for Ontologies (lemon) to transform lexical networks into semantic web formats. So far, we have implemented 100 simple and 500 complex lexical functions, and encoded about 8,000 syntagmatic and 46,000 paradigmatic relations for the French language.

2015

Multi-Dialect Machine Translation (MuDMat)
Fatiha Sadat
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection
Ngoc Tan Le | Fatiha Sadat
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

The creation of high-quality named entity annotated resources is a time-consuming and expensive process. Most gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. For Asian languages, this task remains problematic. This paper focuses on the automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced language pair. We incrementally apply different cross-projection methods over parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on the Vietnamese-French language pair show good accuracy (F-score of 94.90%) in identifying named entity pairs and building a named entity annotated parallel corpus.
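As a rough illustration of the two projection cues named in this abstract, perfect string matching and edit distance similarity, the sketch below pairs source entities with target tokens. It is an assumption-laden toy, not the paper's method: `levenshtein`, `similarity`, `project_entities`, the 0.8 threshold and the sample tokens are all hypothetical choices made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def project_entities(source_entities, target_tokens, threshold=0.8):
    """Pair each source entity with a target token.

    An exact match is tried first; otherwise the most similar token is
    accepted only if it reaches the (assumed) similarity threshold.
    """
    pairs = []
    for entity in source_entities:
        if entity in target_tokens:
            pairs.append((entity, entity))
            continue
        best = max(
            target_tokens,
            key=lambda tok: similarity(entity.lower(), tok.lower()),
            default=None,
        )
        if best is not None and similarity(entity.lower(), best.lower()) >= threshold:
            pairs.append((entity, best))
    return pairs


# Toy aligned sentence pair: entities on one side, tokens on the other.
pairs = project_entities(["Hanoi", "Canada"], ["Hanoi", "Kanada", "ville"])
```

A real pipeline would work on word-aligned parallel sentences and multi-word entities; the threshold trades recall for precision and would need tuning on held-out data.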

2014

Identifying Portuguese Multiword Expressions using Different Classification Algorithms - A Comparative Analysis
Alexsandro Fonseca | Fatiha Sadat | Alexandre Blondin Massé
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

A Comparative Study of Different Classification Methods for the Identification of Brazilian Portuguese Multiword Expressions
Alexsandro Fonseca | Fatiha Sadat
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application – the case of Tunisian Arabic and the Social Media
Fatiha Sadat | Fatma Mallek | Mohamed Boudabous | Rahma Sellami | Atefeh Farzindar
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Automatic Identification of Arabic Language Varieties and Dialects in Social Media
Fatiha Sadat | Farzindar Kazemi | Atefeh Farzindar
Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)

TALN-RECITAL 2014 Workshop TALAf 2014 : Traitement Automatique des Langues Africaines (TALAf 2014: African Language Processing)
Mathieu Mangeot | Fatiha Sadat
TALN-RECITAL 2014 Workshop TALAf 2014 : Traitement Automatique des Langues Africaines (TALAf 2014: African Language Processing)

2013

Towards a Hybrid Rule-based and Statistical Arabic-French Machine Translation System
Fatiha Sadat
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Exploiting multiple resources for Japanese to English patent translation
Rahma Sellami | Fatiha Sadat | Lamia Hadrich Belguith
Proceedings of the 5th Workshop on Patent Translation

Pre-processing and Language Analysis for Arabic to French Statistical Machine Translation (Traduction automatique statistique pour l’arabe-français améliorée par le prétraitement et l’analyse de la langue) [in French]
Fatiha Sadat | Emad Mohamed
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Extraction de lexiques bilingues à partir de Wikipédia (Bilingual lexicon extraction from Wikipedia) [in French]
Rahma Sellami | Fatiha Sadat | Lamia Hadrich Belguith
JEP-TALN-RECITAL 2012, Workshop TALAf 2012: Traitement Automatique des Langues Africaines (TALAf 2012: African Language Processing)

Exploiting Wikipedia as a Knowledge Base for the Extraction of Linguistic Resources: Application on Arabic-French Comparable Corpora and Bilingual Lexicons
Rahma Sellami | Fatiha Sadat | Lamia Hadrich Belguith
Fourth Workshop on Computational Approaches to Arabic-Script-based Languages

We present simple and effective methods for extracting comparable corpora and bilingual lexicons from Wikipedia. We exploit the large scale and the structure of Wikipedia articles to extract two resources that will be very useful for natural language applications. We build a comparable corpus from Wikipedia using categories as topic restrictions, and we extract bilingual lexicons from inter-language links aligned with either a statistical method or a combined statistical and linguistic method.

2010

Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction
Fatiha Sadat
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Exploitation de Wikipédia pour l’Enrichissement et la Construction des Ressources Linguistiques
Fatiha Sadat | Alexandre Terrasa
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

This paper presents an approach and results that use the online encyclopedia Wikipedia as a semi-structured resource of linguistic knowledge, and in particular as a comparable corpus for bilingual terminology extraction. The approach first extracts term-translation pairs from different types of Wikipedia information, links and text. The next step uses linguistic information to re-rank the terms and their relevant translations and thereby eliminate useless target terms. Preliminary evaluations on the French-English, Japanese-French and Japanese-English language pairs showed good quality in the extracted term pairs. This study is very promising for building and enriching linguistic resources such as multilingual dictionaries and ontologies. It is also very useful for cross-language information retrieval (CLIR) systems.

2006

Système de traduction automatique statistique combinant différentes ressources
Fatiha Sadat | George Foster | Roland Kuhn
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This paper describes an approach that combines different statistical models for phrase-based machine translation. To this end, several resources are used, including two parallel corpora with different characteristics and a bilingual terminology dictionary, in order to improve the quantitative and qualitative performance of the translation system. We evaluate our approach on the French-English language pair and show how the combination of the proposed resources significantly improves the results.

Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani | Fred Popowich | Fatiha Sadat
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT

Combination of Arabic Preprocessing Schemes for Statistical Machine Translation
Fatiha Sadat | Nizar Habash
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

PORTAGE: with Smoothed Phrase Tables and Segment Choice Models
Howard Johnson | Fatiha Sadat | George Foster | Roland Kuhn | Michel Simard | Eric Joanis | Samuel Larkin
Proceedings on the Workshop on Statistical Machine Translation

Arabic Preprocessing Schemes for Statistical Machine Translation
Nizar Habash | Fatiha Sadat
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2005

PORTAGE: A Phrase-Based Machine Translation System
Fatiha Sadat | Howard Johnson | Akakpo Agbago | George Foster | Roland Kuhn | Joel Martin | Aaron Tikuisis
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2003

Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval
Fatiha Sadat | Masatoshi Yoshikawa | Shunsuke Uemura
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics

Learning Bilingual Translations from Comparable Corpora to Cross-Language Information Retrieval: Hybrid Statistics-based and Linguistics-based Approach
Fatiha Sadat | Masatoshi Yoshikawa | Shunsuke Uemura
Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages

2002

An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction
Hervé Déjean | Éric Gaussier | Fatiha Sadat
COLING 2002: The 19th International Conference on Computational Linguistics