2023
pdf
bib
abs
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Mehrad Moradshahi
|
Tianhao Shen
|
Kalika Bali
|
Monojit Choudhury
|
Gael de Chalendar
|
Anmol Goel
|
Sungkyun Kim
|
Prashant Kodali
|
Ponnurangam Kumaraguru
|
Nasredine Semmar
|
Sina Semnani
|
Jiwon Seo
|
Vivek Seshadri
|
Manish Shrivastava
|
Michael Sun
|
Aditya Yadavalli
|
Chaobin You
|
Deyi Xiong
|
Monica Lam
Findings of the Association for Computational Linguistics: ACL 2023
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
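The dictionary-based half of the entity alignment step can be illustrated with a minimal sketch; the function, example data and dictionary below are illustrative assumptions, not the released X-RiSAWOZ toolset.

```python
# Minimal sketch of dictionary-based entity alignment for post-editing translated
# dialogue data. All names here are illustrative; the actual toolset also combines
# this idea with neural alignment, which is not shown.

def align_entities(translated_turn: str,
                   source_entities: list[str],
                   entity_dictionary: dict[str, str]) -> str:
    """Replace source-language entities leaked into the MT output with their
    target-language forms from a bilingual entity dictionary."""
    for entity in source_entities:
        target_form = entity_dictionary.get(entity)
        if target_form and entity in translated_turn:
            # Keep slot values consistent between the utterance and its annotation.
            translated_turn = translated_turn.replace(entity, target_form)
    return translated_turn


if __name__ == "__main__":
    dictionary = {"天津": "Tianjin"}  # hypothetical dictionary entry
    print(align_entities("I want to go to 天津", ["天津"], dictionary))
```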
pdf
bib
abs
Intégration de connaissances structurées par synthèse de texte spécialisé
Guilhem Piat
|
Ellington Kirby
|
Julien Tourille
|
Nasredine Semmar
|
Alexandre Allauzen
|
Hassane Essafi
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
Transformer-based language models struggle to incorporate modifications intended to integrate non-textual structured data formats such as knowledge graphs. Successful examples of such integration generally require that the named entity disambiguation problem be solved beforehand, or the addition of a large amount of training text, usually annotated. These constraints make exploiting structured knowledge as a data source difficult and sometimes even counter-productive. We seek to adapt a language model to the biomedical domain by training it on synthetic text generated from a knowledge graph, so as to exploit this information through a modality the language model already masters.
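As an illustration of synthesising training text from a knowledge graph, here is a minimal template-based sketch; the templates and triples are hypothetical placeholders, not the paper's generation method.

```python
# Hedged sketch of turning knowledge-graph triples into synthetic training text
# for domain-adaptive language-model training; templates and triples are
# illustrative placeholders.
TEMPLATES = {
    "treats": "{subject} is used to treat {object}.",
    "is_a": "{subject} is a kind of {object}.",
}

def verbalise(triples):
    """triples: iterable of (subject, relation, object) tuples from the graph."""
    for subject, relation, obj in triples:
        template = TEMPLATES.get(relation, "{subject} {relation} {object}.")
        yield template.format(subject=subject,
                              relation=relation.replace("_", " "),
                              object=obj)

# Example with a hypothetical triple:
# list(verbalise([("aspirin", "treats", "headache")]))
# -> ["aspirin is used to treat headache."]
```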
2021
pdf
bib
abs
On the Hidden Negative Transfer in Sequential Transfer Learning for Domain Adaptation from News to Tweets
Sara Meftah
|
Nasredine Semmar
|
Youssef Tamaazousti
|
Hassane Essafi
|
Fatiha Sadat
Proceedings of the Second Workshop on Domain Adaptation for NLP
Transfer learning has been shown to be a powerful tool for Natural Language Processing (NLP) and has outperformed the standard supervised learning paradigm, as it benefits from pre-learned knowledge. Nevertheless, when transfer is performed between less related domains, it can cause negative transfer, i.e. it hurts transfer performance. In this research, we shed light on the hidden negative transfer occurring when transferring from the News domain to the Tweets domain, through quantitative and qualitative analysis. Our experiments on three NLP tasks: Part-Of-Speech tagging, Chunking and Named Entity Recognition, reveal interesting insights.
2020
pdf
bib
abs
Multi-Task Supervised Pretraining for Neural Domain Adaptation
Sara Meftah
|
Nasredine Semmar
|
Mohamed-Ayoub Tahiri
|
Youssef Tamaazousti
|
Hassane Essafi
|
Fatiha Sadat
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media
Two prevalent transfer learning approaches are used in recent works to improve neural network performance for domains with small amounts of annotated data: multi-task learning, which involves training the task of interest with related auxiliary tasks to exploit their underlying similarities, and mono-task fine-tuning, where the weights of the model are initialized with the pretrained weights of a large-scale labeled source domain and then fine-tuned with labeled data of the target domain (the domain of interest). In this paper, we propose a new approach that takes advantage of both: a hierarchical model is trained across multiple tasks from a source domain and then fine-tuned on multiple tasks of the target domain. Our experiments on four tasks applied to the social media domain show that our proposed approach leads to significant improvements on all tasks compared to both approaches.
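A minimal PyTorch sketch of the general recipe: a shared encoder with one classification head per task, trained on source-domain tasks and then fine-tuned on target-domain tasks. Layer sizes, task names and the training loop are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size: int, hidden: int, task_label_sizes: dict):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        # One head per task; the encoder (and embeddings) are shared across tasks.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(2 * hidden, n_labels)
             for task, n_labels in task_label_sizes.items()})

    def forward(self, token_ids: torch.Tensor, task: str) -> torch.Tensor:
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # per-token logits for the requested task

def train_phase(model, batches, tasks, epochs=1, lr=1e-3):
    """Same loop for source-domain multi-task pretraining and target fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for task in tasks:
            for token_ids, labels in batches[task]:
                logits = model(token_ids, task)
                loss = loss_fn(logits.flatten(0, 1), labels.flatten())
                opt.zero_grad(); loss.backward(); opt.step()
```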
2019
pdf
bib
abs
Joint Learning of Pre-Trained and Random Units for Domain Adaptation in Part-of-Speech Tagging
Sara Meftah
|
Youssef Tamaazousti
|
Nasredine Semmar
|
Hassane Essafi
|
Fatiha Sadat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Fine-tuning neural networks is widely used to transfer valuable knowledge from high-resource to low-resource domains. In a standard fine-tuning scheme, source and target problems are trained using the same architecture. Although capable of adapting to new domains, pre-trained units struggle with learning uncommon target-specific patterns. In this paper, we propose to augment the target network with normalised, weighted and randomly initialised units that enable better adaptation while preserving the valuable source knowledge. Our experiments on POS tagging of social media texts (Tweets domain) demonstrate that our method achieves state-of-the-art performance on 3 commonly used datasets.
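A minimal PyTorch sketch of the underlying idea, concatenating a pre-trained unit with a randomly initialised one and normalising the result; dimensions and the choice of normalisation are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AugmentedLayer(nn.Module):
    """Concatenates the output of a pre-trained unit with the output of a fresh,
    randomly initialised unit, so target-specific patterns can be learned without
    overwriting the transferred knowledge."""

    def __init__(self, in_dim: int, pretrained_dim: int, random_dim: int):
        super().__init__()
        self.pretrained = nn.Linear(in_dim, pretrained_dim)   # weights loaded from the source model
        self.random = nn.Linear(in_dim, random_dim)            # trained from scratch on the target
        self.norm = nn.LayerNorm(pretrained_dim + random_dim)  # keeps both parts on a similar scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        combined = torch.cat([self.pretrained(x), self.random(x)], dim=-1)
        return self.norm(combined)
```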
pdf
bib
abs
Exploration de l’apprentissage par transfert pour l’analyse de textes des réseaux sociaux (Exploring neural transfer learning for social media text analysis)
Sara Meftah
|
Nasredine Semmar
|
Youssef Tamaazousti
|
Hassane Essafi
|
Fatiha Sadat
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts
Transfer learning is the ability of a neural model trained on one task to generalise well enough to produce relevant results on another, related but different, task. In this article we present a transfer learning-based approach for automatically building analysis tools for social media texts by exploiting the similarities between texts in a well-resourced language (the standard form of a language) and texts in a low-resourced language (the language used on social media). We experimented with our approach on several languages and on three linguistic annotation tasks (morpho-syntactic tagging, part-of-speech annotation and named entity recognition). The results obtained are very satisfactory and show the value of transfer learning for taking advantage of deep neural models without the constraint of needing the large amount of data normally required for acceptable performance.
2018
pdf
bib
abs
A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts
Wafia Adouane
|
Simon Dobnik
|
Jean-Philippe Bernardy
|
Nasredine Semmar
Proceedings of the Second Workshop on Subword/Character LEvel Models
This paper examines the effect of including background knowledge in the form of a character-level pre-trained neural language model (LM), and of data bootstrapping, to overcome the problem of unbalanced, limited resources. As a test case, we explore the task of language identification in mixed-language short non-edited texts in an under-resourced language, namely Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that overall the DNNs perform better on labelled data for the majority categories and struggle with the minority ones. The effect of the untokenised, unlabelled data encoded as an LM differs for each category, whereas bootstrapping improves the performance of all systems on all categories. These methods are language independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.
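The bootstrapping (self-training) component can be sketched as follows; the classifier, features and confidence threshold are illustrative choices, not the paper's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bootstrap(labelled_texts, labels, unlabelled_texts, rounds=3, threshold=0.9):
    """Iteratively add confidently-predicted unlabelled texts to the training set."""
    texts, y = list(labelled_texts), list(labels)
    pool = list(unlabelled_texts)
    for _ in range(rounds):
        model = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
            LogisticRegression(max_iter=1000))
        model.fit(texts, y)
        if not pool:
            break
        probs = model.predict_proba(pool)
        keep, rest = [], []
        for text, p in zip(pool, probs):
            if p.max() >= threshold:  # keep only confident predictions
                keep.append((text, model.classes_[p.argmax()]))
            else:
                rest.append(text)
        texts += [t for t, _ in keep]
        y += [lab for _, lab in keep]
        pool = rest
    return model
```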
pdf
bib
abs
Using Neural Transfer Learning for Morpho-syntactic Tagging of South-Slavic Languages Tweets
Sara Meftah
|
Nasredine Semmar
|
Fatiha Sadat
|
Stephan Raaijmakers
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
In this paper, we describe a morpho-syntactic tagger of tweets, an important component of the CEA List DeepLIMA tool, a multilingual text analysis platform based on deep learning. This tagger was built for the Morpho-syntactic Tagging of Tweets (MTT) shared task of the 2018 VarDial Evaluation Campaign. The MTT task focuses on morpho-syntactic annotation of non-canonical Twitter varieties of three South-Slavic languages: Slovene, Croatian and Serbian. We propose to use a neural network model trained in an end-to-end manner for the three languages without any need for task- or domain-specific feature engineering. The proposed approach combines both character- and word-level representations. Considering the lack of annotated data in the social media domain for South-Slavic languages, we have also implemented a cross-domain Transfer Learning (TL) approach to exploit any available related out-of-domain annotated data.
pdf
bib
A Hybrid Approach for Automatic Extraction of Bilingual Multiword Expressions from Parallel Corpora
Nasredine Semmar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
bib
A Neural Network Model for Part-Of-Speech Tagging of Social Media Texts
Sara Meftah
|
Nasredine Semmar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
pdf
bib
Automatic Identification of Maghreb Dialects Using a Dictionary-Based Approach
Houda Saâdane
|
Hosni Seffih
|
Christian Fluhr
|
Khalid Choukri
|
Nasredine Semmar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
pdf
bib
abs
Building Multiword Expressions Bilingual Lexicons for Domain Adaptation of an Example-Based Machine Translation System
Nasredine Semmar
|
Mariama Laib
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
We describe in this paper a hybrid approach to automatically build bilingual lexicons of Multiword Expressions (MWEs) from parallel corpora. We more specifically investigate the impact of using a domain-specific bilingual lexicon of MWEs on domain adaptation of an Example-Based Machine Translation (EBMT) system. We conducted experiments on the English-French language pair and two kinds of texts: in-domain texts from Europarl (European Parliament proceedings) and out-of-domain texts from Emea (European Medicines Agency documents) and Ecb (European Central Bank corpus). The obtained results indicate that integrating domain-specific bilingual lexicons of MWEs improves the translation quality of the EBMT system when the texts to translate are related to the specific domain, and induces a relatively slight deterioration of translation quality when translating general-purpose texts.
pdf
bib
abs
Une approche hybride pour la construction de lexiques bilingues d’expressions multi-mots à partir de corpus parallèles (A hybrid approach to build bilingual lexicons of multiword expressions from parallel corpora)
Nasredine Semmar
|
Morgane Marchand
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts
Multiword expressions play an important role in various Natural Language Processing applications such as machine translation and cross-lingual information retrieval. This article, on the one hand, describes a hybrid approach for acquiring a bilingual lexicon of multiword expressions from an English-French parallel corpus, and on the other hand, presents the impact of using a specialised bilingual lexicon of multiword expressions produced by this approach on the results of the open-source statistical machine translation system Moses. We explored two co-occurrence-based metrics to evaluate the alignment links between source- and target-language multiword expressions. The results obtained show that the metric using a seed bilingual dictionary of single words improves both the quality of multiword expression alignment and that of translation.
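A minimal sketch of one such co-occurrence metric, the Dice coefficient computed over sentence-aligned pairs; the seed-dictionary refinement described in the paper is not shown.

```python
from collections import Counter

def dice_scores(aligned_sentence_pairs, source_mwes, target_mwes):
    """Score candidate (source MWE, target MWE) alignments by co-occurrence.

    aligned_sentence_pairs: iterable of (source_sentence, target_sentence) strings.
    """
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_sentence_pairs:
        present_src = [m for m in source_mwes if m in src_sent]
        present_tgt = [m for m in target_mwes if m in tgt_sent]
        src_count.update(present_src)
        tgt_count.update(present_tgt)
        joint_count.update((s, t) for s in present_src for t in present_tgt)
    # Dice coefficient: 2 * joint frequency / (source frequency + target frequency)
    return {(s, t): 2 * joint / (src_count[s] + tgt_count[t])
            for (s, t), joint in joint_count.items()}
```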
pdf
bib
Une approche fondée sur les lexiques d’analyse de sentiments du dialecte algérien [A lexicon-based approach for sentiment analysis in the Algerian dialect]
Imane Guellil
|
Faical Azouaou
|
Houda Saâdane
|
Nasredine Semmar
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]
2016
pdf
bib
abs
Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning
Wafia Adouane
|
Nasredine Semmar
|
Richard Johansson
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
The identification of the language of text/speech input is the first step towards properly performing any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). As a well-studied field since the early 1960s, various methods have been applied to many standard languages. Standard ALI methods require datasets for training and use character/word-based n-gram models. However, social media and new technologies have contributed to the rise of informal and minority languages on the Web. State-of-the-art automatic language identifiers fail to properly identify many of them. Romanized Arabic (RA) and Romanized Berber (RB) are cases of these informal languages which are under-resourced. The goal of this paper is twofold: detect RA and RB, at the document level, as separate languages and distinguish between them as they coexist in North Africa. We consider the task as a classification problem and use supervised machine learning to solve it. For both languages, character-based 5-grams combined with additional lexicons score best, with F-scores of 99.75% and 97.77% for RB and RA respectively.
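A character 5-gram identifier of the kind evaluated here can be sketched in a few lines; the linear SVM and the omission of the lexicon features are simplifying assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def build_identifier():
    """Character 5-gram features feeding a linear classifier."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5)),  # character 5-grams
        LinearSVC())

# Usage sketch:
# identifier = build_identifier()
# identifier.fit(train_texts, train_labels)
# identifier.predict(["ghir hna f dzayer"])  # hypothetical Romanized input
```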
pdf
bib
abs
Automatic Detection of Arabicized Berber and Arabic Varieties
Wafia Adouane
|
Nasredine Semmar
|
Richard Johansson
|
Victoria Bobicev
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the first necessary step before any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of a written format for these spoken languages, based on pronunciation. The latter are not well represented on the Web (they are commonly referred to as under-resourced languages), and currently available ALI tools fail to properly recognize them. In this paper, we revisit the problem of ALI with a focus on Arabicized Berber and dialectal Arabic short texts. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.
pdf
bib
abs
ASIREM Participation at the Discriminating Similar Languages Shared Task 2016
Wafia Adouane
|
Nasredine Semmar
|
Richard Johansson
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
This paper presents the system built by the ASIREM team for the Discriminating between Similar Languages (DSL) shared task 2016. The system uses character-based and word-based n-grams separately. ASIREM participated in both sub-tasks (sub-task 1 and sub-task 2) and in both open and closed tracks. For sub-task 1, which deals with discriminating between similar languages and national language varieties, the system achieved an accuracy of 87.79% on the closed track, ending up ninth (the best result being 89.38%). In sub-task 2, which deals with Arabic dialect identification, the system achieved its best performance using character-based n-grams (49.67% accuracy), ranking fourth in the closed track (the best result being 51.16%), and an accuracy of 53.18%, ranking first in the open track.
pdf
bib
abs
Etude de l’impact d’un lexique bilingue spécialisé sur la performance d’un moteur de traduction à base d’exemples (Studying the impact of a specialized bilingual lexicon on the performance of an example-based machine translation engine)
Nasredine Semmar
|
Othman Zennaki
|
Meriama Laib
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
Statistical machine translation, although effective, is currently limited because it requires large volumes of parallel corpora, which do not exist for all language pairs and all specialised domains and are slow and costly to produce. In this article, we present a prototype of an example-based machine translation engine that uses cross-lingual information retrieval and only requires a corpus of texts in the target language. More specifically, we propose to study the impact of a specialised bilingual lexicon on the performance of this prototype. We evaluate this translation prototype and compare its results to those of the statistical machine translation system Moses, using the English-French parallel corpora Europarl (European Parliament Proceedings) and Emea (European Medicines Agency Documents). The results obtained show that the BLEU score of the example-based machine translation prototype is close to that of Moses on documents from the Europarl corpus and better on documents extracted from the Emea corpus.
pdf
bib
abs
Projection Interlingue d’Étiquettes pour l’Annotation Sémantique Non Supervisée (Cross-lingual Annotation Projection for Unsupervised Semantic Tagging)
Othman Zennaki
|
Nasredine Semmar
|
Laurent Besacier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
Our work focuses on the rapid construction of linguistic analysis tools for low-resourced languages. In a previous contribution, we proposed a method for automatically building a morpho-syntactic analyser via cross-lingual projection of linguistic annotations from parallel corpora (a method based on recurrent neural networks). In this article, we present an improvement of our neural model that takes external linguistic information into account to build a more complex annotator. In particular, we propose to integrate morpho-syntactic annotations into our neural architecture for the unsupervised learning of coarse-grained multilingual semantic annotators (SuperSense annotation). We show the validity and genericity of our method on Italian and French and also study the impact of the quality of the parallel corpus (produced by manual or automatic translation) on our approach. Our experiments concern the projection of annotations from English to French and Italian.
pdf
bib
abs
Inducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks
Othman Zennaki
|
Nasredine Semmar
|
Laurent Besacier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
This work focuses on the development of linguistic analysis tools for resource-poor languages. We use a parallel corpus to produce a multilingual word representation based only on sentence-level alignment. This representation is combined with the annotated source side (resource-rich language) of the parallel corpus to train text analysis tools for resource-poor languages. Our approach is based on Recurrent Neural Networks (RNN) and has the following advantages: (a) it does not use word alignment information, (b) it does not assume any knowledge about foreign languages, which makes it applicable to a wide range of resource-poor languages, (c) it provides truly multilingual taggers. In a previous study, we proposed a method based on a Simple RNN to automatically induce a Part-Of-Speech (POS) tagger. In this paper, we propose an improvement of our neural model. We investigate the Bidirectional RNN and the inclusion of external information (for instance, low-level information from Part-Of-Speech tags) in the RNN to train a more complex tagger (for instance, a multilingual super sense tagger). We demonstrate the validity and genericity of our method by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual POS and super sense taggers.
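The key property, that a tagger reading a shared multilingual word representation can be trained on the annotated source side and applied unchanged to other languages, can be sketched as follows; the representation itself is treated as given and all sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossLingualTagger(nn.Module):
    """Bidirectional RNN tagger over language-independent word vectors."""

    def __init__(self, shared_dim: int, hidden: int, n_tags: int):
        super().__init__()
        self.rnn = nn.GRU(shared_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, shared_word_vectors: torch.Tensor) -> torch.Tensor:
        # shared_word_vectors: (batch, seq_len, shared_dim), same space for all languages
        states, _ = self.rnn(shared_word_vectors)
        return self.out(states)  # per-token tag logits

# Train on (shared vectors of annotated source-language sentences, their tags);
# at test time, feed shared vectors of target-language sentences to the same model.
```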
2015
pdf
bib
Unsupervised and Lightly Supervised Part-of-Speech Tagging Using Recurrent Neural Networks
Othman Zennaki
|
Nasredine Semmar
|
Laurent Besacier
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation
pdf
bib
Improving the Performance of an Example-Based Machine Translation System Using a Domain-specific Bilingual Lexicon
Nasredine Semmar
|
Othman Zennaki
|
Meriama Laib
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters
pdf
bib
Evaluating the Impact of Using a Domain-specific Bilingual Lexicon on the Performance of a Hybrid Machine Translation Approach
Nasredine Semmar
|
Othman Zennaki
|
Meriama Laib
Proceedings of the International Conference Recent Advances in Natural Language Processing
pdf
bib
abs
Utilisation des réseaux de neurones récurrents pour la projection interlingue d’étiquettes morpho-syntaxiques à partir d’un corpus parallèle
Othman Zennaki
|
Nasredine Semmar
|
Laurent Besacier
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
The construction of linguistic analysis tools for low-resourced languages is limited, among other things, by the lack of annotated corpora. In this article, we propose a method for automatically building analysis tools via cross-lingual projection of linguistic annotations using parallel corpora. Our approach uses no other sources of information, which makes it applicable to a wide range of low-resourced languages. We propose to use recurrent neural networks to project annotations from one language to another (without using word alignment information). As a first step, we explore the morpho-syntactic annotation task. Our method, combined with a basic annotation projection method (using word-to-word alignment), gives results comparable to the state of the art on a similar task.
2014
pdf
bib
Using cross-language information retrieval and statistical language modelling in example-based machine translation
Nasredine Semmar
|
Othman Zennaki
|
Meriama Laib
Proceedings of Translating and the Computer 36
pdf
bib
Study of the impact of proper name transliteration on the performance of word alignment in French-Arabic parallel corpora (Etude de l’impact de la translittération de noms propres sur la qualité de l’alignement de mots à partir de corpus parallèles français-arabe) [in French]
Nasredine Semmar
|
Houda Saadane
Proceedings of TALN 2014 (Volume 1: Long Papers)
2013
pdf
bib
Towards a Generic Approach for Bilingual Lexicon Extraction from Comparable Corpora
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of Machine Translation Summit XIV: Papers
pdf
bib
Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Dhouha Bouamor
|
Adrian Popescu
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
pdf
bib
Using WordNet and Semantic Similarity for Bilingual Terminology Mining from Comparable Corpora
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
pdf
bib
(Utilisation de la similarité sémantique pour l’extraction de lexiques bilingues à partir de corpus comparables) [in French]
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of TALN 2013 (Volume 1: Long Papers)
pdf
bib
Building Specialized Bilingual Lexicons Using Word Sense Disambiguation
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the Sixth International Joint Conference on Natural Language Processing
pdf
bib
Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment
Nasredine Semmar
|
Houda Saadane
Proceedings of the Sixth International Joint Conference on Natural Language Processing
pdf
bib
Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
2012
pdf
bib
Automatic Construction of a MultiWord Expressions Bilingual Lexicon: A Statistical Machine Translation Evaluation Perspective
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon
pdf
bib
abs
Using Arabic Transliteration to Improve Word Alignment from French- Arabic Parallel Corpora
Houda Saadane
|
Ouafa Benterki
|
Nasredine Semmar
|
Christian Fluhr
Fourth Workshop on Computational Approaches to Arabic-Script-based Languages
In this paper, we focus on the use of Arabic transliteration to improve the results of a linguistics-based word alignment approach from parallel text corpora. This approach uses, on the one hand, a bilingual lexicon, named entities, cognates and grammatical tags to align single words, and on the other hand, syntactic dependency relations to align compound words. We have evaluated the word aligner integrating Arabic transliteration using two methods: a manual evaluation of the alignment quality and an evaluation of the impact of this alignment on translation quality using the Moses statistical machine translation system. The obtained results show that Arabic transliteration improves the quality of both alignment and translation.
pdf
bib
Utilisation de la translittération arabe pour l’amélioration de l’alignement de mots à partir de corpus parallèles français-arabe (Using Arabic Transliteration to Improve Word Alignment from French-Arabic Parallel Corpora) [in French]
Houda Saadane
|
Nasredine Semmar
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN
pdf
bib
abs
Identifying bilingual Multi-Word Expressions for Statistical Machine Translation
Dhouha Bouamor
|
Nasredine Semmar
|
Pierre Zweigenbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
MultiWord Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a French-English parallel corpus. In addition, we introduce three methods aiming to integrate extracted bilingual MWEs in Moses, a phrase-based Statistical Machine Translation (SMT) system. We experimentally show that these textual units can improve translation quality.
2010
pdf
bib
A hybrid word alignment approach to improve translation lexicons with compound words and idiomatic expressions
Nasredine Semmar
|
Christophe Servan
|
Gaël de Chalendar
|
Benoît Le Ny
|
Jean-Jacques Bouzaglou
Proceedings of Translating and the Computer 32
pdf
bib
abs
LIMA : A Multilingual Framework for Linguistic Analysis and Linguistic Resources Development and Evaluation
Romaric Besançon
|
Gaël de Chalendar
|
Olivier Ferret
|
Faiza Gara
|
Olivier Mesnard
|
Meriama Laïb
|
Nasredine Semmar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The increasing amount of available textual information makes the use of Natural Language Processing (NLP) tools necessary. These tools have to be used on large collections of documents in different languages. But NLP is a complex task that relies on many processes and resources. As a consequence, NLP tools must be both configurable and efficient: specific software architectures must be designed for this purpose. We present in this paper the LIMA multilingual analysis platform, developed at CEA LIST. This configurable platform has been designed to develop NLP-based industrial applications while keeping enough flexibility to integrate various processes and resources. This design makes LIMA a linguistic analyzer that can handle languages as different as French, English, German, Arabic or Chinese. Beyond its architecture principles and its capabilities as a linguistic analyzer, LIMA also offers a set of tools dedicated to the testing and evaluation of linguistic modules and to the production and management of new linguistic resources.
pdf
bib
abs
MLIF : A Metamodel to Represent and Exchange Multilingual Textual Information
Samuel Cruz-Lara
|
Gil Francopoulo
|
Laurent Romary
|
Nasredine Semmar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The fast evolution of language technology has produced pressing needs in standardization. The multiplicity of language resource representation levels and the specialization of these representations make the interaction between linguistic resources and the components manipulating them difficult. In this paper, we describe the MultiLingual Information Framework (MLIF ― ISO CD 24616). MLIF is a metamodel which allows the representation and exchange of multilingual textual information. This generic metamodel is designed to provide a common platform for all the tools developed around the existing multilingual data exchange formats. The platform provides, on the one hand, a set of generic data categories for various application domains, and on the other hand, strategies for interoperability with existing standards. The objective is to reach a better convergence between the heterogeneous standardization activities taking place in the domains of data modeling (XML; W3C), text management (TEI; TEIC), multilingual information (TMX-LISA; XLIFF-OASIS) and multimedia (SMILText; W3C). This is work in progress within ISO-TC37 to define a new ISO standard.
2007
pdf
bib
abs
Utilisation d’une approche basée sur la recherche cross-lingue d’information pour l’alignement de phrases à partir de textes bilingues Arabe-Français
Nasredine Semmar
|
Christian Fluhr
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Sentence alignment from bilingual texts consists in recognising which sentences are translations of one another. This article presents a new approach to aligning the sentences of a parallel corpus. This approach is based on cross-lingual information retrieval and consists in building a database of the sentences of the target text and treating each sentence of the source text as a query against that database. The cross-lingual search uses a linguistic analyser and a search engine. The linguistic analyser processes both the documents to be indexed and the queries, and produces a set of normalised lemmas, a set of named entities and a set of compound words with their morpho-syntactic tags. The search engine builds inverted files for the documents based on their linguistic analysis and retrieves the relevant documents from these indexes. The sentence aligner was evaluated on an Arabic-French parallel corpus and the results obtained show that 97% of the sentences were correctly aligned.
pdf
bib
Arabic to French Sentence Alignment: Exploration of A Cross-language Information Retrieval Approach
Nasredine Semmar
|
Christian Fluhr
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources
2006
pdf
bib
abs
Using Stemming in Morphological Analysis to Improve Arabic Information Retrieval
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Information retrieval (IR) consists in finding all relevant documents for a user query in a collection of documents. These documents are ordered by the probability of being relevant to the user’s query. The highest ranked document is considered to be the most likely relevant document. Natural Language Processing (NLP) for IR aims to transform the potentially ambiguous words of queries and documents into unambiguous internal representations on which matching and retrieval can take place. This transformation is generally achieved by several levels of linguistic analysis, morphological, syntactic and so forth. In this paper, we present the Arabic linguistic analyzer used in the LIC2M cross-lingual search engine. We focus on the morphological analyzer and particularly the clitic stemmer which segments the input words into proclitics, simple forms and enclitics. We demonstrate that stemming improves search engine recall and precision.
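A clitic stemmer of the kind described can be sketched as a longest-match segmentation; the clitic lists below are a small illustrative subset, and the real analyzer validates the remaining stem against a lexicon rather than accepting any split.

```python
# Illustrative (partial) lists of Arabic proclitics and enclitics, longest first.
PROCLITICS = ["وال", "بال", "فال", "كال", "لل", "ال", "و", "ف", "ب", "ك", "ل"]
ENCLITICS = ["هما", "كما", "هم", "هن", "ها", "نا", "كم", "كن", "ني", "ه", "ك", "ي"]

def segment(word: str):
    """Return (proclitic, stem, enclitic); empty strings when no clitic is found."""
    proclitic = next((p for p in PROCLITICS
                      if word.startswith(p) and len(word) > len(p) + 1), "")
    rest = word[len(proclitic):]
    enclitic = next((e for e in ENCLITICS
                     if rest.endswith(e) and len(rest) > len(e) + 1), "")
    stem = rest[:len(rest) - len(enclitic)] if enclitic else rest
    return proclitic, stem, enclitic

# Example under these simplified lists: segment("بالكتاب") -> ("بال", "كتاب", "")
```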
pdf
bib
abs
Using Cross-language Information Retrieval for Sentence Alignment
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Cross-language information retrieval consists in providing a query in one language and searching documents in different languages. Retrieved documents are ordered by the probability of being relevant to the user's request with the highest ranked being considered the most relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents to be indexed. This system, designed to work on Arabic, Chinese, English, French, German and Spanish, is composed of a multilingual linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The multilingual linguistic analyzer includes a morphological analyzer, a part-of-speech tagger and a syntactic analyzer. In the case of Arabic, a clitic stemmer is added to the morphological analyzer to segment the input words into proclitics, simple forms and enclitics. The linguistic analyzer processes both documents to be indexed and queries to produce a set of normalized lemmas, a set of named entities and a set of nominal compounds with their morpho-syntactic tags. The statistical analyzer computes for documents to be indexed concept weights based on concept database frequencies. The comparator computes intersections between queries and documents and provides a relevance weight for each intersection. Before this comparison, the reformulator expands queries during the search. The expansion is used to infer from the original query words other words expressing the same concepts. The expansion can be in the same language or in different languages. The search engine retrieves the ranked, relevant documents from the indexes according to the corresponding reformulated query and then merges the results obtained for each language, taking into account the original words of the query and their weights in order to score the documents. Sentence alignment consists in estimating which sentence or sentences in the source language correspond with which sentence or sentences in a target language. We present in this paper a new approach to aligning sentences from a parallel corpora based on the LIC2M cross-language information retrieval system. This approach consists in building a database of sentences of the target text and considering each sentence of the source text as a "query" to that database. The aligned bilingual parallel corpora can be used as a translation memory in a computer-aided translation tool.
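The retrieval-based alignment idea, indexing the target-side sentences and treating each source sentence as a query, can be sketched as follows; the bag-of-words index and bilingual dictionary lookup stand in for the LIC2M linguistic analysis and reformulation, which are not reproduced here.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def align(source_sentences, target_sentences, src_to_tgt_dictionary):
    """Return (source index, best target index) pairs."""
    # "Index" the target sentences as bags of words.
    index = [Counter(s.lower().split()) for s in target_sentences]
    alignments = []
    for i, src in enumerate(source_sentences):
        # Reformulate the source sentence into target-language words (the "query").
        query = Counter(src_to_tgt_dictionary.get(w, w) for w in src.lower().split())
        best = max(range(len(index)), key=lambda j: cosine(query, index[j]))
        alignments.append((i, best))
    return alignments
```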
pdf
bib
abs
A Deep Linguistic Analysis for Cross-language Information Retrieval
Nasredine Semmar
|
Meriama Laib
|
Christian Fluhr
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Cross-language information retrieval consists in providing a query in one language and searching documents in one or several languages. These documents are ordered by the probability of being relevant to the user's request. The highest ranked document is considered to be the most likely relevant document. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents. This system is composed of a linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The linguistic analysis processes both documents to be indexed and queries to extract concepts representing their content. This analysis includes morphological analysis, part-of-speech tagging and syntactic analysis. In this paper, we present the deep linguistic analysis used in the LIC2M cross-lingual search engine and focus particularly on the impact of the syntactic analysis on retrieval effectiveness.
pdf
bib
abs
Evaluation of multilingual text alignment systems: the ARCADE II project
Yun-Chuang Chiao
|
Olivier Kraif
|
Dominique Laurent
|
Thi Minh Huyen Nguyen
|
Nasredine Semmar
|
François Stuck
|
Jean Véronis
|
Wajdi Zaghouani
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The project aims at exploring the techniques of multilingual text alignment through a fine-grained evaluation of existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted to the evaluation of alignment at the sentence and word level respectively. It differs from ARCADE I in its multilingual aspect and in the investigation of lexical alignment.
2005
pdf
bib
Modifying a Natural Language Processing System for European Languages to Treat Arabic in Information Processing and Information Retrieval Applications
Gregory Grefenstette
|
Nasredine Semmar
|
Faïza Elkateb-Gara
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages