2017
Building Multiword Expressions Bilingual Lexicons for Domain Adaptation of an Example-Based Machine Translation System
Nasredine Semmar | Mariama Laib
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
We describe in this paper a hybrid approach for automatically building bilingual lexicons of Multiword Expressions (MWEs) from parallel corpora. We more specifically investigate the impact of using a domain-specific bilingual lexicon of MWEs on domain adaptation of an Example-Based Machine Translation (EBMT) system. We conducted experiments on the English-French language pair and two kinds of texts: in-domain texts from Europarl (European Parliament proceedings) and out-of-domain texts from Emea (European Medicines Agency documents) and Ecb (European Central Bank corpus). The results indicate that integrating domain-specific bilingual lexicons of MWEs improves the translation quality of the EBMT system when the texts to translate are related to the specific domain, and causes only a slight deterioration of translation quality when translating general-purpose texts.
2016
Etude de l’impact d’un lexique bilingue spécialisé sur la performance d’un moteur de traduction à base d’exemples (Studying the impact of a specialized bilingual lexicon on the performance of an example-based machine translation engine)
Nasredine Semmar | Othman Zennaki | Meriama Laib
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
Although statistical machine translation performs well, it is limited today because it requires large volumes of parallel corpora, which do not exist for all language pairs and all specialized domains and whose production is slow and costly. In this paper, we present a prototype of an example-based machine translation engine that uses cross-lingual information retrieval and requires only a corpus of texts in the target language. More specifically, we propose to study the impact of a specialized bilingual lexicon on the performance of this prototype. We evaluate this translation prototype and compare its results to those of the statistical machine translation system Moses, using the English-French parallel corpora Europarl (European Parliament Proceedings) and Emea (European Medicines Agency Documents). The results show that the BLEU score of the example-based machine translation prototype is close to that of Moses on documents from the Europarl corpus and better on documents extracted from the Emea corpus.
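The comparison above is reported as corpus-level BLEU on Europarl and Emea test documents. As a minimal sketch of how such a comparison can be scored, the snippet below uses the sacrebleu Python library; the file names and the assumption of one plain-text hypothesis file per system with a single reference file are illustrative, not taken from the paper.

# Minimal BLEU comparison sketch (illustrative file names, one sentence per line).
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

references = read_lines("emea.test.fr")         # reference translations
ebmt_hyps = read_lines("emea.test.ebmt.fr")     # example-based prototype output
moses_hyps = read_lines("emea.test.moses.fr")   # Moses output

# sacrebleu expects a list of reference streams (here a single reference).
ebmt_bleu = sacrebleu.corpus_bleu(ebmt_hyps, [references])
moses_bleu = sacrebleu.corpus_bleu(moses_hyps, [references])
print(f"EBMT prototype BLEU: {ebmt_bleu.score:.2f}")
print(f"Moses BLEU:          {moses_bleu.score:.2f}")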
2015
Evaluating the Impact of Using a Domain-specific Bilingual Lexicon on the Performance of a Hybrid Machine Translation Approach
Nasredine Semmar | Othman Zennaki | Meriama Laib
Proceedings of the International Conference Recent Advances in Natural Language Processing
Improving the Performance of an Example-Based Machine Translation System Using a Domain-specific Bilingual Lexicon
Nasredine Semmar | Othman Zennaki | Meriama Laib
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters
2014
Using cross-language information retrieval and statistical language modelling in example-based machine translation
Nasredine Semmar | Othman Zennaki | Meriama Laib
Proceedings of Translating and the Computer 36
2010
LIMA : A Multilingual Framework for Linguistic Analysis and Linguistic Resources Development and Evaluation
Romaric Besançon | Gaël de Chalendar | Olivier Ferret | Faiza Gara | Olivier Mesnard | Meriama Laïb | Nasredine Semmar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The increasing amount of available textual information makes the use of Natural Language Processing (NLP) tools necessary. These tools have to be used on large collections of documents in different languages, but NLP is a complex task that relies on many processes and resources. As a consequence, NLP tools must be both configurable and efficient, and specific software architectures must be designed for this purpose. We present in this paper the LIMA multilingual analysis platform, developed at CEA LIST. This configurable platform has been designed to develop NLP-based industrial applications while keeping enough flexibility to integrate various processes and resources. This design makes LIMA a linguistic analyzer that can handle languages as different as French, English, German, Arabic or Chinese. Beyond its architecture principles and its capabilities as a linguistic analyzer, LIMA also offers a set of tools dedicated to testing and evaluating linguistic modules and to producing and managing new linguistic resources.
2006
A Deep Linguistic Analysis for Cross-language Information Retrieval
Nasredine Semmar | Meriama Laib | Christian Fluhr
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Cross-language information retrieval consists in providing a query in one language and searching documents in one or several other languages. These documents are ordered by the probability of being relevant to the user's request, the highest-ranked document being considered the most likely to be relevant. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents. This system is composed of a linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The linguistic analysis processes both the documents to be indexed and the queries to extract concepts representing their content. This analysis includes a morphological analysis, part-of-speech tagging and a syntactic analysis. In this paper, we present the deep linguistic analysis used in the LIC2M cross-lingual search engine, focusing in particular on the impact of the syntactic analysis on retrieval effectiveness.
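To illustrate the "weighted Boolean" matching mentioned above, here is a deliberately simplified Python sketch: it scores a document by the intersection of query concepts and document concepts, summing per-concept weights. It is not the LIC2M implementation; the concept representation and the weight values are illustrative assumptions.

# Simplified weighted Boolean matching sketch (not the LIC2M implementation).
def weighted_boolean_score(query_concepts, doc_concepts, concept_weights):
    """Score a document by the weighted intersection of query and document concepts."""
    intersection = set(query_concepts) & set(doc_concepts)
    if not intersection:
        return 0.0  # Boolean part: no shared concept means no match
    return sum(concept_weights.get(c, 1.0) for c in intersection)

# Illustrative usage: concepts stand for normalized lemmas produced by linguistic analysis.
query = {"clitic", "stemmer", "arabic"}
doc = {"arabic", "morphological", "stemmer", "analysis"}
weights = {"arabic": 0.8, "stemmer": 2.1, "clitic": 3.0}  # e.g., rarer concepts weigh more
print(weighted_boolean_score(query, doc, weights))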
Using Stemming in Morphological Analysis to Improve Arabic Information Retrieval
Nasredine Semmar | Meriama Laib | Christian Fluhr
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Information retrieval (IR) consists in finding all relevant documents for a user query in a collection of documents. These documents are ordered by the probability of being relevant to the user’s query. The highest ranked document is considered to be the most likely relevant document. Natural Language Processing (NLP) for IR aims to transform the potentially ambiguous words of queries and documents into unambiguous internal representations on which matching and retrieval can take place. This transformation is generally achieved by several levels of linguistic analysis, morphological, syntactic and so forth. In this paper, we present the Arabic linguistic analyzer used in the LIC2M cross-lingual search engine. We focus on the morphological analyzer and particularly the clitic stemmer which segments the input words into proclitics, simple forms and enclitics. We demonstrate that stemming improves search engine recall and precision.
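As a minimal sketch of the clitic stemming idea described above (segmenting a surface form into proclitic, simple form and enclitic), the following Python function greedily strips the longest matching proclitic and enclitic. The clitic inventories are small illustrative samples, and the greedy longest-match rule is a simplification of the stemmer, which operates within a full morphological analysis.

# Illustrative clitic stemmer sketch; the clitic lists are small samples, not the real resources.
PROCLITICS = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ل", "ك"]
ENCLITICS = ["هما", "كما", "هم", "هن", "كم", "نا", "ها", "ه", "ك", "ي"]

def segment(word, min_stem_len=2):
    """Return (proclitic, stem, enclitic), stripping the longest matching clitics."""
    proclitic = next((p for p in sorted(PROCLITICS, key=len, reverse=True)
                      if word.startswith(p) and len(word) - len(p) >= min_stem_len), "")
    rest = word[len(proclitic):]
    enclitic = next((e for e in sorted(ENCLITICS, key=len, reverse=True)
                     if rest.endswith(e) and len(rest) - len(e) >= min_stem_len), "")
    stem = rest[:len(rest) - len(enclitic)] if enclitic else rest
    return proclitic, stem, enclitic

print(segment("والكتاب"))  # e.g. ('وال', 'كتاب', '')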
Using Cross-language Information Retrieval for Sentence Alignment
Nasredine Semmar | Meriama Laib | Christian Fluhr
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT
Cross-language information retrieval consists in providing a query in one language and searching documents in different languages. Retrieved documents are ordered by the probability of being relevant to the user's request, with the highest-ranked being considered the most relevant. The LIC2M cross-language information retrieval system is a weighted Boolean search engine based on a deep linguistic analysis of the query and the documents to be indexed. This system, designed to work on Arabic, Chinese, English, French, German and Spanish, is composed of a multilingual linguistic analyzer, a statistical analyzer, a reformulator, a comparator and a search engine. The multilingual linguistic analyzer includes a morphological analyzer, a part-of-speech tagger and a syntactic analyzer. In the case of Arabic, a clitic stemmer is added to the morphological analyzer to segment the input words into proclitics, simple forms and enclitics. The linguistic analyzer processes both the documents to be indexed and the queries to produce a set of normalized lemmas, a set of named entities and a set of nominal compounds with their morpho-syntactic tags. The statistical analyzer computes, for the documents to be indexed, concept weights based on concept frequencies in the database. The comparator computes intersections between queries and documents and provides a relevance weight for each intersection. Before this comparison, the reformulator expands queries during the search. The expansion is used to infer, from the original query words, other words expressing the same concepts; it can be in the same language or in different languages. The search engine retrieves the ranked, relevant documents from the indexes according to the corresponding reformulated query and then merges the results obtained for each language, taking into account the original words of the query and their weights in order to score the documents.

Sentence alignment consists in estimating which sentence or sentences in the source language correspond to which sentence or sentences in the target language. We present in this paper a new approach to aligning sentences from a parallel corpus based on the LIC2M cross-language information retrieval system. This approach consists in building a database of the sentences of the target text and considering each sentence of the source text as a "query" to that database. The aligned bilingual parallel corpora can be used as a translation memory in a computer-aided translation tool.
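The alignment strategy sketched in this abstract (index the target sentences, then treat each source sentence as a query) can be illustrated with the short Python script below. A TF-IDF index over lexicon-translated query words stands in for the LIC2M cross-language retrieval engine, so the toy bilingual lexicon, the toy data and the 1-to-1 alignment assumption are all illustrative simplifications.

# Sketch: align each source sentence to its best-matching target sentence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy bilingual lexicon standing in for the cross-language reformulation step.
LEXICON = {"parliament": "parlement", "approved": "approuvé", "the": "le",
           "budget": "budget", "session": "session", "is": "est", "open": "ouverte"}

def translate_query(sentence):
    """Map source words through the bilingual lexicon (unknown words are kept as-is)."""
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())

source = ["The parliament approved the budget", "The session is open"]
target = ["La session est ouverte", "Le parlement a approuvé le budget"]

vectorizer = TfidfVectorizer()
target_index = vectorizer.fit_transform(target)  # the "database" of target sentences
queries = vectorizer.transform([translate_query(s) for s in source])

similarities = cosine_similarity(queries, target_index)
for i, row in enumerate(similarities):
    best = row.argmax()
    print(f"{source[i]!r} -> {target[best]!r} (score {row[best]:.2f})")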