Djamel Mostefa

Also published as: D. Mostefa

2017

This paper presents an alert system built on the mass of data coming from Twitter. The goal of the tool is to monitor the news across several pilot domains, including sporting events and natural disasters. The results of this monitoring are delivered to the user through a web interface that shows the list of events located on a map.

Robustesse et portabilités multilingue et multi-domaines des systèmes de compréhension de la parole : les corpus du projet PortMedia (Robustness and portability of spoken language understanding systems among languages and domains : the PORTMEDIA project) [in French]
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina Rojas-Barahona
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

This paper aims at giving an overview of ELRA's recent activities. The first part elaborates on ELRA's means of boosting the sharing of Language Resources (LRs) within the HLT community through its catalogues, the LRE-Map initiative, as well as its work towards the integration of its LRs within the META-SHARE open infrastructure. The second part shows how ELRA helps in the development and evaluation of HLT, in particular through its participation in numerous collaborative projects for the production of resources and platforms to facilitate their production and exploitation. A third part focuses on ELRA's work for clearing IPR issues in an HLT-oriented context, one of its latest initiatives being its involvement in a Fair Research Act proposal to promote easy access to LRs for the widest community. Finally, the last part elaborates on recent actions for disseminating information and promoting cooperation in the field, e.g. the Language Library being launched at LREC2012 and the creation of an International Standard LR Number, a unique LR identifier to enable the accurate identification of LRs. Among the other messages ELRA will be conveying to the attendees are the announcement of a set of freely available resources, the establishment of a LR and Evaluation forum, etc.

Leveraging study of robustness and portability of spoken language understanding systems across languages and domains: the PORTMEDIA corpora
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina M. Rojas-Barahona
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The PORTMEDIA project is intended to develop new corpora for the evaluation of spoken language understanding systems. The newly collected data are in the field of human-machine dialogue systems for tourist information in French, in line with the MEDIA corpus. Transcriptions and semantic annotations, obtained by low-cost procedures, are provided to allow a thorough evaluation of the systems' capabilities in terms of robustness and portability across languages and domains. A new test set with some adaptation data is prepared for each case: in Italian as an example of a new language, and for ticket reservation as an example of a new domain. Finally, the work is complemented by the proposal of a new high-level semantic annotation scheme well suited to dialogue data.

New language resources for the Pashto language
Djamel Mostefa | Khalid Choukri | Sylvie Brunessaux | Karim Boudahmane
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly, a monolingual text corpus of 100 million words is produced. Secondly, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.

2010

Question Answering (QA) technology aims at providing relevant answers to natural language questions. Most Question Answering research has focused on mining document collections containing written texts to answer written questions. In addition to written sources, a large (and growing) amount of potentially interesting information appears in spoken documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track (Question-Answering on Speech Transcripts) was introduced in CLEF to investigate the problem of question answering in such audio documents. This paper describes in detail the evaluation protocol and tools designed and developed for the CLEF-QAST evaluation campaigns that took place between 2007 and 2009. We first recall the data, question sets, and submission procedures that were produced or set up during these three campaigns. As for the evaluation procedure, the interface that was developed to ease the assessors' work is described. In addition, this paper introduces a methodology for a semi-automatic evaluation of QAST systems based on time slot comparisons. Finally, the QAST Evaluation Package 2007-2009 resulting from these evaluation campaigns is also introduced.

Annotations for Opinion Mining Evaluation in the Industrial Context of the DOXA project
Patrick Paroubek | Alexander Pak | Djamel Mostefa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

After presenting the state of the art in opinion and sentiment analysis and the DOXA project, we review the few evaluation campaigns that have dealt with opinion mining in the past. Then we present the two-level opinion and sentiment model that we will use for evaluation in the DOXA project and the annotation interface we use for hand-annotating a reference corpus. We then present the corpus which will be used in DOXA and report on the hand-annotation task on a corpus of comments on video games and the solution adopted to obtain a sufficient level of inter-annotator agreement.

2009

2008

The Impact of Reference Quality on Automatic MT Evaluation
Olivier Hamon | Djamel Mostefa
Coling 2008: Companion volume: Posters

Data Collection for the CHIL CLEAR 2007 Evaluation Campaign
Nicolas Moreau | Djamel Mostefa | Rainer Stiefelhagen | Susanne Burger | Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes in detail the data that was collected and annotated during the third and final year of the CHIL project. This data was used for the CLEAR evaluation campaign in spring 2007. The paper also introduces the CHIL Evaluation Package 2007 that resulted from this campaign including a complete description of the performed evaluation tasks. This evaluation package will be made available to the community through the ELRA General Catalogue.

PASSAGE: from French Parser Evaluation to Large Sized Treebank
Éric Villemonte de la Clergerie | Olivier Hamon | Djamel Mostefa | Christelle Ayache | Patrick Paroubek | Anne Vilnat
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present the PASSAGE project, which aims at automatically building a large French treebank by combining the output of several parsers, using the EASY annotation scheme. We also present the results of the first evaluation campaign of the project and the preliminary results we have obtained with our ROVER procedure for combining parsers automatically.

Quick Rich Transcriptions of Arabic Broadcast News Speech Data
Chomicha Bendahman | Meghan Glenn | Djamel Mostefa | Niklas Paulsson | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the collection and transcription of a large set of Arabic broadcast news speech data. A total of more than 2000 hours of data was transcribed. The transcription factor for the broadcast news data was reduced by using a method such as Quick Rich Transcription (QRTR) and by reducing the number of quality controls performed on the data. The data was collected from several Arabic TV and radio sources and covers both Modern Standard Arabic and dialectal Arabic. The orthographic transcriptions included segmentation, speaker turns, topics, sentence unit types and a minimal noise mark-up. The transcripts were produced as part of the GALE project.

Question Answering on Speech Transcriptions: the QAST evaluation in CLEF
Lori Lamel | Sophie Rosset | Christelle Ayache | Djamel Mostefa | Jordi Turmo | Pere Comas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports on the QAST track of CLEF, which aims to evaluate Question Answering on Speech Transcriptions. Accessing information in spoken documents poses additional challenges compared with text-based QA, since it requires addressing the characteristics of spoken language as well as errors in the case of automatic transcriptions of spontaneous speech. The framework and results of the pilot QAst evaluation held as part of CLEF 2007 are described, illustrating some of the additional challenges posed by QA in spoken documents relative to written ones. The current plans for future multiple-language and multiple-task QAst evaluations are described.

The INFILE Project: a Crosslingual Filtering Systems Evaluation Campaign
Romaric Besançon | Stéphane Chaudiron | Djamel Mostefa | Ismaïl Timimi | Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The InFile project (INformation, FILtering, Evaluation) is a cross-language adaptive filtering evaluation campaign, sponsored by the French National Research Agency. The campaign is organized by the CEA LIST, ELDA and the University of Lille3-GERiiCO. It has an international scope as it is a pilot track of the CLEF 2008 campaigns. The corpus is built from a collection of about 1.4 million newswires (10 GB) in three languages (Arabic, English and French), provided by the French news agency Agence France-Presse (AFP) and selected from a 3-year period. The profiles corpus is made of 50 profiles, of which 30 concern general news and events (national and international affairs, politics, sports, etc.) and 20 concern scientific and technical subjects.

An Experimental Methodology for an End-to-End Evaluation in Speech-to-Speech Translation
Olivier Hamon | Djamel Mostefa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the evaluation methodology used to evaluate the TC-STAR speech-to-speech translation (SST) system and the results from the third year of the project. It follows the results presented in Hamon (2007), dealing with the first end-to-end evaluation of the project. In this paper, we experiment with the methodology and the protocol during a second end-to-end evaluation, by comparing outputs from the TC-STAR system with interpreters from the European Parliament. For this purpose, we test different evaluation criteria and types of questions within a comprehension test. The results show that interpreters do not translate all the information (as opposed to the automatic system), but the quality of SST is still far from that of human translation. The experimental comprehension test used provides new information to study the quality of automatic systems, but without settling the issue of which protocol is the best. This depends on what the evaluator wants to know about the SST: either to have a subjective end-user evaluation or a more objective one.

New Telephone Speech Databases for French: a Children Database and an optimized Adult Corpus
Djamel Mostefa | Arnaud Vallee
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the results of the NEOLOGOS project: a children database and an optimized adult database for the French language. A new approach was adopted for the collection of the adult database in order to enable the development of new algorithms in the field of speech processing (study of speaker characteristics, speaker similarity, speaker selection algorithms, etc.). The objective here was to define and to carry out a new methodology for collecting significant quantities of speaker-dependent data, for a significant number of speakers, as was done for several databases oriented towards speaker verification, but with the additional constraint of maximising the coverage of the space of all speakers. The children database is made of 1,000 sessions recorded by children between 7 and 16 years old. Both speech databases are SpeechDat-compliant, meaning that they can be easily used for research and development in the field of speech technology.

2007

End-to-end evaluation of a speech-to-speech translation system in TC-STAR
Olivier Hamon | Djamel Mostefa | Khalid Choukri
Proceedings of Machine Translation Summit XI: Papers

MT evaluation & TC-STAR
Khalid Choukri | Olivier Hamon | Djamel Mostefa
Proceedings of the Workshop on Automatic procedures in MT evaluation

2006

TC-STAR: New language resources for ASR and SLT purposes
Henk van den Heuvel | Khalid Choukri | Christian Gollan | Asuncion Moreno | Djamel Mostefa
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In TC-STAR a variety of Language Resources (LR) is being produced. In this contribution we address the resources that have been created for Automatic Speech Recognition and Spoken Language Translation. As yet, these are 14 LR in total: two training SLR for ASR (English and Spanish), three development LR and three evaluation LR for ASR (English, Spanish, Mandarin), and three development LR and three evaluation LR for SLT (English-Spanish, Spanish-English, Mandarin-English). In this paper we describe the properties, validation, and availability of these resources.

The aim of the Media-Evalda project is to evaluate the understanding capabilities of dialog systems. This paper presents the Media protocol for speech understanding evaluation and describes the results of the June 2005 literal evaluation campaign. Five systems, both symbolic and corpus-based, participated in the evaluation, which is based on a common semantic representation. Different scorings have been performed on the system results. The understanding error rate for the Full scoring ranges, depending on the system, from 29% to 41.3%. A diagnostic analysis of these results is proposed.

Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News
S. Galliano | E. Geoffrois | G. Gravier | J.-F. Bonastre | D. Mostefa | K. Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents the audio corpus developed in the framework of the ESTER evaluation campaign of French broadcast news transcription systems. This corpus includes 100 hours of manually annotated recordings and 1,677 hours of non-transcribed data. The manual annotations include the detailed verbatim orthographic transcription, the speaker turns and identities, information about acoustic conditions, and named entities. Additional resources generated by automatic speech processing systems, such as phonetic alignments and word graphs, are also described.

Evaluation of Automatic Speech Recognition and Speech Language Translation within TC-STAR: Results from the first evaluation campaign
Djamel Mostefa | Olivier Hamon | Khalid Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper reports on the evaluation activities conducted in the first year of the TC-STAR project. The TC-STAR project, financed by the European Commission within the Sixth Framework Program, is envisaged as a long-term effort to advance research in the core technologies of Speech-to-Speech Translation (SST). SST technology is a combination of Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text To Speech (TTS).

Evaluation of multimodal components within CHIL: The evaluation packages and results
Djamel Mostefa | Marie-Neige Garcia | Khalid Choukri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This article describes the first CHIL evaluation campaign, in which 12 technologies were evaluated. The major outcomes of the first evaluation campaign are the so-called Evaluation Packages. An evaluation package is the full documentation (definition and description of the evaluation methodologies, protocols and metrics) alongside the data sets and software scoring tools, which an organisation needs in order to perform the evaluation of one or more systems for a given technology. These evaluation packages will be made available to the community through the ELDA General Catalogue.

2004

The aim of the MEDIA project is to design and test a methodology for the evaluation of context-dependent and independent spoken dialogue systems. We propose an evaluation paradigm based on the use of test suites from real-world corpora and a common semantic representation and common metrics. This paradigm should allow us to diagnose the context-sensitive understanding capability of dialogue systems. This paradigm will be used within an evaluation campaign involving several sites, all of which will carry out the task of querying information from a database.
