Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language.X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
We present GeSERA, an open-source improved version of SERA for evaluating automatic extractive and abstractive summaries from the general domain. SERA is based on a search engine that compares candidate and reference summaries (called queries) against an information retrieval document base (called index). SERA was originally designed for the biomedical domain only, where it showed a better correlation with manual methods than the widely used lexical-based ROUGE method. In this paper, we take out SERA from the biomedical domain to the general one by adapting its content-based method to successfully evaluate summaries from the general domain. First, we improve the query reformulation strategy with POS Tags analysis of general-domain corpora. Second, we replace the biomedical index used in SERA with two article collections from AQUAINT-2 and Wikipedia. We conduct experiments with TAC2008, TAC2009, and CNNDM datasets. Results show that, in most cases, GeSERA achieves higher correlations with manual evaluation methods than SERA, while it reduces its gap with ROUGE for general-domain summary evaluation. GeSERA even surpasses ROUGE in two cases of TAC2009. Finally, we conduct extensive experiments and provide a comprehensive study of the impact of human annotators and the index size on summary evaluation with SERA and GeSERA.
Data-driven approaches for creating virtual patient dialogue systems require the availability of large data specific to the language,domain and clinical cases studied. Based on the lack of dialogue corpora in French for medical education, we propose an annotatedcorpus of dialogues including medical consultation interactions between doctor and patient. In this work, we detail the building processof the proposed dialogue corpus, describe the annotation guidelines and also present the statistics of its contents. We then conducted aquestion categorization task to evaluate the benefits of the proposed corpus that is made publicly available.
Cet article décrit Iagotchi, un personnage virtuel philosophique et artistique qui apprend et développe des connaissances à partir de ses interactions avec l’humain. Iagotchi se présente à la fois comme un apprenant et un expert avec comme objectifs principaux (1) d’accompagner l’homme dans ses questionnements, (2) de lui fournir des réponses pertinentes sur la base de ses requêtes et (3) de générer des textes poétiques cohérents. Dans ce travail, nous décrivons l’architecture du système de Iagotchi et les composants clés tels que le moteur de conversation, le gestionnaire de sujets et le générateur de poésies.
Dans le contexte médical, un patient ou médecin virtuel dialoguant permet de former les apprenants au diagnostic médical via la simulation de manière autonome. Dans ce travail, nous avons exploité les propriétés sémantiques capturées par les représentations distribuées de mots pour la recherche de questions similaires dans le système de dialogues d’un agent conversationnel médical. Deux systèmes de dialogues ont été créés et évalués sur des jeux de données collectées lors des tests avec les apprenants. Le premier système fondé sur la correspondance de règles de dialogue créées à la main présente une performance globale de 92% comme taux de réponses cohérentes sur le cas clinique étudié tandis que le second système qui combine les règles de dialogue et la similarité sémantique réalise une performance de 97% de réponses cohérentes en réduisant de 7% les erreurs de compréhension par rapport au système de correspondance de règles.
Following Gillick and Favre (2009), a lot of work about extractive summarization has modeled this task by associating two contrary constraints: one aims at maximizing the coverage of the summary with respect to its information content while the other represents its size limit. In this context, the notion of redundancy is only implicitly taken into account. In this article, we extend the framework defined by Gillick and Favre (2009) by examining how and to what extent integrating semantic sentence similarity into an update summarization system can improve its results. We show more precisely the impact of this strategy through evaluations performed on DUC 2007 and TAC 2008 and 2009 datasets.
multi-document Maâli Mnasri1, 2 Gaël de Chalendar1 Olivier Ferret1 (1) CEA, LIST, Laboratoire Vision et Ingénierie des Contenus, Gif-sur-Yvette, F-91191, France. (2) Université Paris-Sud, Université Paris-Saclay, F-91405 Orsay, France. maali.mnasri@cea.fr, gael.de-chalendar@cea.fr, olivier.ferret@cea.fr R ÉSUMÉ À la suite des travaux de Gillick & Favre (2009), beaucoup de travaux portant sur le résumé par extraction se sont appuyés sur une modélisation de cette tâche sous la forme de deux contraintes antagonistes : l’une vise à maximiser la couverture du résumé produit par rapport au contenu des textes d’origine tandis que l’autre représente la limite du résumé en termes de taille. Dans cette approche, la notion de redondance n’est prise en compte que de façon implicite. Dans cet article, nous reprenons le cadre défini par Gillick & Favre (2009) mais nous examinons comment et dans quelle mesure la prise en compte explicite de la similarité sémantique des phrases peut améliorer les performances d’un système de résumé multi-document. Nous vérifions cet impact par des évaluations menées sur les corpus DUC 2003 et 2004.
VerbNet is an English lexical resource for verbs that has proven useful for English NLP due to its high coverage and coherent classification. Such a resource doesnt exist for other languages, despite some (mostly automatic and unsupervised) attempts. We show how to semi-automatically adapt VerbNet using existing resources designed for diï¬erent purposes. This study focuses on French and uses two French resources: a semantic lexicon (Les Verbes Français) and a syntactic lexicon (Lexique-Grammaire).
At CEA LIST, we have decided to release our multilingual analyzer LIMA as Free software. As we were not proprietary of all the language resources used we had to select and adapt free ones in order to attain results good enough and equivalent to those obtained with our previous ones. For English and French, we found and adapted a full-form dictionary and an annotated corpus for learning part-of-speech tagging models.
The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.
WordNet, une des ressources lexicales les plus utilisées aujourd’hui a été constituée en anglais et les chercheurs travaillant sur d’autres langues souffrent du manque d’une telle ressource. Malgré les efforts fournis par la communauté française, les différents WordNets produits pour la langue française ne sont toujours pas aussi exhaustifs que le WordNet de Princeton. C’est pourquoi nous proposons une méthode novatrice dans la production de termes nominaux instanciant les différents synsets de WordNet en exploitant les propriétés syntaxiques distributionnelles du vocabulaire français. Nous comparons la ressource que nous obtenons avecWOLF et montrons que notre approche offre une couverture plus large.
Semantic Role Labeling cannot be performed without an associated linguistic resource. A key resource for such a task is the FrameNet resource based on Fillmores theory of frame semantics. Like many linguistic resources, FrameNet has been built by English native speakers for the English language. To overcome the lack of such resources in other languages, we propose a new approach to FrameNet translation by using bilingual dictionaries and filtering the wrong translations. We define six scores to filter, based on translation redundancy and FrameNet structure. We also present our work on the enrichment of the obtained resource with nouns. This enrichment uses semantic spaces built on syntactical dependencies and a multi-represented k-NN classifier. We evaluate both the tasks on the French language over a subset of ten frames and show improved results compared to the existing French FrameNet. Our final resource contains 15,132 associations lexical units-frames for an estimated precision of 86%.
The increasing amount of available textual information makes necessary the use of Natural Language Processing (NLP) tools. These tools have to be used on large collections of documents in different languages. But NLP is a complex task that relies on many processes and resources. As a consequence, NLP tools must be both configurable and efficient: specific software architectures must be designed for this purpose. We present in this paper the LIMA multilingual analysis platform, developed at CEA LIST. This configurable platform has been designed to develop NLP based industrial applications while keeping enough flexibility to integrate various processes and resources. This design makes LIMA a linguistic analyzer that can handle languages as different as French, English, German, Arabic or Chinese. Beyond its architecture principles and its capabilities as a linguistic analyzer, LIMA also offers a set of tools dedicated to the test and the evaluation of linguistic modules and to the production and the management of new linguistic resources.
La fiabilité des réponses qu’il propose, ou un moyen de l’estimer, est le meilleur atout d’un système de question-réponse. A cette fin, nous avons choisi d’effectuer des recherches dans des ensembles de documents différents et de privilégier des résultats qui sont trouvés dans ces différentes sources. Ainsi, le système QALC travaille à la fois sur une collection finie d’articles de journaux et sur le Web.