Text generation opens up new prospects for overcoming the lack of open corpora in fields such as healthcare, where data sharing is bound by confidentiality. In this study, we compare the performance of encoder-decoder and decoder-only language models for the controlled generation of clinical cases in French. To do so, we fine-tuned several pre-trained models on French clinical cases for each architecture and generate clinical cases conditioned by patient demographic information (gender and age) and clinical features.Our results suggest that encoder-decoder models are easier to control than decoder-only models, but more costly to train.
This paper presents the creation of Hostomytho, a game with a purpose intended for evaluating the quality of synthetic biomedical texts through multiple mini-games. Hostomytho was developed entirely using open source technologies both for internet browser and mobile platforms (IOS & Android). The code and the annotations created for synthetic clinical cases in French will be made freely available.
Cet article présente une étude sur l’utilisation de modèles de prédiction de liens pour l’enrichissement de graphes lexico-sémantiques du français. Celle-ci porte sur deux graphes, RezoJDM16k et RL-fr et sept modèles de prédiction de liens. Nous avons étudié les prédictions du modèle le plus performant, afin d’extraire de potentiels nouveaux triplets en utilisant un score de confiance que nous avons évalué avec des annotations manuelles. Nos résultats mettent en évidence des avantages différentspour le graphe dense RezoJDM16k par rapport à RL-fr, plus clairsemé. Si l’ajout de nouveaux triplets à RezoJDM16k offre des avantages limités, RL-fr peut bénéficier substantiellement de notre approche.
Nous proposons un outil pour mesurer automatiquement les biais de genre dans des textes générés par des grands modèles de langue dans des langues flexionnelles. Nous évaluons sept modèles à l’aide de 52 000 textes en français et 2 500 textes en italien, pour la rédaction de lettres de motivation. Notre outil s’appuie sur la détection de marqueurs morpho-syntaxiques de genre pour mettre au jour des biais. Ainsi, les modèles favorisent largement la génération de masculin : le genre masculin est deux fois plus présent que le féminin en français, et huit fois plus en italien. Les modèles étudiés exacerbent également des stéréotypes attestés en sociologie en associant les professions stéréotypiquement féminines aux textes au féminin, et les professions stéréotypiquement masculines aux textes au masculin.
La génération de texte ouvre des perspectives pour pallier l’absence de corpus librement partageables dans des domaines contraints par la confidentialité, comme le domaine médical. Dans cette étude, nous comparons les performances de modèles encodeurs-décodeurs et décodeurs seuls pour la génération conditionnée de cas cliniques en français. Nous affinons plusieurs modèles pré-entraînés pour chaque architecture sur des cas cliniques en français conditionnés par les informations démographiques des patient·es (sexe et âge) et des éléments cliniques.Nous observons que les modèles encodeur-décodeurs sont plus facilement contrôlables que les modèles décodeurs seuls, mais plus coûteux à entraîner.
This paper presents a resource-centric study of link prediction approaches over French lexical-semantic graphs. Our study incorporates two graphs, RezoJDM16k and RL-fr, and we evaluated seven link prediction models, with CompGCN-ConvE emerging as the best performer. We also conducted a qualitative analysis of the predictions using manual annotations. Based on this, we found that predictions with higher confidence scores were more valid for inclusion. Our findings highlight different benefits for the dense graph compared to the sparser graph RL-fr. While the addition of new triples to RezoJDM16k offers limited advantages, RL-fr can benefit substantially from our approach.
Named Entity Recognition (NER) is an applicative task for which annotation schemes vary. To compare the performance of systems which tagsets differ in precision and coverage, it is necessary to assess (i) the comparability of their annotation schemes and (ii) the individual adequacy of the latter to a common annotation scheme. What is more, and given the lack of robustness of some tools towards textual variation, we cannot expect an evaluation led on an homogeneous corpus with low-coverage to provide a reliable prediction of the actual tools performance. To tackle both these limitations in evaluation, we provide a gold corpus for French covering 6 textual genres and annotated with a rich tagset that enables comparison with multiple annotation schemes. We use the flexibility of this gold corpus to provide both: (i) an individual evaluation of four heterogeneous NER systems on their target tagsets, (ii) a comparison of their performance on a common scheme. This rich evaluation framework enables a fair comparison of NER systems across textual genres and annotation schemes.
Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.
Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. As one of the big players in the field of NLP, together with governments and universities, it is important to track the influence of industry on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors has been steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.
In sensitive domains, the sharing of corpora is restricted due to confidentiality, copyrights or trade secrets. Automatic text generation can help alleviate these issues by producing synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. In this study, we assess the usability of synthetic corpus as a substitute training corpus for clinical information extraction. Our goal is to automatically produce a clinical case corpus annotated with clinical entities and to evaluate it for a named entity recognition (NER) task. We use two auto-regressive neural models partially or fully trained on generic French texts and fine-tuned on clinical cases to produce a corpus of synthetic clinical cases. We study variants of the generation process: (i) fine-tuning on annotated vs. plain text (in that case, annotations are obtained a posteriori) and (ii) selection of generated texts based on models parameters and filtering criteria. We then train NER models with the resulting synthetic text and evaluate them on a gold standard clinical corpus. Our experiments suggest that synthetic text is useful for clinical NER.
With NLP research now quickly being transferred into real-world applications, it is important to be aware of and think through the consequences of our scientific investigation. Such ethical considerations are important in both authoring and reviewing. This tutorial will equip participants with basic guidelines for thinking deeply about ethical issues and review common considerations that recur in NLP research. The methodology is interactive and participatory, including case studies and working in groups. Importantly, the participants will be co-building the tutorial outcomes and will be working to create further tutorial materials to share as public outcomes.
Au début du XXIe siècle, le français faisait encore partie des langues peu dotées. Grâce aux efforts de la communauté française du traitement automatique des langues (TAL), de nombreuses ressources librement disponibles ont été produites, dont des lexiques du français. À travers cet article, nous nous intéressons à leur devenir dans la communauté par le prisme des actes de la conférence TALN sur une période de 20 ans.
Les ressources textuelles disponibles dans le domaine biomédical sont rares pour des raisons de confidentialité. Des données existent mais ne sont pas partageables, c’est pourquoi il est intéressant de s’inspirer de ces données pour en générer de nouvelles sans contrainte de partage. Une difficulté majeure de la génération de données médicales est que les données générées doivent ressembler aux données originales sans compromettre leur confidentialité. L’évaluation de cette tâche est donc difficile. Dans cette étude, nous étendons l’évaluation de corpus cliniques générés en français en y ajoutant une dimension sémantique à l’aide de plongements de phrases. Nous recherchons des phrases proches à l’aide de similarité cosinus entre plongements, et analysons les scores de similarité. Nous observons que les phrases synthétiques sont thématiquement proches du corpus original, mais suffisamment éloignées pour ne pas être de simples reformulations qui compromettraient la confidentialité.
Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting. Much work on biases in natural language processing has addressed biases linked to the social and cultural experience of English speaking individuals in the United States. We seek to widen the scope of bias studies by creating material to measure social bias in language models (LMs) against specific demographic groups in France. We build on the US-centered CrowS-pairs dataset to create a multilingual stereotypes dataset that allows for comparability across languages while also characterizing biases that are specific to each country and language. We introduce 1,679 sentence pairs in French that cover stereotypes in ten types of bias like gender and age. 1,467 sentence pairs are translated from CrowS-pairs and 212 are newly crowdsourced. The sentence pairs contrast stereotypes concerning underadvantaged groups with the same sentence concerning advantaged groups. We find that four widely used language models (three French, one multilingual) favor sentences that express stereotypes in most bias categories. We report on the translation process from English into French, which led to a characterization of stereotypes in CrowS-pairs including the identification of US-centric cultural traits. We offer guidelines to further extend the dataset to other languages and cultural environments.
Nous présentons ici FENEC (FrEnch Named-entity Evaluation Corpus), un corpus à échantillons équilibrés contenant six genres, annoté en entités nommées selon le schéma fin Quæro. Les caractéristiques de ce corpus nous permettent d’évaluer et de comparer trois outils d’annotation automatique — un à base de règles et deux à base de réseaux de neurones — en jouant sur trois dimensions : la finesse du jeu d’étiquettes, le genre des corpus, et les métriques d’évaluation.
Cet article étudie l’application de la #RègledeBender dans des articles de traitement automatique des langues (TAL), en prenant en compte une dimension contrastive, par l’examen des actes de deux conférences du domaine, TALN et ACL, et une dimension diachronique, en examinant ces conférences au fil du temps. Un échantillon d’articles a été annoté manuellement et deux classifieurs ont été développés afin d’annoter automatiquement les autres articles. Nous quantifions ainsi l’application de la #RègledeBender, et mettons en évidence un léger mieux en faveur de TALN sur cet aspect.
Le TAL repose sur la disponibilité de corpus annotés pour l’entraînement et l’évaluation de modèles. Il existe très peu de ressources pour la similarité sémantique dans le domaine clinique en français. Dans cette étude, nous proposons une définition de la similarité guidée par l’analyse clinique et l’appliquons au développement d’un nouveau corpus partagé de 1 000 paires de phrases annotées manuellement en scores de similarité. Nous évaluons ensuite le corpus par des expériences de mesure automatique de similarité. Nous montrons ainsi qu’un modèle de plongements de phrases peut capturer la similarité avec des performances à l’état de l’art sur le corpus DEFT STS (Spearman=0,8343). Nous montrons également que le contenu du corpus CLISTER est complémentaire de celui de DEFT STS.
Afin de permettre l’étude des biais en traitement automatique de la langue au delà de l’anglais américain, nous enrichissons le corpus américain CrowS-pairs de 1 677 paires de phrases en français représentant des stéréotypes portant sur dix catégories telles que le genre. 1 467 paires de phrases sont traduites à partir de CrowS-pairs et 210 sont nouvellement recueillies puis traduites en anglais. Selon le principe des paires minimales, les phrases du corpus contrastent un énoncé stéréotypé concernant un groupe défavorisé et son équivalent pour un groupe favorisé. Nous montrons que quatre modèles de langue favorisent les énoncés qui expriment des stéréotypes dans la plupart des catégories. Nous décrivons le processus de traduction et formulons des recommandations pour étendre le corpus à d’autres langues. Attention : Cet article contient des énoncés de stéréotypes qui peuvent être choquants.
This article studies the application of the #BenderRule in Natural Language Processing (NLP) articles according to two dimensions. Firstly, in a contrastive manner, by considering two major international conferences, LREC and ACL, and secondly, in a diachronic manner, by inspecting nearly 14,000 articles over a period of time ranging from 2000 to 2020 for LREC and from 1979 to 2020 for ACL. For this purpose, we created a corpus from LREC and ACL articles from the above-mentioned periods, from which we manually annotated nearly 1,000. We then developed two classifiers to automatically annotate the rest of the corpus. Our results show that LREC articles tend to respect the #BenderRule (80 to 90% of them respect it), whereas 30 to 40% of ACL articles do not. Interestingly, over the considered periods, the results appear to be stable for the two conferences, even though a rebound in ACL 2020 could be a sign of the influence of the blog post about the #BenderRule.
This paper describes the continuation of a project that aims at establishing an interoperable annotation schema for quantification phenomena as part of the ISO suite of standards for semantic annotation, known as the Semantic Annotation Framework. After a break, caused by the Covid-19 pandemic, the project was relaunched in early 2022 with a second working draft of an annotation scheme, which is discussed in this paper. Keywords: semantic annotation, quantification, interoperability, annotation schema, ISO standard
Modern Natural Language Processing relies on the availability of annotated corpora for training and evaluating models. Such resources are scarce, especially for specialized domains in languages other than English. In particular, there are very few resources for semantic similarity in the clinical domain in French. This can be useful for many biomedical natural language processing applications, including text generation. We introduce a definition of similarity that is guided by clinical facts and apply it to the development of a new French corpus of 1,000 sentence pairs manually annotated according to similarity scores. This new sentence similarity corpus is made freely available to the community. We further evaluate the corpus through experiments of automatic similarity measurement. We show that a model of sentence embeddings can capture similarity with state-of-the-art performance on the DEFT STS shared task evaluation data set (Spearman=0.8343). We also show that the corpus is complementary to DEFT STS.
There is a growing interest in the evaluation of bias, fairness and social impact of Natural Language Processing models and tools. However, little resources are available for this task in languages other than English. Translation of resources originally developed for English is a promising research direction. However, there is also a need for complementing translated resources by newly sourced resources in the original languages and social contexts studied. In order to collect a language resource for the study of biases in Language Models for French, we decided to resort to citizen science. We created three tasks on the LanguageARC citizen science platform to assist with the translation of an existing resource from English into French as well as the collection of complementary resources in native French. We successfully collected data for all three tasks from a total of 102 volunteer participants. Participants from different parts of the world contributed and we noted that although calls sent to mailing lists had a positive impact on participation, some participants pointed barriers to contributions due to the collection platform.
The reviewing procedure has been identified as one of the major issues in the current situation of the NLP field. While it is implicitly assumed that junior researcher learn reviewing during their PhD project, this might not always be the case. Additionally, with the growing NLP community and the efforts in the context of widening the NLP community, researchers joining the field might not have the opportunity to practise reviewing. This tutorial fills in this gap by providing an opportunity to learn the basics of reviewing. Also more experienced researchers might find this tutorial interesting to revise their reviewing procedure.
This paper details experiments we performed on the Universal Dependencies 2.7 corpora in order to investigate the dominant word order in the available languages. For this purpose, we used a graph rewriting tool, GREW, which allowed us to go beyond the surface annotations and identify the implicit subjects. We first measured the distribution of the six different word orders (SVO, SOV, VSO, VOS, OVS, OSV) in the corpora and investigated when there was a significant difference in the corpora within a given language. Then, we compared the obtained results with information provided in the WALS database (Dryer and Haspelmath, 2013) and in ( ̈Ostling, 2015). Finally, we examined the impact of using a graph rewriting tool for this task. The tools and resources used for this research are all freely available.
This tutorial will cover the theory and practice of reviewing research in natural language processing. Heavy reviewing burdens on natural language processing researchers have made it clear that our community needs to increase the size of our pool of potential reviewers. Simultaneously, notable “false negatives”—rejection by our conferences of work that was later shown to be tremendously important after acceptance by other conferences—have raised awareness of the fact that our reviewing practices leave something to be desired. We do not often talk about “false positives” with respect to conference papers, but leaders in the field have noted that we seem to have a publication bias towards papers that report high performance, with perhaps not much else of interest in them. It need not be this way. Reviewing is a learnable skill, and you will learn it here via lectures and a considerable amount of hands-on practice.
Nous présentons ici les résultats d’un travail de réplication et d’extension pour l’alsacien d’une expérience concernant l’étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux (Magistry et al., 2018). Ce travail a été réalisé en étroite collaboration avec les auteurs de l’article d’origine. Cette interaction riche nous a permis de mettre au jour les éléments manquants dans la présentation de l’expérience, de les compléter, et d’étendre la recherche à la robustesse à la variation.
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.
We present here Rigor Mortis, a gamified crowdsourcing platform designed to evaluate the intuition of the speakers, then train them to annotate multi-word expressions (MWEs) in French corpora. We previously showed that the speakers’ intuition is reasonably good (65% in recall on non-fixed MWE). We detail here the annotation results, after a training phase using some of the tests developed in the PARSEME-FR project.
Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.
With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general. Using a survey, in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations, that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors’, reviewers’ and community’s side.
Building representative linguistic resources and NLP tools for non-standardized languages is challenging: when spelling is not determined by a norm, multiple written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourced alternative spellings we use to extract rules applied to match OOV words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without expert rule definition. We apply this multilingual methodology on Alsatian, a French regional language and provide an intrinsic evaluation of the correctness of the variants pairs, and an extrinsic evaluation on a downstream task. We show that in a low-resource scenario, 145 inital pairs can lead to the generation of 876 additional variant pairs, and a diminution of OOV words improving the part-of-speech tagging performance by 1 to 4%.
This article presents the results we obtained in crowdsourcing French speakers’ intuition concerning multi-work expressions (MWEs). We developed a slightly gamified crowdsourcing platform, part of which is designed to test users’ ability to identify MWEs with no prior training. The participants perform relatively well at the task, with a recall reaching 65% for MWEs that do not behave as function words.
Nous présentons ici les résultats d’une expérience menée sur l’annotation en parties du discours d’un corpus d’une langue régionale encore peu dotée, l’alsacien, via une plateforme de myriadisation (crowdsourcing) bénévole développée spécifiquement à cette fin : Bisame1 . La plateforme, mise en ligne en mai 2016, nous a permis de recueillir 15 846 annotations grâce à 42 participants. L’évaluation des annotations, réalisée sur un corpus de référence, montre que la F-mesure des annotations volontaires est de 0, 93. Le tagger entraîné sur le corpus annoté atteint lui 82 % d’exactitude. Il s’agit du premier tagger spécifique à l’alsacien. Cette méthode de développement de ressources langagières est donc efficace et prometteuse pour certaines langues peu dotées, dont un nombre suffisant de locuteurs est connecté et actif sur le Web. Le code de la plateforme, le corpus annoté et le tagger sont librement disponibles.
Nous avons précédemment montré qu’il est possible de faire produire des annotations syntaxiques de qualité par des participants à un jeu ayant un but. Nous présentons ici les résultats d’une expérience visant à évaluer leur production sur un corpus plus complexe, en langue de spécialité, en l’occurrence un corpus de textes scientifiques sur l’ADN. Nous déterminons précisément la complexité de ce corpus, puis nous évaluons les annotations en syntaxe de dépendances produites par les joueurs par rapport à une référence mise au point par des experts du domaine.
This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players and regular control of the annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.
We present here the context and results of two surveys (a French one and an international one) concerning Ethics and NLP, which we designed and conducted between June and September 2015. These surveys follow other actions related to raising concern for ethics in our community, including a Journée d’études, a workshop and the Ethics and Big Data Charter. The concern for ethics shows to be quite similar in both surveys, despite a few differences which we present and discuss. The surveys also lead to think there is a growing awareness in the field concerning ethical issues, which translates into a willingness to get involved in ethics-related actions, to debate about the topic and to see ethics be included in major conferences themes. We finally discuss the limits of the surveys and the means of action we consider for the future. The raw data from the two surveys are freely available online.
The authors have written the Ethic and Big Data Charter in collaboration with various agencies, private bodies and associations. This Charter aims at describing any large or complex resources, and in particular language resources, from a legal and ethical viewpoint and ensuring the transparency of the process of creating and distributing such resources. We propose in this article an analysis of the documentation coverage of the most frequently mentioned language resources with regards to the Charter, in order to show the benefit it offers.
We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.
This article presents experiments aiming at mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to FRILEX, a Natural Language Processing (NLP) lexicon based on D ICOVALENCE. The two resources (Lexicon of French Verbs and D ICOVALENCE) were built by linguists, based on very different theories, which makes a direct mapping nearly impossible. We chose to use the examples provided in one of the resource to find implicit links between the two and make them explicit.
This article presents Propa-L, a freely accessible Web service that allows to semantically filter a lexical network. The language resources behind the service are dynamic and created through Games With A Purpose. We show an example of application of this service: the generation of a list of keywords for parental filtering on the Web, but many others can be envisaged. Moreover, the propagation algorithm we present here can be applied to any lexical network, in any language.
This article details work aiming at evaluating the quality of the manual annotation of gene renaming couples in scientific abstracts, which generates sparse annotations. To evaluate these annotations, we compare the results obtained using the commonly advocated inter-annotator agreement coefficients such as S, κ and Ï, the less known R, the weighted coefficients κÏ and α as well as the F-measure and the SER. We analyze to which extent they are relevant for our data. We then study the bias introduced by prevalence by changing the way the contingency table is built. We finally propose an original way to synthesize the results by computing distances between categories, based on the produced annotations.
In this paper, we present an annotation campaign of football (soccer) matches, from a heterogeneous text corpus of both match minutes and video commentary transcripts, in French. The data, annotations and evaluation process are detailed, and the quality of the annotated corpus is discussed. In particular, we propose a new technique to better estimate the annotator agreement when few elements of a text are to be annotated. Based on that, we show how the source medium influenced the process and the quality.
Cet article est une prise de position concernant les plate-formes de type Amazon Mechanical Turk, dont l’utilisation est en plein essor depuis quelques années dans le traitement automatique des langues. Ces plateformes de travail en ligne permettent, selon le discours qui prévaut dans les articles du domaine, de faire développer toutes sortes de ressources linguistiques de qualité, pour un prix imbattable et en un temps très réduit, par des gens pour qui il s’agit d’un passe-temps. Nous allons ici démontrer que la situation est loin d’être aussi idéale, que ce soit sur le plan de la qualité, du prix, du statut des travailleurs ou de l’éthique. Nous rappellerons ensuite les solutions alternatives déjà existantes ou proposées. Notre but est ici double : informer les chercheurs, afin qu’ils fassent leur choix en toute connaissance de cause, et proposer des solutions pratiques et organisationnelles pour améliorer le développement de nouvelles ressources linguistiques en limitant les risques de dérives éthiques et légales, sans que cela se fasse au prix de leur coût ou de leur qualité.
L’objectif des travaux présentés dans cet article est l’évaluation de la qualité d’annotations manuelles de relations de renommage de gènes dans des résumés scientifiques, annotations qui présentent la caractéristique d’être très dispersées. Pour cela, nous avons calculé et comparé les coefficients les plus communément utilisés, entre autres kappa (Cohen, 1960) et pi (Scott, 1955), et avons analysé dans quelle mesure ils sont adaptés à nos données. Nous avons également étudié les différentes pondérations applicables à ces coefficients permettant de calculer le kappa pondéré (Cohen, 1968) et l’alpha (Krippendorff, 1980, 2004). Nous avons ainsi étudié le biais induit par la grande prévalence d’une catégorie et défini un mode de calcul des distances entre catégories reposant sur les annotations réalisées.
In this paper, we introduce the FastKwic (Key Word In Context using FASTR), a new concordancer for French and English that does not require users to learn any particular request language. Built on FASTR, it shows them not only occurrences of the searched term but also of several morphological, morpho-syntactic and syntactic variants (for example, image enhancement, enhancement of image, enhancement of fingerprint image, image texture enhancement). Fastkwic is freely available. It consists of two UTF-8 compliant Perl modules that depend on several external tools and resources : FASTR, TreeTagger, Flemm (for French). Licenses of theses tools and resources permitting, the FastKwic package is nevertheless self-sufficient. FastKwic first modules is for terminological resource compilation. Its input is a list of terms - as required by FASTR. FastKwic second module is for processing concordances. It relies on FASTR again for indexing the input corpus with terms and their variants. Its output is a concordancer: for each term and its variants, the context of occurrence is provided.
La tâche, aujourd’hui considérée comme fondamentale, de reconnaissance d’entités nommées, présente des difficultés spécifiques en matière d’annotation. Nous les précisons ici, en les illustrant par des expériences d’annotation manuelle dans le domaine de la microbiologie. Ces problèmes nous amènent à reposer la question fondamentale de ce que les annotateurs doivent annoter et surtout, pour quoi faire. Nous identifions pour cela les applications nécessitant l’extraction d’entités nommées et, en fonction des besoins de ces applications, nous proposons de définir sémantiquement les éléments à annoter. Nous présentons ensuite un certain nombre de recommandations méthodologiques permettant d’assurer un cadre d’annotation cohérent et évaluable.
La production de lexiques est une activité indispensable mais complexe, qui nécessite, quelle que soit la méthode de création utilisée (acquisition automatique ou manuelle), une validation humaine. Nous proposons dans ce but une plate-forme Web librement disponible, appelée Sylva (Systematic lexicon validator). Cette plate-forme a pour caractéristiques principales de permettre une validation multi-niveaux (par des validateurs, puis un expert) et une traçabilité de la ressource. La tâche de l’expert(e) linguiste en est allégée puisqu’il ne lui reste à considérer que les données sur lesquelles il n’y a pas d’accord inter-validateurs.
PrepLex est un lexique des prépositions du français. Il contient les informations utiles à des systèmes d’analyse syntaxique. Il a été construit en comparant puis fusionnant différentes sources d’informations lexicales disponibles. Ce lexique met également en évidence les prépositions ou classes de prépositions qui apparaissent dans la définition des cadres de sous-catégorisation des ressources lexicales qui décrivent la valence des verbes.