Rodrigo Wilkens


2024

pdf bib
Exploring hybrid approaches to readability: experiments on the complementarity between linguistic features and transformers
Rodrigo Wilkens | Patrick Watrin | Rémi Cardon | Alice Pintard | Isabelle Gribomont | Thomas François
Findings of the Association for Computational Linguistics: EACL 2024

Linguistic features have a strong contribution in the context of the automatic assessment of text readability (ARA). They have been one of the anchors between the computational and theoretical models. With the development in the ARA field, the research moved to Deep Learning (DL). In an attempt to reconcile the mixed results reported in this context, we present a systematic comparison of 6 hybrid approaches along with standard Machine Learning and DL approaches, on 4 corpora (different languages and target audiences). The various experiments clearly highlighted two rather simple hybridization methods (soft label and simple concatenation). They also appear to be the most robust on smaller datasets and across various tasks and languages. This study stands out as the first to systematically compare different architectures and approaches to feature hybridization in DL, as well as comparing performance in terms of two languages and two target audiences of the text, which leads to a clearer pattern of results.

pdf bib
L’impact de genre sur la prédiction de la lisibilité du texte en FLE
Lingyun Gao | Rodrigo Wilkens | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Cet article étudie l’impact du genre discursif sur la prédiction de la lisibilité des textes en français langue étrangère (FLE) à travers l’intégration de méta-informations du genre discursif dans les modèles de prédiction de la lisibilité. En utilisant des architectures neuronales basées sur CamemBERT, nous avons comparé les performances de modèles intégrant l’information de genre à celles d’un modèle de base ne considérant que le texte. Nos résultats révèlent une amélioration modeste de l’exactitude globale lors de l’intégration du genre, avec cependant des variations notables selon les genres spécifiques de textes. Cette observation semble confirmer l’importance de prendre en compte les méta-informations textuelles tel que le genre lors de la conception de modèles de lisibilité et de traiter le genre comme une information riche à laquelle le modèle doit accorder une position préférentielle.

pdf bib
Modéliser la facilité d’écoute en FLE : vaut-il mieux lire la transcription ou écouter le signal vocal ?
Minami Ozawa | Rodrigo Wilkens | Kaori Sugiyama | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Le principal objectif de cette étude est de proposer un modèle capable de prédire automatiquement le niveau de facilité d’écoute de documents audios en français. Les données d’entrainement sont constituées d’enregistrements audios accompagnés de leurs transcriptions et sont issues de manuels de FLE dont le niveau est évalué sur l’échelle du Cadre européen commun de référence (CECR). Nous comparons trois approches différentes : machines à vecteurs de support (SVM) combinant des variables de lisibilité et de fluidité, wav2vec et CamemBERT. Pour identifier le meilleur modèle, nous évaluons l’impact des caractéristiques linguistiques et prosodiques ainsi que du style de parole(dialogue ou monologue) sur les performances. Nos expériences montrent que les variables de fluidité améliorent la précision du modèle et que cette précision est différente par style de parole. Enfin, les performances de tous les modèles varient selon les niveaux du CECR.

pdf bib
TCFLE-8 : un corpus de productions écrites d’apprenants de français langue étrangère et son application à la correction automatisée de textes
Rodrigo Wilkens | Alice Pintard | David Alfter | Vincent Folny | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

La correction automatisée de textes (CAT) vise à évaluer automatiquement la qualité de textes écrits. L’automatisation permet une évaluation à grande échelle ainsi qu’une amélioration de la cohérence, de la fiabilité et de la normalisation du processus. Ces caractéristiques sont particulièrement importantes dans le contexte des examens de certification linguistique. Cependant, un goulot d’étranglement majeur dans le développement des systèmes CAT est la disponibilité des corpus. Dans cet article, nous visons à encourager le développement de systèmes de correction automatique en fournissant le corpus TCFLE-8, un corpus de 6~569 essais collectés dans le contexte de l’examen de certification Test de Connaissance du Français (TCF). Nous décrivons la procédure d’évaluation stricte qui a conduit à la notation de chaque essai par au moins deux évaluateurs selon l’échelle du Cadre européen commun de référence pour les langues (CECR) et à la création d’un corpus équilibré. Nous faisons également progresser les performances de l’état de l’art pour la tâche de CAT en français en expérimentant deux solides modèles de référence.

pdf bib
Exploration d’approches hybrides pour la lisibilité : expériences sur la complémentarité entre les traits linguistiques et les transformers
Rodrigo Wilkens | Patrick Watrin | Rémi Cardon | Alice Pintard | Isabelle Gribomont | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

Les architectures d’apprentissage automatique reposant sur la définition de traits linguistiques ont connu un succès important dans le domaine de l’évaluation automatique de la lisibilité des textes (ARA) et ont permis de faire se rencontrer informatique et théorie psycholinguistique. Toutefois, les récents développements se sont tournés vers l’apprentissage profond et les réseaux de neurones. Dans cet article, nous cherchons à réconcilier les deux approches. Nous présentons une comparaison systématique de 6 architectures hybrides (appliquées à plusieurs langues et publics) que nous comparons à ces deux approches concurrentes. Les diverses expériences réalisées ont clairement mis en évidence deux méthodes d’hybridation : Soft-Labeling et concaténation simple. Ces deux architectures sont également plus efficaces lorsque les données d’entraînement sont réduites. Cette étude est la première à comparer systématiquement différentes architectures hybrides et à étudier leurs performances dans plusieurs tâches de lisibilité.

pdf bib
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024
Rodrigo Wilkens | Rémi Cardon | Amalia Todirascu | Núria Gala
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

pdf bib
Paying attention to the words: explaining readability prediction for French as a foreign language
Rodrigo Wilkens | Patrick Watrin | Thomas François
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

Automatic text Readability Assessment (ARA) has been seen as a way of helping people with reading difficulties. Recent advancements in Natural Language Processing have shifted ARA from linguistic-based models to more precise black-box models. However, this shift has weakened the alignment between ARA models and the reading literature, potentially leading to inaccurate predictions based on unintended factors. In this paper, we investigate the explainability of ARA models, inspecting the relationship between attention mechanism scores, ARA features, and CEFR level predictions made by the model. We propose a method for identifying features associated with the predictions made by a model through the use of the attention mechanism. Exploring three feature families (i.e., psycho-linguistic, work frequency and graded lexicon), we associated features with the model’s attention heads. Finally, while not fully explanatory of the model’s performance, the correlations of these associations surpass those between features and text readability levels.

2023

pdf bib
Statistical Methods for Annotation Analysis
Rodrigo Wilkens
Computational Linguistics, Volume 49, Issue 3 - September 2023

pdf bib
TCFLE-8: a Corpus of Learner Written Productions for French as a Foreign Language and its Application to Automated Essay Scoring
Rodrigo Wilkens | Alice Pintard | David Alfter | Vincent Folny | Thomas François
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automated Essay Scoring (AES) aims to automatically assess the quality of essays. Automation enables large-scale assessment, improvements in consistency, reliability, and standardization. Those characteristics are of particular relevance in the context of language certification exams. However, a major bottleneck in the development of AES systems is the availability of corpora, which, unfortunately, are scarce, especially for languages other than English. In this paper, we aim to foster the development of AES for French by providing the TCFLE-8 corpus, a corpus of 6.5k essays collected in the context of the Test de Connaissance du Français (TCF - French Knowledge Test) certification exam. We report the strict quality procedure that led to the scoring of each essay by at least two raters according to the CEFR levels and to the creation of a balanced corpus. In addition, we describe how linguistic properties of the essays relate to the learners’ proficiency in TCFLE-8. We also advance the state-of-the-art performance for the AES task in French by experimenting with two strong baselines (i.e. RoBERTa and feature-based). Finally, we discuss the challenges of AES using TCFLE-8.

pdf bib
Annotation Linguistique pour l’Évaluation de la Simplification Automatique de Textes
Rémi Cardon | Adrien Bibal | Rodrigo Wilkens | David Alfter | Magali Norré | Adeline Müller | Patrick Watrin | Thomas François
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

L’évaluation des systèmes de simplification automatique de textes (SAT) est une tâche difficile, accomplie à l’aide de métriques automatiques et du jugement humain. Cependant, d’un point de vue linguistique, savoir ce qui est concrètement évalué n’est pas clair. Nous proposons d’annoter un des corpus de référence pour la SAT, ASSET, que nous utilisons pour éclaircir cette question. En plus de la contribution que constitue la ressource annotée, nous montrons comment elle peut être utilisée pour analyser le comportement de SARI, la mesure d’évaluation la plus populaire en SAT. Nous présentons nos conclusions comme une étape pour améliorer les protocoles d’évaluation en SAT à l’avenir.

2022

pdf bib
Is Attention Explanation? An Introduction to the Debate
Adrien Bibal | Rémi Cardon | David Alfter | Rodrigo Wilkens | Xiaoou Wang | Thomas François | Patrick Watrin
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The performance of deep learning models in NLP and other fields of machine learning has led to a rise in their popularity, and so the need for explanations of these models becomes paramount. Attention has been seen as a solution to increase performance, while providing some explanations. However, a debate has started to cast doubt on the explanatory power of attention in neural networks. Although the debate has created a vast literature thanks to contributions from various areas, the lack of communication is becoming more and more tangible. In this paper, we provide a clear overview of the insights on the debate by critically confronting works from these different areas. This holistic vision can be of great interest for future works in all the communities concerned by this debate. We sum up the main challenges spotted in these areas, and we conclude by discussing the most promising future avenues on attention as an explanation.

pdf bib
Linguistic Corpus Annotation for Automatic Text Simplification Evaluation
Rémi Cardon | Adrien Bibal | Rodrigo Wilkens | David Alfter | Magali Norré | Adeline Müller | Watrin Patrick | Thomas François
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Evaluating automatic text simplification (ATS) systems is a difficult task that is either performed by automatic metrics or user-based evaluations. However, from a linguistic point-of-view, it is not always clear on what bases these evaluations operate. In this paper, we propose annotations of the ASSET corpus that can be used to shed more light on ATS evaluation. In addition to contributing with this resource, we show how it can be used to analyze SARI’s behavior and to re-evaluate existing ATS systems. We present our insights as a step to improve ATS evaluation protocols in the future.

pdf bib
L’Attention est-elle de l’Explication ? Une Introduction au Débat (Is Attention Explanation ? An Introduction to the Debate )
Adrien Bibal | Remi Cardon | David Alfter | Rodrigo Wilkens | Xiaoou Wang | Thomas François | Patrick Watrin
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous présentons un résumé en français et un résumé en anglais de l’article Is Attention Explanation ? An Introduction to the Debate (Bibal et al., 2022), publié dans les actes de la conférence 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).

pdf bib
FABRA: French Aggregator-Based Readability Assessment toolkit
Rodrigo Wilkens | David Alfter | Xiaoou Wang | Alice Pintard | Anaïs Tack | Kevin P. Yancey | Thomas François
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present the FABRA: readability toolkit based on the aggregation of a large number of readability predictor variables. The toolkit is implemented as a service-oriented architecture, which obviates the need for installation, and simplifies its integration into other projects. We also perform a set of experiments to show which features are most predictive on two different corpora, and how the use of aggregators improves performance over standard feature-based readability prediction. Our experiments show that, for the explored corpora, the most important predictors for native texts are measures of lexical diversity, dependency counts and text coherence, while the most important predictors for foreign texts are syntactic variables illustrating language development, as well as features linked to lexical sophistication. FABRA: have the potential to support new research on readability assessment for French.

pdf bib
HECTOR: A Hybrid TExt SimplifiCation TOol for Raw Texts in French
Amalia Todirascu | Rodrigo Wilkens | Eva Rolin | Thomas François | Delphine Bernhard | Núria Gala
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Reducing the complexity of texts by applying an Automatic Text Simplification (ATS) system has been sparking interest inthe area of Natural Language Processing (NLP) for several years and a number of methods and evaluation campaigns haveemerged targeting lexical and syntactic transformations. In recent years, several studies exploit deep learning techniques basedon very large comparable corpora. Yet the lack of large amounts of corpora (original-simplified) for French has been hinderingthe development of an ATS tool for this language. In this paper, we present our system, which is based on a combination ofmethods relying on word embeddings for lexical simplification and rule-based strategies for syntax and discourse adaptations. We present an evaluation of the lexical, syntactic and discourse-level simplifications according to automatic and humanevaluations. We discuss the performances of our system at the lexical, syntactic, and discourse levels

pdf bib
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference
Rodrigo Wilkens | David Alfter | Rémi Cardon | Núria Gala
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference

pdf bib
MWE for Essay Scoring English as a Foreign Language
Rodrigo Wilkens | Daiane Seibert | Xiaoou Wang | Thomas François
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference

Mastering a foreign language like English can bring better opportunities. In this context, although multiword expressions (MWE) are associated with proficiency, they are usually neglected in the works of automatic scoring language learners. Therefore, we study MWE-based features (i.e., occurrence and concreteness) in this work, aiming at assessing their relevance for automated essay scoring. To achieve this goal, we also compare MWE features with other classic features, such as length-based, graded resource, orthographic neighbors, part-of-speech, morphology, dependency relations, verb tense, language development, and coherence. Although the results indicate that classic features are more significant than MWE for automatic scoring, we observed encouraging results when looking at the MWE concreteness through the levels.

pdf bib
CENTAL at TSAR-2022 Shared Task: How Does Context Impact BERT-Generated Substitutions for Lexical Simplification?
Rodrigo Wilkens | David Alfter | Rémi Cardon | Isabelle Gribomont | Adrien Bibal | Watrin Patrick | Marie-Catherine De marneffe | Thomas François
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

Lexical simplification is the task of substituting a difficult word with a simpler equivalent for a target audience. This is currently commonly done by modeling lexical complexity on a continuous scale to identify simpler alternatives to difficult words. In the TSAR shared task, the organizers call for systems capable of generating substitutions in a zero-shot-task context, for English, Spanish and Portuguese. In this paper, we present the solution we (the cental team) proposed for the task. We explore the ability of BERT-like models to generate substitution words by masking the difficult word. To do so, we investigate various context enhancement strategies, that we combined into an ensemble method. We also explore different substitution ranking methods. We report on a post-submission analysis of the results and present our insights for potential improvements. The code for all our experiments is available at https://gitlab.com/Cental-FR/cental-tsar2022.

2020

pdf bib
Un corpus d’évaluation pour un système de simplification discursive (An Evaluation Corpus for Automatic Discourse Simplification)
Rodrigo Wilkens | Amalia Todirascu
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Nous présentons un nouveau corpus simplifié, disponible en français pour l’évaluation d’un système de simplification discursive. Ce système utilise des chaînes de référence pour simplifier et pour préserver la cohésion textuelle après simplification. Nous présentons la méthodologie de collecte de corpus (via un formulaire, qui recueille les simplifications manuelles faites par des participants experts), les règles présentées dans le guide, une analyse des types de simplifications et une évaluation de notre corpus, par comparaison avec la sortie du système de simplification automatique.

pdf bib
French Coreference for Spoken and Written Language
Rodrigo Wilkens | Bruno Oberle | Frédéric Landragin | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

Coreference resolution aims at identifying and grouping all mentions referring to the same entity. In French, most systems run different setups, making their comparison difficult. In this paper, we present an extensive comparison of several coreference resolution systems for French. The systems have been trained on two corpora (ANCOR for spoken language and Democrat for written language) annotated with coreference chains, and augmented with syntactic and semantic information. The models are compared with different configurations (e.g. with and without singletons). In addition, we evaluate mention detection and coreference resolution apart. We present a full-stack model that outperforms other approaches. This model allows us to study the impact of mention detection errors on coreference resolution. Our analysis shows that mention detection can be improved by focusing on boundary identification while advances in the pronoun-noun relation detection can help the coreference task. Another contribution of this work is the first end-to-end neural French coreference resolution model trained on Democrat (written texts), which compares to the state-of-the-art systems for oral French.

pdf bib
Simplifying Coreference Chains for Dyslexic Children
Rodrigo Wilkens | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a work aiming to generate adapted content for dyslexic children for French, in the context of the ALECTOR project. Thus, we developed a system to transform the texts at the discourse level. This system modifies the coreference chains, which are markers of text cohesion, by using rules. These rules were designed following a careful study of coreference chains in both original texts and its simplified versions. Moreover, in order to define reliable transformation rules, we analysed several coreference properties as well as the concurrent simplification operations in the aligned texts. This information is coded together with a coreference resolution system and a text rewritten tool in the proposed system, which comprise a coreference module specialised in written text and seven text transformation operations. The evaluation of the system firstly focused on check the simplification by manual validation of three judges. These errors were grouped into five classes that combined can explain 93% of the errors. The second evaluation step consisted of measuring the simplification perception by 23 judges, which allow us to measure the simplification impact of the proposed rules.

pdf bib
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)
Núria Gala | Rodrigo Wilkens
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

pdf bib
Coreference-Based Text Simplification
Rodrigo Wilkens | Bruno Oberle | Amalia Todirascu
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Text simplification aims at adapting documents to make them easier to read by a given audience. Usually, simplification systems consider only lexical and syntactic levels, and, moreover, are often evaluated at the sentence level. Thus, studies on the impact of simplification in text cohesion are lacking. Some works add coreference resolution in their pipeline to address this issue. In this paper, we move forward in this direction and present a rule-based system for automatic text simplification, aiming at adapting French texts for dyslexic children. The architecture of our system takes into account not only lexical and syntactic but also discourse information, based on coreference chains. Our system has been manually evaluated in terms of grammaticality and cohesion. We have also built and used an evaluation corpus containing multiple simplification references for each sentence. It has been annotated by experts following a set of simplification guidelines, and can be used to run automatic evaluation of other simplification systems. Both the system and the evaluation corpus are freely available.

2018

pdf bib
Investigating Productive and Receptive Knowledge: A Profile for Second Language Learning
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the 27th International Conference on Computational Linguistics

The literature frequently addresses the differences in receptive and productive vocabulary, but grammar is often left unacknowledged in second language acquisition studies. In this paper, we used two corpora to investigate the divergences in the behavior of pedagogically relevant grammatical structures in reception and production texts. We further improved the divergence scores observed in this investigation by setting a polarity to them that indicates whether there is overuse or underuse of a grammatical structure by language learners. This led to the compilation of a language profile that was later combined with vocabulary and readability features for classifying reception and production texts in three classes: beginner, intermediate, and advanced. The results of the automatic classification task in both production (0.872 of F-measure) and reception (0.942 of F-measure) were comparable to the current state of the art. We also attempted to automatically attribute a score to texts produced by learners, and the correlation results were encouraging, but there is still a good amount of room for improvement in this task. The developed language profile will serve as input for a system that helps language learners to activate more of their passive knowledge in writing texts.

pdf bib
SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning
Rodrigo Wilkens | Leonardo Zilio | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
An SLA Corpus Annotated with Pedagogically Relevant Grammatical Structures
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Jorge A. Wagner Filho | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Similarity Measures for the Detection of Clinical Conditions with Verbal Fluency Tasks
Felipe Paula | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, like Dementia. In particular, given a sequence of semantically related words, a large number of switches from one semantic class to another has been linked to clinical conditions. In this work, we investigate three similarity measures for automatically identifying switches in semantic chains: semantic similarity from a manually constructed resource, and word association strength and semantic relatedness, both calculated from corpora. This information is used for building classifiers to distinguish healthy controls from clinical cases with early stages of Alzheimer’s Disease and Mild Cognitive Deficits. The overall results indicate that for clinical conditions the classifiers that use these similarity measures outperform those that use a gold standard taxonomy.

2017

pdf bib
Using NLP for Enhancing Second Language Acquisition
Leonardo Zilio | Rodrigo Wilkens | Cédrick Fairon
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This study presents SMILLE, a system that draws on the Noticing Hypothesis and on input enhancements, addressing the lack of salience of grammatical infor mation in online documents chosen by a given user. By means of input enhancements, the system can draw the user’s attention to grammar, which could possibly lead to a higher intake per input ratio for metalinguistic information. The system receives as input an online document and submits it to a combined processing of parser and hand-written rules for detecting its grammatical structures. The input text can be freely chosen by the user, providing a more engaging experience and reflecting the user’s interests. The system can enhance a total of 107 fine-grained types of grammatical structures that are based on the CEFR. An evaluation of some of those structures resulted in an overall precision of 87%.

pdf bib
LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds
Rodrigo Wilkens | Leonardo Zilio | Silvio Ricardo Cordeiro | Felipe Paula | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

pdf bib
Multiword Expressions in Child Language
Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of this work is to introduce CHILDES-MWE, which contains English CHILDES corpora automatically annotated with Multiword Expressions (MWEs) information. The result is a resource with almost 350,000 sentences annotated with more than 70,000 distinct MWEs of various types from both longitudinal and latitudinal corpora. This resource can be used for large scale language acquisition studies of how MWEs feature in child language. Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages. Moreover, using additional latitudinal data, we investigate if there are further differences in CN usage and in compositionality preferences. The results obtained for the child-produced sentences reflect CN distribution and compositionality in child-directed sentences.

pdf bib
B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources such as WordNet are useful for NLP applications, but their manual construction consumes time and personnel, and frequently results in low coverage. One alternative is the automatic construction of large resources from corpora like distributional thesauri, containing semantically associated words. However, as they may contain noise, there is a strong need for automatic ways of evaluating the quality of the resulting resource. This paper introduces a gold standard that can aid in this task. The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed based on BabelNet and partly evaluated by human judges. It consists of sets of tests that present one target word, one related word and three unrelated words. B2SG contains 2,875 validated relations: 800 for verbs and 2,075 for nouns; these relations are divided among synonymy, antonymy and hypernymy. They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them. As a case study two distributional thesauri were also developed: one using surface forms from a large (1.5 billion word) corpus and the other using lemmatized forms from a smaller (409 million word) corpus. Both distributional thesauri were then evaluated against B2SG, and the one using lemmatized forms performed slightly better.

pdf bib
Automatic Construction of Large Readability Corpora
Jorge Alberto Wagner Filho | Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features results in even better results, in most cases. For English, shallow features also perform well as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the results for two and those for three-classes, especially regarding the intermediary class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.

2015

pdf bib
Distributional Thesauri for Portuguese: methodology evaluation
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Gabriel Gonçalves | Aline Villavicencio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2012

pdf bib
An annotated English child language database
Aline Villavicencio | Beracah Yankama | Rodrigo Wilkens | Marco Idiart | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf bib
Searching the Annotated Portuguese Childes Corpora
Rodrigo Wilkens
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf bib
I say have you say tem: profiling verbs in children data in English and Portuguese
Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

2010

pdf bib
COMUNICA - A Question Answering System for Brazilian Portuguese
Rodrigo Wilkens | Aline Villavicencio | Daniel Muller | Leandro Wives | Fabio Silva | Stanley Loh
Coling 2010: Demonstrations