Jorge Baptista
2026
From Complexity Scores to Readable Texts: iRead4Skills for Adult Literacy in Portuguese
Jorge Baptista | Eugénio Ribeiro | Nuno Mamede | David Antunes | Raquel Amaro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
Adult Learning (AL) programmes need short, trustworthy texts that match learners’ reading abilities, but educators rarely have time, tools, or evidence-based guidelines to select and adapt materials consistently. We present a live demo of iRead4Skills for European Portuguese: a web-based system that (i) estimates readability/complexity for AL-oriented levels aligned with CEFR, (ii) highlights where complexity concentrates (lexical, grammatical, semantic), and (iii) supports rewriting by offering actionable, level-aware suggestions and curated lexical resources. The demo emphasises transparency and “trainer-first” workflows: users see *why* a text is complex and *how* to revise it without losing meaning.
Portho: A Corpus-Based Resource of Orthographic Neighbors in European Portuguese
Eugénio Ribeiro | David Antunes | Nuno Mamede | Jorge Baptista
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Orthographic neighbors (ONs) play a central role in models of visual word recognition and have been shown to influence reading speed, lexical access, and literacy development. Despite their importance, resources providing detailed and flexible ON information remain scarce for European Portuguese. This paper introduces Portho, a corpus-based lexical resource that provides multiple ON metrics for over 43,000 word forms, using several ON definitions. In addition to classical neighborhood size measures, Portho provides frequency-based statistics and graded orthographic distance (OD) features. We analyze the statistical properties of the resource and evaluate its empirical utility in automatic text complexity assessment using the iRead4Skills corpus. Results show that while ON features alone are insufficient to predict readability, they contribute complementary information and compare favorably with existing resources for Portuguese. Portho is made publicly available in different formats to support research in psycholinguistics, readability modeling, and Natural Language Processing (NLP) for Portuguese.
A Lexicon-Grammar of Brazilian Portuguese Predicative Adjectives
Ryan Martinez | Jorge Baptista | Oto Araújo Vale
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
This paper presents a syntactic lexicon of Brazilian Portuguese predicative adjectives that are not regularly derived from verbs. From the 7,000 most frequent adjectives in a large web corpus, 3,161 lexical items were selected and annotated with 36 syntactic properties. These properties were established through introspection and corpus evidence, covering argument structure, copular verbs, prepositions, transformations (e.g., raising, nominalization), semantic roles, and others. The resulting resource constitutes a machine-readable lexicon of predicative adjectives for Brazilian Portuguese.
Lexicon-Grammar Web
Jorge Baptista | David Antunes | Nuno Mamede | Eugénio Ribeiro
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 2
This demo showcases a web-based interface that provides open, interactive access to a large-scale grammatical database of European Portuguese verbal constructions. Through a unified search and exploration environment, users can query, inspect, and compare more than 7,000 distributionally free verbal constructions and over 2,700 verbal idioms (frozen constructions), grounded in long-standing Lexicon–Grammar descriptions. For each construction, the interface exposes core linguistic properties such as argument structure, distributional constraints, semantic roles, major syntactic transformations, and curated usage examples with English translations. The demo illustrates how detailed, manually validated grammatical knowledge can be explored dynamically via the web, supporting linguistic research, language teaching, and NLP development. To the best of our knowledge, this is the largest publicly accessible, web-based grammatical resource dedicated to European Portuguese verbal constructions.
Semantic Representation of Relative Clauses in Lexicalized Abstract Meaning Representation
Jorge Baptista | Sónia Reis
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
This paper analyzes the semantic parsing of relative clauses in Portuguese in two meaning representation frameworks: Abstract Meaning Representation (AMR) and Lexicalized Meaning Representation (LMR). While both treat relatives as noun modifiers, AMR fails to distinguish restrictive from appositive clauses, an important distinction in traditional grammar. We argue for explicitly encoding this difference. The study draws on annotated translations of *The Little Prince* (Saint-Exupéry, 1943) in Brazilian and European Portuguese, highlighting issues in the Brazilian AMR annotations.
2025
The iRead4Skills Intelligent Complexity Analyzer
Wafa Aissa | Raquel Amaro | David Antunes | Thibault Bañeras-Roux | Jorge Baptista | Alejandro Catala | Luís Correia | Thomas François | Marcos Garcia | Mario Izquierdo-Álvarez | Nuno Mamede | Vasco Martins | Miguel Neves | Eugénio Ribeiro | Sandra Rodriguez Rey | Elodie Vanzeveren
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present the iRead4Skills Intelligent Complexity Analyzer, an open-access platform specifically designed to assist educators and content developers in addressing the needs of low-literacy adults by analyzing and diagnosing text complexity. This multilingual system integrates a range of Natural Language Processing (NLP) components to assess input texts along multiple levels of granularity and linguistic dimensions in Portuguese, Spanish, and French. It assigns four tailored difficulty levels using state-of-the-art models, and introduces four diagnostic yardsticks—textual structure, lexicon, syntax, and semantics—offering users actionable feedback on specific dimensions of textual complexity. Each component of the system is supported by experiments comparing alternative models on manually annotated data.
A European Portuguese corpus annotated for verbal idioms
David Antunes | Jorge Baptista | Nuno J. Mamede
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
This paper presents the construction of VIDiom-PT, a corpus in European Portuguese annotated for verbal idioms (e.g. O Rui bateu a bota, lit.: Rui hit the boot ‘Rui died’). This linguistic resource aims to support the development of systems capable of processing such constructions in this language variety. To assist in the annotation effort, two tools were built. The first detects possible instances of verbal idioms in texts, while the second provides a graphical interface for annotating them. This effort culminated in the annotation of a total of 5,178 instances of 747 different verbal idioms in more than 200,000 sentences in European Portuguese. Inter-annotator agreement was high (Krippendorff’s alpha for nominal data of 0.869), measured on 5% of the data independently annotated by 3 experts. Part of the annotated corpus is also made publicly available.
2024
Charting the Linguistic Landscape of Developing Writers: An Annotation Scheme for Enhancing Native Language Proficiency
Miguel Da Corte | Jorge Baptista
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This study describes a pilot annotation task designed to capture orthographic, grammatical, lexical, semantic, and discursive patterns exhibited by college native English speakers participating in developmental education (DevEd) courses. The paper introduces an annotation scheme developed by two linguists aiming at pinpointing linguistic challenges that hinder effective written communication. The scheme builds upon patterns supported by the literature, which are known as predictors of student placement in DevEd courses and English proficiency levels. Other novel, multilayered, linguistic aspects that the literature has not yet explored are also presented. The scheme and its primary categories are succinctly presented and justified. Two trained annotators used this scheme to annotate a sample of 103 text units (3 during the training phase and 100 during the annotation task proper). Texts were randomly selected from a population of 290 community college intending students. An in-depth quality assurance inspection was conducted to assess tagging consistency between annotators and to discern (and address) annotation inaccuracies. Krippendorff’s Alpha (K-alpha) interrater reliability coefficients were calculated, revealing a K-alpha score of k=0.40, which corresponds to a moderate level of agreement, deemed adequate for the complexity and length of the annotation task.
Automatic Text Readability Assessment in European Portuguese
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Text Readability Assessment in European Portuguese: A Comparison of Classification and Regression Approaches
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Lexicalized Meaning Representation (LMR)
Jorge Baptista | Sónia Reis | João Dias | Pedro Santos
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024
This paper presents an adaptation of the Abstract Meaning Representation (AMR) framework for European Portuguese. This adaptation, referred to as Lexicalized Meaning Representation (LMR), was deemed necessary to address specific challenges posed by the grammar of the language, as well as various linguistic issues raised by the current version of AMR annotation guidelines. Some of these aspects stemmed from the use of a notation similar to AMR to represent real texts from the legal domain, enabling its use in Natural Language Processing (NLP) applications. In this context, several aspects of AMR were significantly simplified (e.g., the representation of multi-word expressions, named entities, and temporal expressions), while others were introduced, with efforts made to maintain the representation scheme as compatible as possible with standard AMR notation.
Enhancing Writing Proficiency Classification in Developmental Education: The Quest for Accuracy
Miguel Da Corte | Jorge Baptista
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Developmental Education (DevEd) courses align students’ college-readiness skills with higher education literacy demands. These courses often use automated assessment tools like Accuplacer for student placement. Existing literature raises concerns about these exams’ accuracy and placement precision due to their narrow representation of the writing process. These concerns warrant further attention within the domain of automatic placement systems, particularly in the establishment of a reference corpus of annotated essays for these systems’ machine/deep learning. This study aims at an enhanced annotation procedure to assess college students’ writing patterns more accurately. It examines the efficacy of machine-learning-based DevEd placement, contrasting Accuplacer’s classification of 100 college-intending students’ essays into two levels (Level 1 and 2) against that of 6 human raters. The classification task encompassed the assessment of the 6 textual criteria currently used by Accuplacer: mechanical conventions, sentence variety & style, idea development & support, organization & structure, purpose & focus, and critical thinking. Results revealed low inter-rater agreement, both on the individual criteria and the overall classification, suggesting human assessment of writing proficiency can be inconsistent in this context. To achieve a more accurate determination of writing proficiency and improve DevEd placement, more robust classification methods are thus required.
Exploring the Automated Scoring of Narrative Essays in Brazilian Portuguese using Transformer Models
Eugénio Ribeiro | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
Complementos de eco de adjetivos com completiva-sujeito em português do Brasil (Echo complements of adjectives with subject complement clauses in Brazilian Portuguese)
Ryan Saldanha Martinez | Jorge Baptista | Oto Araújo Vale
Proceedings of the 15th Brazilian Symposium in Information and Human Language Technology
Support Verb Constructions in Medieval Portuguese: Evidence from the CTA Corpus
Maria Inês Bico | Esperança Cardeira | Jorge Baptista | Fernando Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
Hurdles in Parsing Multi-word Adverbs: Examples from Portuguese
Izabela Müller | Nuno Mamede | Jorge Baptista
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Towards a Syntactic Lexicon of Brazilian Portuguese Adjectives
Ryan Martinez | Jorge Baptista | Oto Vale
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
The Role of Adverbs in Language Variety Identification: The Case of Portuguese Multi-Word Adverbs
Izabela Müller | Nuno Mamede | Jorge Baptista
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
This paper aims to assess the role of multiword compound adverbs in distinguishing Brazilian Portuguese (PT-BR) from European Portuguese (PT-PT). Two key factors underpin this focus. Firstly, multiword expressions are often less ambiguous than single words, even when their meaning is idiomatic (non-compositional). Secondly, despite constituting a significant portion of lexicons in many languages, they are frequently overlooked in Natural Language Processing, possibly due to their heterogeneous nature and lexical range. For this study, a large lexicon of Portuguese multiword adverbs (3,665) annotated with diatopic information regarding language variety was utilized. The paper investigates the distribution of this category in a corpus consisting of excerpts from journalistic texts sourced from the DSL (Dialect and Similar Language) corpus, representing Brazilian (PT-BR) and European Portuguese (PT-PT), respectively, each partition containing 18,000 sentences. Results indicate a substantial similarity between the two varieties, with a considerable overlap in the lexicon of multiword adverbs. Additionally, specific adverbs unique to each language variety were identified. Lexical entries recognized in the corpus represent 18.2% (PT-BR) to 19.5% (PT-PT) of the lexicon, with approximately 5,700 matches in each partition. While many of the matches are spurious due to ambiguity with otherwise non-idiomatic, free strings, occurrences of adverbs marked as exclusive to one variety in texts from the other variety are rare.
2022
Support Verb Constructions across the Ocean Sea
Jorge Baptista | Nuno Mamede | Sónia Reis
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022
This paper analyses the support (or light) verb constructions (SVC) in a publicly available, manually annotated corpus of multiword expressions (MWE) in Brazilian Portuguese. The paper highlights several issues in the linguistic definitions therein adopted for these types of MWE, and reports the results from applying STRING, a rule-based parsing system, originally developed for European Portuguese, to this corpus from Brazilian Portuguese. The goal is two-fold: to improve the linguistic definition of SVC in the annotation task, as well as to gauge the major difficulties found when transposing linguistic resources between these two varieties of the same language.
2021
Provérbios portugueses usuais: distribuição em corpora (Common Portuguese proverbs: distribution in corpora)
Sónia Reis | Jorge Baptista | Nuno Mamede
Proceedings of the 13th Brazilian Symposium in Information and Human Language Technology
2017
Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
Os Provérbios em manuais de ensino de Português Língua Não Materna (The Proverbs of teaching manuals in Non-Native Portuguese) [In Portuguese]
Sónia Reis | Jorge Baptista
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology
2016
metaTED: a Corpus of Metadiscourse for Spoken Language
Rui Correia | Nuno Mamede | Jorge Baptista | Maxine Eskenazi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper describes metaTED, a freely available corpus of metadiscursive acts in spoken language collected via crowdsourcing. Metadiscursive acts were annotated on a set of 180 randomly chosen TED talks in English, spanning different speakers and topics. The taxonomy used for annotation is composed of 16 categories, adapted from Ädel (2010). This adaptation takes into account both the material to annotate and the setting in which the annotation task is performed. The crowdsourcing setup is described, including considerations regarding training and quality control. The collected data is evaluated in terms of quantity of occurrences, inter-annotator agreement, and annotation-related measures (such as average time on task and self-reported confidence). Results show different levels of agreement among metadiscourse acts (α ∈ [0.15; 0.49]). To further assess the collected material, a subset of the annotations was submitted to experts, who validated which of the marked occurrences truly correspond to instances of the metadiscursive act at hand. As with the crowd, experts showed different levels of agreement between categories (α ∈ [0.18; 0.72]). The paper concludes with a discussion on the applicability of metaTED with respect to each of the 16 categories of metadiscourse.
2015
Novo dicionário de formas flexionadas do Unitex-PB: avaliação da flexão verbal (New Dictionary of Inflected forms of UNITEX-PB: Evaluation of Verbal Inflection)
Oto A. Vale | Jorge Baptista
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology
Integrating support verb constructions into a parser
Amanda Rassi | Jorge Baptista | Nuno Mamede | Oto Vale
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology
2014
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
Jorge Baptista | Pushpak Bhattacharyya | Christiane Fellbaum | Mikel Forcada | Chu-Ren Huang | Svetla Koeva | Cvetana Krstev | Eric Laporte
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
The fuzzy boundaries of operator verb and support verb constructions with dar “give” and ter “have” in Brazilian Portuguese
Amanda Rassi | Cristina Santos-Turati | Jorge Baptista | Nuno Mamede | Oto Vale
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
2007
Spanish Adverbial Frozen Expressions
Dolors Català | Jorge Baptista
Proceedings of the Workshop on A Broader Perspective on Multiword Expressions
2004
Frozen Sentences of Portuguese: Formal Descriptions for NLP
Jorge Baptista | Anabela Correia | Graça Fernandes
Proceedings of the Workshop on Multiword Expressions: Integrating Processing
1999
Co-authors
- Nuno Mamede 15
- Eugénio Ribeiro 7
- David Antunes 5
- Sónia Reis 5
- Oto Vale 4
- Raquel Amaro 2
- Miguel Da Corte 2
- Ryan Martinez 2
- Izabela Müller 2
- Amanda Rassi 2
- Oto Araújo Vale 2
- Wafa Aissa 1
- Fernando Baptista 1
- Thibault Bañeras-Roux 1
- Pushpak Bhattacharyya 1
- Maria Inês Bico 1
- Esperança Cardeira 1
- Alejandro Catala 1
- Dolors Català 1
- Rui Correia 1
- Anabela Correia 1
- Luís Correia 1
- João Dias 1
- Maxine Eskenazi 1
- Christiane Fellbaum 1
- Graça Fernandes 1
- Mikel L. Forcada 1
- Thomas François 1
- Marcos Garcia 1
- Helena Gomez 1
- Chu-Ren Huang 1
- Mario Izquierdo-Álvarez 1
- Svetla Koeva 1
- Cvetana Krstev 1
- Eric Laporte 1
- Ilia Markov 1
- Ryan Saldanha Martinez 1
- Vasco Martins 1
- Cristina Mota 1
- Miguel Neves 1
- David Pinto 1
- Elisabete Ranchhod 1
- Sandra Rodriguez Rey 1
- Pedro Santos 1
- Cristina Santos-Turati 1
- Grigori Sidorov 1
- Elodie Vanzeveren 1