Maria José B. Finatto

Also published as: Maria José Bocorny Finatto, Maria José Finatto, Maria Finatto, Maria Jose Bocorny Finatto


2026

While most essential medicines have become widely accessible across all social strata in Brazil due to government initiatives and market shifts, a significant barrier remains: the technical complexity of medication leaflets. This pragmatic and linguistic gap hinders patient comprehension of critical risks and benefits. Adapting these texts into plain language is therefore crucial for patient safety and treatment adherence. Large language models have proven increasingly effective as practical solutions for text simplification, an important Natural Language Processing (NLP) task that serves as a basis for several other linguistic and computational tasks. However, the scarcity of annotated datasets remains a bottleneck for rigorous evaluation. To bridge this gap, we propose a streamlined pipeline for generating simplified medical leaflets and introduce an initial benchmark dataset of 30 expertly annotated samples. Our results, supported by semantic and morphosyntactic evaluations, demonstrate that the proposed method produces high-quality, simplified content suitable for health applications.
This paper analyzes the performance of several terminology extraction methods when confronted with historical specialized texts that do not conform to modern orthographic norms. We tested two extraction methods based on linguistic patterns, four prompt-based generative artificial intelligence (GenAI) models, and one BERT-like model. Some of these models were fine-tuned for terminology extraction, and one is specialized in extracting medical terms from documents written in Portuguese. For the GenAI models, we tested four different prompting strategies. As a test set, we used chapter fifteen of the second part of the book Aviso à Gente do Mar sobre a sua Saude [Advice to Sea People about their Health], originally written in French by G. Mauran at the end of the 18th century and translated and adapted to Portuguese in 1794. The chapter was annotated with terminology, and the evaluation was conducted independently in terms of both F-measure and pure precision, to observe whether the automatic extraction methods could complement the manual token-based annotation. Results show that using automatic extraction methods to complement the manual annotation can improve coverage and that, although individual models do not achieve high extraction quality, combining two or more models yields a recall of more than 90% on the test data.
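The recall gain from combining extractors can be sketched as a simple union of each model's output. This is a minimal illustration of the idea, not the paper's implementation; the gold terms and per-model outputs below are invented for the example.

```python
def recall(extracted, gold):
    """Fraction of gold-standard terms recovered by an extraction."""
    return len(extracted & gold) / len(gold)

# Hypothetical gold annotation and per-model term sets (illustrative only).
gold = {"escorbuto", "febre", "sangria", "purgante", "tisana"}
model_a = {"escorbuto", "febre", "navio"}        # linguistic patterns
model_b = {"sangria", "purgante", "febre"}       # fine-tuned BERT-like model
model_c = {"tisana", "escorbuto", "marinheiro"}  # prompt-based GenAI

combined = model_a | model_b | model_c  # union of all extractions
print(recall(model_a, gold))   # individual model: 0.4
print(recall(combined, gold))  # combined models: 1.0
```

The union trades precision for recall: false positives accumulate across models, which is why the paper evaluates precision separately.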
Document simplification has recently attracted increasing attention due to its broader practical applicability compared to sentence-level simplification. Beyond simplifying individual sentences, this task involves preserving fluency, conciseness, and coherence across the entire text, often incorporating summarization techniques. Despite its importance, research in this area remains largely concentrated on a few languages, particularly English. In this work, we introduce LegalSim-PT, the first large-scale Portuguese dataset for document simplification based on legal texts. To mitigate reliance on manual evaluation, we combined data augmentation strategies with readability, semantic similarity, and diversity metrics to select the most suitable document pairs. We conducted a comprehensive analysis of the resulting dataset, first characterizing its surface features and comparing them with those of existing simplification corpora. Next, we assessed its quality using automatic metrics, linguistic indicators, and human evaluations. Finally, we selected representative models as baselines and fine-tuned two models on LegalSim-PT, achieving improved performance in document-level simplification.
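Metric-based pair selection of this kind can be sketched as threshold filtering. The function and thresholds below are illustrative assumptions, not those used to build LegalSim-PT; readability here stands for any grade-level score where lower means easier.

```python
def select_pairs(pairs, min_similarity=0.8, min_readability_gain=1.0):
    """Keep complex-simple pairs that preserve meaning and improve readability.

    Each pair is (similarity, complex_readability, simple_readability).
    Thresholds are illustrative, not the dataset's actual criteria.
    """
    selected = []
    for sim, complex_score, simple_score in pairs:
        meaning_kept = sim >= min_similarity
        easier = (complex_score - simple_score) >= min_readability_gain
        if meaning_kept and easier:
            selected.append((sim, complex_score, simple_score))
    return selected

candidates = [
    (0.92, 14.0, 9.5),   # similar meaning, much easier -> keep
    (0.55, 13.0, 8.0),   # meaning drifted -> drop
    (0.95, 12.0, 11.8),  # barely simplified -> drop
]
print(select_pairs(candidates))  # [(0.92, 14.0, 9.5)]
```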

2025

Sentence simplification (SS) focuses on adapting sentences to enhance their readability and accessibility. While large language models (LLMs) match task-specific baselines in English SS, their performance in Portuguese remains underexplored. This paper presents a comprehensive performance comparison of 26 state-of-the-art LLMs in Portuguese SS, alongside two simplification models trained explicitly for this task and language. They are evaluated under a one-shot setting across scientific, news, and government datasets. We benchmark the models with our newly introduced Gov-Lang-BR corpus (1,703 complex-simple sentence pairs from Brazilian government agencies) and two established datasets: PorSimplesSent and Museum-PT. Our investigation takes advantage of both automatic metrics and large-scale linguistic analysis to examine the transformations achieved by the LLMs. Furthermore, a qualitative assessment of selected generated outputs provides deeper insights into simplification quality. Our findings reveal that while open-source LLMs have achieved impressive results, closed-source LLMs continue to outperform them in Portuguese SS.
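The one-shot setting described above amounts to prepending a single complex-simple example to the instruction before the target sentence. A minimal template sketch, with wording that is an assumption rather than the prompt actually used in the paper:

```python
def one_shot_prompt(example_complex, example_simple, target):
    """Build a one-shot Portuguese simplification prompt (illustrative template)."""
    return (
        "Simplifique a frase a seguir em português.\n\n"
        f"Frase: {example_complex}\n"
        f"Simplificação: {example_simple}\n\n"
        f"Frase: {target}\n"
        "Simplificação:"
    )

print(one_shot_prompt(
    "O requerente deverá protocolar a documentação.",
    "Quem pedir deve entregar os documentos.",
    "A posologia consta na bula do medicamento.",
))
```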

2024

Automatic text simplification focuses on transforming texts into a more comprehensible version without sacrificing their precision. However, automatic methods usually require (paired) datasets that can be rather scarce in languages other than English. This paper presents a new approach to automatic sentence simplification that leverages paraphrases, context, and linguistic attributes to overcome the absence of paired texts in Portuguese. We frame the simplification problem as a textual style transfer task and learn a style representation using the sentences around the target sentence in the document and its linguistic attributes. Moreover, unlike most unsupervised approaches that require style-labeled training data, we fine-tune strong pre-trained models using sentence-level paraphrases instead of annotated data. Our experiments show that our model achieves remarkable results, surpassing the current state-of-the-art (BART+ACCESS) while competitively matching a Large Language Model.

2021

2020

This paper presents MedSimples, an authoring tool that combines Natural Language Processing, Corpus Linguistics, and Terminology to help writers convert health-related information into a more accessible version for people with low literacy skills. MedSimples applies parsing methods associated with lexical resources to automatically evaluate a text and present simplification suggestions that are more suitable for the target audience. Using the suggestions provided by the tool, the author can adapt the original text and make it more accessible. The focus of MedSimples lies on texts for special purposes, so that it deals not only with general vocabulary but also with specialized terms. The tool is currently under development, but an online working prototype exists and can be tested freely. An assessment of MedSimples was carried out to evaluate its current performance, with some promising results, especially for informing the future developments planned for the tool.
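The core of such lexicon-backed suggestion tools can be sketched as a lookup over a term-to-plain-language mapping. The entries below are invented for illustration and are not from the MedSimples resource:

```python
# Illustrative lexicon mapping specialized health terms to plain-language
# alternatives; entries are hypothetical, not the tool's actual data.
SIMPLE_LEXICON = {
    "analgésico": "remédio para dor",
    "posologia": "modo de usar",
    "contraindicação": "quando não usar",
}

def suggest_simplifications(text):
    """Return (term, suggestion) pairs for specialized terms found in a text."""
    suggestions = []
    lowered = text.lower()
    for term, plain in SIMPLE_LEXICON.items():
        if term in lowered:
            suggestions.append((term, plain))
    return suggestions

print(suggest_simplifications("Verifique a posologia e a contraindicação."))
# [('posologia', 'modo de usar'), ('contraindicação', 'quando não usar')]
```

A real tool would combine this lookup with parsing, so that suggestions respect morphology and context rather than raw substring matches.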

2016

This paper presents a lexical resource developed for Portuguese. The resource contains sentences annotated with semantic roles. The sentences were extracted from two domains: Cardiology research papers and newspaper articles. Both corpora were analyzed with the PALAVRAS parser and subsequently processed with a subcategorization frames extractor, so that each sentence containing at least one main verb was stored in a database together with its syntactic organization. The annotation was carried out manually by a linguist using an annotation interface. Both the annotated and non-annotated data were exported to an XML format, which is readily available for download. The non-annotated data were exported as well because they retain syntactic information from the parser annotation, which could be useful to other researchers. The sentences from both corpora were annotated separately, so that it is possible to access sentences either from the Cardiology or from the newspaper corpus. The full resource comprises more than seven thousand semantically annotated sentences, containing 192 different verbs and more than 15 thousand individual arguments and adjuncts.

2015

2014

Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. We then relate the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. We also propose a novel approach to focused crawling that exploits the expressive power of multiword expressions.
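Focused crawling is typically implemented as a best-first traversal: a priority queue of frontier URLs ordered by an estimated topical relevance. A minimal sketch, assuming hypothetical `fetch` and `score` callables rather than the paper's actual crawler; scoring by counting domain multiword expressions mirrors the idea mentioned above:

```python
import heapq

def focused_crawl(seed_urls, fetch, score, max_pages=100):
    """Best-first focused crawl: always expand the most promising URL next.

    `fetch(url)` returns (text, outlinks); `score(text)` estimates topical
    relevance, e.g. by counting domain multiword expressions in the text.
    Both callables are assumptions of this sketch.
    """
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, corpus = set(seed_urls), []
    while frontier and len(corpus) < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        relevance = score(text)
        if relevance > 0:
            corpus.append((url, text))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Children inherit the parent page's relevance estimate.
                heapq.heappush(frontier, (-relevance, link))
    return corpus
```

This is the shared skeleton of the compared algorithms; they differ mainly in how `score` is defined and how frontier priorities are propagated.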

2011

2009