Adriana Pagano

2026

Viés de gênero na tradução automática: uma avaliação no par linguístico inglês-português
Tayane A. Soares | Yohan B. Gumiel | Rafael Junqueira | Tácio Gomes | Adriana Pagano
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

This paper presents an evaluation of gender bias in machine translation (MT) from English into Portuguese, analyzing the performance of three commercial translators (Google Translate, Microsoft Translator, Amazon Translate) and three general-purpose language models (GPT-3.5 Turbo, GPT-4o-mini, and Llama-3 8B-Instruct). Using the WinoMT test corpus (Stanovsky et al., 2019), the quantitative analysis measured accuracy and bias (ΔG and ΔS) in the translated corpus. The results show that all systems exhibit bias, performing better when translating masculine target entities (positive ΔG) and those that conform to occupational stereotypes (positive ΔS). The qualitative analysis, grounded in Systemic Functional Theory and focusing on the professions ‘nurse’ and ‘physician’, reveals how gender bias construes meanings for the target entities that differ from those of the source sentences and compromises referential cohesion. The study validates an evaluation algorithm adapted for Portuguese and reiterates the persistence of bias as a sociotechnical problem (Savoldi et al., 2025b). It concludes by noting the need for continuous evaluation and for the development of evaluation methods that consider different contexts of MT use, especially in critical domains, in order to weigh and mitigate harms.


Diálogos Tóxicos: Gatilhos e Padrões de Interação no Reddit Brasileiro
Giovana Piorino | Marco Antônio de Alcântara Machado | Luiz Henrique Quevedo Lima | Adriana Pagano | Ana Paula Couto da Silva
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

In this paper we analyze the structural and linguistic dynamics of online toxicity in Reddit discussion trees, focusing on how trigger comments escalate conflicts in Brazilian Portuguese. Using a fine-tuned BERTAbaporu model, we show that toxic discussions are deeper, more engaging, and initially semantically cohesive, but degrade over time, while non-toxic interactions emphasize social bonding. Our findings contribute to a better understanding of toxicity escalation and support early detection of discursive conflicts.

2025


With NLP research being rapidly productionized into real-world applications, it is important to be aware of and think through the consequences of our work. Such ethical considerations are important in both authoring and reviewing (e.g. privacy, consent, fairness, among others). This tutorial will equip participants with basic guidelines for thinking deeply about ethical issues and review common considerations that recur in NLP research. The methodology is interactive and participatory, including discussion of case studies and group work. Participants will gain practical experience on when to flag a paper for ethics review and how to write an ethical consideration section to be shared with the broader community. Most importantly, the participants will be co-creating the tutorial outcomes and extending tutorial materials to share as public outcomes.


Ontology-Guided Domain Entity Recognition in Environmental Texts: Evaluating Syntax-Driven and LLM Approaches Using BabelNet and GEMET
Elisa Chierchiello | Patricia Chiril | Adriana Pagano
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)


This paper details the findings of the 2025 UniDive shared task on multilingual morphosyntactic parsing. It introduces a new representation in which morphology and syntax are modelled jointly to form dependency trees of contentful elements, each characterized by features determined by grammatical words and morphemes. This schema allows bypassing the theoretical debate over the definition of “words” and it encourages development of parsers for typologically diverse languages. The data for the task, spanning 9 languages, was annotated based on existing Universal Dependencies (UD) treebanks that were adapted to the new format. We accompany the data with a new metric, MSLAS, that combines syntactic LAS with F1 over grammatical features. The task received two submissions, which together with three baselines give a detailed view on the ability of multi-task encoder models to cope with the task at hand. The best performing system, UM, achieved 78.7 MSLAS macro-averaged over all languages, improving by 31.4 points over the few-shot prompting baseline.

2024


Neural end-to-end surface realizers output more fluent texts than classical architectures. However, they tend to suffer from adequacy problems, in particular hallucinations in numerical referring expression generation. This poses a problem for language generation in sensitive domains, as is the case of robot journalism covering COVID-19 and Amazon deforestation. We propose an approach whereby numerical referring expressions are converted from digits to plain word form descriptions prior to being fed to state-of-the-art Large Language Models. We conduct automatic and human evaluations to report the best strategy for numerical surface realization. Code and data are publicly available.


Proceedings of the 15th Brazilian Symposium in Information and Human Language Technology
Daniela Barreiro Claro | Adriana Pagano
Proceedings of the 15th Brazilian Symposium in Information and Human Language Technology


A Persona-Based Corpus in the Diabetes Self-Care Domain - Applying a Human-Centered Approach to a Low-Resource Context
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

While Natural Language Processing (NLP) models have gained substantial attention, only in recent years has research opened new paths for tackling Human-Computer Design (HCD) from the perspective of natural language. We focus on developing a human-centered corpus, more specifically, a persona-based corpus in a particular healthcare domain (diabetes mellitus self-care). In order to follow an HCD approach, we created personas to model interpersonal interaction (expert and non-expert users) in that specific domain. We show that an HCD approach benefits language generation from different perspectives, from machines to humans, contributing new directions for low-resource contexts (languages other than English and sensitive domains) where the need to promote effective communication is essential.


Authorship attribution in translated texts: a stylometric approach to translator style
Ana Pagano | Carlos Perini | Evandro Cunha | Adriana Pagano
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2

2023


A funcionalidade dos adjetivos em dois gêneros discursivos: uma investigação com base nas dependências universais
Andre Coneglian | Adriana Pagano | Carlos Perini
Proceedings of the 14th Brazilian Symposium in Information and Human Language Technology


Viés de gênero na tradução automática do GPT-3.5 turbo: avaliando o par linguístico inglês-português
Tayane Soares | Yohan Gumiel | Rafael Junqueira | Tacio Gomes | Adriana Pagano
Proceedings of the 14th Brazilian Symposium in Information and Human Language Technology

2021

This study describes the development of a Portuguese Community-Question Answering benchmark in the domain of Diabetes Mellitus using a Recognizing Question Entailment (RQE) approach. Given a premise question, RQE aims to retrieve semantically similar, already answered, archived questions. We build a new Portuguese benchmark corpus with 785 pairs between premise questions and archived answered questions marked with relevance judgments by medical experts. Based on the benchmark corpus, we leveraged and evaluated several RQE approaches ranging from traditional information retrieval methods to novel large pre-trained language models and ensemble techniques using learn-to-rank approaches. Our experimental results show that a supervised transformer-based method trained with multiple languages and for multiple tasks (MUSE) outperforms the alternatives. Our results also show that ensembles of methods (stacking) as well as a traditional (light) information retrieval method (BM25) can produce competitive results. Finally, among the tested strategies, those that exploit only the question (not the answer) provide the best effectiveness-efficiency trade-off. Code is publicly available.


Enriching the E2E dataset
Thiago Castro Ferreira | Helena Vaz | Brian Davis | Adriana Pagano
Proceedings of the 14th International Conference on Natural Language Generation

This study introduces an enriched version of the E2E dataset, one of the most popular language resources for data-to-text NLG. We extract intermediate representations for popular pipeline tasks such as discourse ordering, text structuring, lexicalization and referring expression generation, enabling researchers to rapidly develop and evaluate their data-to-text pipeline systems. The intermediate representations are extracted by aligning non-linguistic and text representations through a process called delexicalization, which consists in replacing input referring expressions to entities/attributes with placeholders. The enriched dataset is publicly available.


On auxiliary verb in Universal Dependencies: untangling the issue and proposing a systematized annotation strategy
Magali Duran | Adriana Pagano | Amanda Rassi | Thiago Pardo
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)


Sentiment Analysis in Portuguese Texts from Online Health Community Forums: Data, Model and Evaluation
Yohan Gumiel | Isabela Lee | Tayane Soares | Thiago Ferreira | Adriana Pagano
Proceedings of the 13th Brazilian Symposium in Information and Human Language Technology

2020


Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing
Felipe Almeida Costa | Thiago Castro Ferreira | Adriana Pagano | Wagner Meira
Proceedings of the 28th International Conference on Computational Linguistics

This paper introduces the first corpus for Automatic Post-Editing of English and a low-resource language, Brazilian Portuguese. The source English texts were extracted from the WebNLG corpus and automatically translated into Portuguese using a state-of-the-art industrial neural machine translator. Post-edits were then obtained in an experiment with native speakers of Brazilian Portuguese. To assess the quality of the corpus, we performed error analysis and computed complexity indicators measuring how difficult the APE task would be. We report preliminary results of Phrase-Based and Neural Machine Translation Models on this new corpus. Data and code are publicly available in our repository.


Referring to what you know and do not know: Making Referring Expression Generation Models Generalize To Unseen Entities
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves
Proceedings of the 28th International Conference on Computational Linguistics

Data-to-text Natural Language Generation (NLG) is the computational process of generating natural language in the form of text or voice from non-linguistic data. A core micro-planning task within NLG is referring expression generation (REG), which aims to automatically generate noun phrases to refer to entities mentioned as discourse unfolds. A limitation of novel REG models is not being able to generate referring expressions to entities not encountered during the training process. To solve this problem, we propose two extensions to NeuralREG, a state-of-the-art encoder-decoder REG model. The first is a copy mechanism, whereas the second consists of representing the gender and type of the referent as inputs to the model. Drawing on the results of automatic and human evaluation as well as an ablation study using the WebNLG corpus, we contend that our proposal contributes to the generation of more meaningful referring expressions to unseen entities than the original system and related work. Code and all produced data are publicly available.


DaMata: A Robot-Journalist Covering the Brazilian Amazon Deforestation
André Luiz Rosa Teixeira | João Campos | Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Cozman
Proceedings of the 13th International Conference on Natural Language Generation

This demo paper introduces DaMata, a robot-journalist covering deforestation in the Brazilian Amazon. The robot-journalist is based on a pipeline architecture of Natural Language Generation, which yields multilingual daily and monthly reports based on the public data provided by DETER, a real-time deforestation satellite monitor developed and maintained by the Brazilian National Institute for Space Research (INPE). DaMata automatically generates reports in Brazilian Portuguese and English and publishes them on the Twitter platform. Corpus and code are publicly available.

2017


Estudo exploratório de categorias gramaticais com potencial de indicadores para a Análise de Sentimentos (An Exploratory study of grammatical categories as potential indicators for Sentiment Analysis)[In Portuguese]
Júlia Rodrigues | Adriana Pagano | Emerson Paraiso
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology