Sandra Aluísio

Also published as: Sandra Maria Aluísio, Sandra Aluisio, Sandra M. Aluísio

2025

MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling
Sidney Evaldo Leal | Arnaldo Candido Junior | Ricardo Marcacini | Edresson Casanova | Odilon Gonçalves | Anderson Silva Soares | Rodrigo Freitas Lima | Lucas Rafael Stefanel Gris | Sandra Aluísio
Proceedings of the 31st International Conference on Computational Linguistics

Recently, several public datasets for automatic speech recognition (ASR) in Brazilian Portuguese (BP) have been released, improving the performance of ASR systems. However, these datasets lack diversity in terms of age groups, regional accents, and education levels. In this paper, we present a new publicly available dataset consisting of 289 life story interviews (365 hours), featuring a broad range of speakers varying in age, education, and regional accents. First, we demonstrated the presence of bias in current BP ASR models concerning education levels and age groups. Second, we showed that our dataset helps mitigate these biases. Additionally, an ASR model trained on our dataset performed better during evaluation on a diverse test set. Finally, the ASR model trained with our dataset was extrinsically evaluated through a topic modeling task that utilized the automatically transcribed output.

2024

Simple and Fast Automatic Prosodic Segmentation of Brazilian Portuguese Spontaneous Speech
Giovana Meloni Craveiro | Vinicius Gonçalves Santos | Gabriel Jose Pellisser Dalalana | Flaviane R. Fernandes Svartman | Sandra Maria Aluísio
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

Portal NURC-SP: Design, Development, and Speech Processing Corpora Resources to Support the Public Dissemination of Portuguese Spoken Language
Ana Carolina Rodrigues | Alessandra A. Macedo | Arnaldo Candido Jr | Flaviane R. F. Svartman | Giovana M. Craveiro | Marli Quadros Leite | Sandra M. Aluísio | Vinícius G. Santos | Vinícius M. Garcia
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

TTS applied to the generation of datasets for automatic speech recognition
Edresson Casanova | Sandra Aluísio | Moacir Antonelli Ponti
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

2021

2020

Using Eye-tracking Data to Predict the Readability of Brazilian Portuguese Sentences in Single-task, Multi-task and Sequential Transfer Learning Approaches
Sidney Evaldo Leal | João Marcos Munguba Vieira | Erica dos Santos Rodrigues | Elisângela Nogueira Teixeira | Sandra Aluísio
Proceedings of the 28th International Conference on Computational Linguistics

Sentence complexity assessment is a relatively new task in Natural Language Processing. One of its aims is to highlight in a text which sentences are more complex to support the simplification of contents for a target audience (e.g., children, cognitively impaired users, non-native speakers and low-literacy readers (Scarton and Specia, 2018)). This task is evaluated using datasets of pairs of aligned sentences including the complex and simple version of the same sentence. For Brazilian Portuguese, the task was addressed by Leal et al. (2018), who set up the first dataset to evaluate the task in this language, reaching 87.8% accuracy with linguistic features. The present work advances these results, using models inspired by Gonzalez-Garduño and Søgaard (2018), which hold the state of the art for the English language, with multi-task learning and eye-tracking measures. First-Pass Duration, Total Regression Duration and Total Fixation Duration were used at two stages: first to select a subset of linguistic features and then as an auxiliary task in the multi-task and sequential learning models. The best model proposed here reaches the new state of the art for Portuguese with 97.5% accuracy, an increase of almost 10 points compared to the best previous results, in addition to proposing improvements in the public dataset after analysing the errors of our best model.

Evaluating Sentence Segmentation in Different Datasets of Neuropsychological Language Tests in Brazilian Portuguese
Edresson Casanova | Marcos Treviso | Lilian Hübner | Sandra Aluísio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic analysis of connected speech by natural language processing techniques is a promising direction for diagnosing cognitive impairments. However, some difficulties still remain: the time required for manual narrative transcription and the decision on how transcripts should be divided into sentences for successful application of parsers used in metrics, such as Idea Density, to analyze the transcripts. The main goal of this paper was to develop a generic segmentation system for narratives of neuropsychological language tests. We explored the performance of our previous single-dataset-trained sentence segmentation architecture in a richer scenario involving three new datasets used to diagnose cognitive impairments, comprising different stories and two types of stimulus presentation for eliciting narratives — visual and oral — via illustrated story-book and sequence of scenes, and by retelling. Also, we proposed and evaluated three modifications to our previous RCNN architecture: (i) the inclusion of a Linear Chain CRF; (ii) the inclusion of a self-attention mechanism; and (iii) the replacement of the LSTM recurrent layer by a Quasi-Recurrent Neural Network layer. Our study allowed us to develop two new models for segmenting impaired speech transcriptions, along with an ideal combination of datasets and specific groups of narratives to be used as the training set.

SIMPLEX-PB 2.0: A Reliable Dataset for Lexical Simplification in Brazilian Portuguese
Nathan Hartmann | Gustavo Henrique Paetzold | Sandra Aluísio
Proceedings of the Fourth Widening Natural Language Processing Workshop

Most research on Lexical Simplification (LS) addresses non-native speakers of English, since they are numerous and easy to recruit. This makes it difficult to create LS solutions for other languages and target audiences. This paper presents SIMPLEX-PB 2.0, a dataset for LS in Brazilian Portuguese that, unlike its predecessor SIMPLEX-PB, accurately captures the needs of Brazilian underprivileged children. To create SIMPLEX-PB 2.0, we addressed all limitations of the old SIMPLEX-PB through multiple rounds of manual annotation. As a result, SIMPLEX-PB 2.0 features much more reliable and numerous candidate substitutions for complex words, as well as word complexity rankings produced by a group of underprivileged children.

2018

A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese
Sidney Evaldo Leal | Magali Sanches Duran | Sandra Maria Aluísio
Proceedings of the 27th International Conference on Computational Linguistics

Effective textual communication depends on readers being proficient enough to comprehend texts, and texts being clear enough to be understood by the intended audience, in a reading task. When the meaning of textual information and instructions is not well conveyed, many losses and damages may occur. Among the solutions to alleviate this problem is the automatic evaluation of sentence readability, a task that has been receiving a lot of attention due to its wide applicability. However, a shortage of resources, such as corpora for training and evaluation, hinders the full development of this task. In this paper, we generate a nontrivial sentence corpus in Portuguese. We evaluate three scenarios for building it, taking advantage of a parallel corpus of simplification, in which each sentence triplet is aligned and has simplification operations annotated, making it ideal for explaining possible mistakes of future methods. The best scenario of our corpus PorSimplesSent is composed of 4,888 pairs, which is bigger than a similar corpus for English; all three versions of it are publicly available. We created four baselines for PorSimplesSent and made available a pairwise ranking method, using 17 linguistic and psycholinguistic features, which correctly identifies the ranking of sentence pairs with an accuracy of 74.2%.

2017

Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests using Recurrent Convolutional Neural Networks
Marcos Treviso | Christopher Shulby | Sandra Aluísio
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of sentence boundary segmentation in the transcripts prevents the direct application of NLP methods which rely on these marks in order to function properly, such as taggers and parsers. We present the first steps taken towards automatic neuropsychological evaluation based on narrative discourse analysis, presenting a new automatic sentence segmentation method for impaired speech. Our model uses recurrent convolutional neural networks with prosodic, Part of Speech (PoS) features, and word embeddings. It was evaluated intrinsically on impaired, spontaneous speech as well as normal, prepared speech and presents better results for healthy elderly (CTL) (F1 = 0.74) and Mild Cognitive Impairment (MCI) patients (F1 = 0.70) than the Conditional Random Fields method (F1 = 0.55 and 0.53, respectively) used in the same context of our study. The results suggest that our model is robust for impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by MCI and CTL.

Enriching Complex Networks with Word Embeddings for Detecting Mild Cognitive Impairment from Speech Transcripts
Leandro Santos | Edilson Anselmo Corrêa Júnior | Osvaldo Oliveira Jr | Diego Amancio | Letícia Mansur | Sandra Aluísio
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Mild Cognitive Impairment (MCI) is a mental disorder that is difficult to diagnose. Linguistic features, mainly from parsers, have been used to detect MCI, but this is not suitable for large-scale assessments. MCI disfluencies produce non-grammatical speech that requires manual or high-precision automatic correction of transcripts. In this paper, we modeled transcripts as complex networks and enriched them with word embeddings (CNE) to better represent short texts produced in neuropsychological assessments. The network measurements were applied with well-known classifiers to automatically identify MCI in transcripts, in a binary classification task. A comparison was made with the performance of traditional approaches using Bag of Words (BoW) and linguistic features for three datasets: DementiaBank in English, and Cinderella and Arizona-Battery in Portuguese. Overall, CNE provided higher accuracy than using only complex networks, while Support Vector Machine was superior to other classifiers. CNE provided the highest accuracies for DementiaBank and Cinderella, but BoW was more efficient for the Arizona-Battery dataset, probably owing to its short narratives. The approach using linguistic features yielded higher accuracy if the transcriptions of the Cinderella dataset were manually revised. Taken together, the results indicate that complex networks enriched with embeddings are promising for detecting MCI in large-scale assessments.

Discriminating between Similar Languages with Word-level Convolutional Neural Networks
Marcelo Criscuolo | Sandra Maria Aluísio
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

Discriminating between Similar Languages (DSL) is a challenging task addressed at the VarDial Workshop series. We report on our participation in the DSL shared task with a two-stage system. In the first stage, character n-grams are used to separate language groups, then specialized classifiers distinguish similar language varieties. We have conducted experiments with three system configurations and submitted one run for each. Our main approach is a word-level convolutional neural network (CNN) that learns task-specific vectors with minimal text preprocessing. We also experiment with multi-layer perceptron (MLP) networks and another hybrid configuration. Our best run achieved an accuracy of 90.76%, ranking 8th among 11 participants and getting very close to the system that ranked first (less than 2 points). Even though the CNN model could not achieve the best results, it still makes a viable approach to discriminating between similar languages.

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
Nathan Hartmann | Erick Fonseca | Christopher Shulby | Marcos Treviso | Jéssica Silva | Sandra Aluísio
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology

Evaluating Word Embeddings for Sentence Boundary Detection in Speech Transcripts
Marcos Treviso | Christopher Shulby | Sandra Aluísio
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology

2015

Automatic Generation of a Lexical Resource to support Semantic Role Labeling in Portuguese
Magali Sanches Duran | Sandra Aluísio
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

A Deep Architecture for Non-Projective Dependency Parsing
Erick Fonseca | Sandra Aluísio
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

Portal Min@s: Uma Ferramenta Geral de Apoio ao Processamento de Córpus de Propósito Geral (Portal Min@s: A General Purpose Support Tool for Corpora Processing)
Arnaldo Candido Junior | Thiago Lima Vieira | Marcel Serikawa | Matheus Antonio Ribeiro Silva | Régis Zangirolami | Sandra Maria Aluísio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models
Erick R. Fonseca | Sandra Maria Aluísio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2014

Generating a Lexicon of Errors in Portuguese to Support an Error Identification System for Spanish Native Learners
Lianet Sepúlveda Torres | Magali Sanches Duran | Sandra Aluísio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Portuguese is a less-resourced language with regard to foreign language learning. Aiming to inform a module of a system designed to support scientific written production of Spanish native speakers learning Portuguese, we developed an approach to automatically generate a lexicon of wrong words, reproducing language transfer errors made by such foreign learners. Each item of the artificially generated lexicon contains, besides the wrong word, the respective Spanish and Portuguese correct words. The wrong word is used to identify the interlanguage error and the correct Spanish and Portuguese forms are used to generate the suggestions. Keeping control of the correct word forms, we can provide correction or, at least, useful suggestions for the learners. We propose to combine two automatic procedures to obtain the error correction: i) a similarity measure and ii) a translation algorithm based on an aligned parallel corpus. The similarity-based method achieved a precision of 52%, whereas the alignment-based method achieved a precision of 90%. In this paper we focus only on interlanguage errors involving suffixes that have different forms in the two languages. The approach, however, is very promising for tackling other types of errors, such as gender errors.

A Large Corpus of Product Reviews in Portuguese: Tackling Out-Of-Vocabulary Words
Nathan Hartmann | Lucas Avanço | Pedro Balage | Magali Duran | Maria das Graças Volpe Nunes | Thiago Pardo | Sandra Aluísio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Web 2.0 has enabled a previously unimagined communication boom. With the widespread use of computational and mobile devices, anyone, in practically any language, may post comments on the web. As such, formal language is not necessarily used. In fact, in these communicative situations, language is marked by the absence of more complex syntactic structures and the presence of internet slang, with missing diacritics, repetitions of vowels, and the use of chat-speak style abbreviations, emoticons and colloquial expressions. Such language use poses severe new challenges for Natural Language Processing (NLP) tools and applications, which, so far, have focused on well-written texts. In this work, we report the construction of a large web corpus of product reviews in Brazilian Portuguese and the analysis of its lexical phenomena, which support the development of a lexical normalization tool intended, in future work, to enable the use of standard NLP products for web opinion mining and summarization purposes.

Some Issues on the Normalization of a Corpus of Products Reviews in Portuguese
Magali Sanches Duran | Lucas Avanço | Sandra Aluísio | Thiago Pardo | Maria da Graça Volpe Nunes
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

2013

Identifying Pronominal Verbs: Towards Automatic Disambiguation of the Clitic ‘se’ in Portuguese
Magali Sanches Duran | Carolina Evaristo Scarton | Sandra Maria Aluísio | Carlos Ramisch
Proceedings of the 9th Workshop on Multiword Expressions

Um repositório de verbos para a anotação de papéis semânticos disponível na web (A Verb Repository for Semantic Role Labeling Available in the Web) [in Portuguese]
Magali Sanches Duran | Jhonata Pereira Martins | Sandra Maria Aluísio
Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology

Approaches for Helping Brazilian Students Improve their Scientific Writings
Ethel Schuster | Rick Lizotte | Sandra M. Aluísio | Carmen Dayrell
Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology

An Evaluation of the Brazilian Portuguese LIWC Dictionary for Sentiment Analysis
Pedro P. Balage Filho | Thiago Alexandre Salgueiro Pardo | Sandra M. Aluísio
Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology

2012

Propbank-Br: a Brazilian Treebank annotated with semantic role labels
Magali Sanches Duran | Sandra Maria Aluísio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper reports the annotation of a Brazilian Portuguese Treebank with semantic role labels following Propbank guidelines. A different language and a different parser output impact the task and require some decisions on how to annotate the corpus. Therefore, a new annotation guide, called Propbank-Br, has been generated to deal with specific language phenomena and parser problems. In this phase of the project, the corpus was annotated by a single linguist. The annotation task reported here is part of a larger project for the Brazilian Portuguese language. This project aims to build Brazilian verb frame files and a broader, distributed annotation of semantic role labels in Brazilian Portuguese, allowing inter-annotator agreement measures. The corpus, available on the web, is already being used to build a semantic tagger for the Portuguese language.

Rhetorical Move Detection in English Abstracts: Multi-label Sentence Classifiers and their Annotated Corpora
Carmen Dayrell | Arnaldo Candido Jr. | Gabriel Lima | Danilo Machado Jr. | Ann Copestake | Valéria Feltrim | Stella Tagnin | Sandra Aluisio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The relevance of automatically identifying rhetorical moves in scientific texts has been widely acknowledged in the literature. This study focuses on abstracts of standard research papers written in English and aims to tackle a fundamental limitation of current machine-learning classifiers: they are mono-labeled, that is, a sentence can only be assigned one single label. However, such an approach does not adequately reflect actual language use since a move can be realized by a clause, a sentence, or even several sentences. Here, we present MAZEA (Multi-label Argumentative Zoning for English Abstracts), a multi-label classifier which automatically identifies rhetorical moves in abstracts but allows for a given sentence to be assigned as many labels as appropriate. We have resorted to various other NLP tools and used two large training corpora: (i) one corpus consists of 645 abstracts from physical sciences and engineering (PE) and (ii) the other corpus is made up of 690 abstracts from life and health sciences (LH). This paper presents our preliminary results, discusses the various challenges involved in multi-label tagging, and works towards satisfactory solutions. In addition, we make our two training corpora publicly available so that they may serve as a benchmark for this new task.

2011

Identifying and Analyzing Brazilian Portuguese Complex Predicates
Magali Sanches Duran | Carlos Ramisch | Sandra Maria Aluísio | Aline Villavicencio
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Towards an on-demand Simple Portuguese Wikipedia
Arnaldo Candido Jr | Ann Copestake | Lucia Specia | Sandra Maria Aluísio
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Características do jornalismo popular: avaliação da inteligibilidade e auxílio à descrição do gênero (Characteristics of Popular News: the Evaluation of Intelligibility and Support to the Genre Description) [in Portuguese]
Maria José B. Finatto | Carolina Evaristo Scarton | Amanda Rocha | Sandra Aluísio
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

Using machine learning methods to avoid the pitfall of cognates and false friends in Spanish-Portuguese word pairs
Lianet Sepúlveda Torres | Sandra Maria Aluísio
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

Propbank-Br: a Brazilian Portuguese corpus annotated with semantic role labels
Magali Sanches Duran | Sandra Maria Aluísio
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

2010

Assigning Wh-Questions to Verbal Arguments: Annotation Tools Evaluation and Corpus Building
Magali Sanches Duran | Marcelo Adriano Amâncio | Sandra Maria Aluísio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This work reports the evaluation and selection of annotation tools to assign wh-question labels to verbal arguments in a sentence. Wh-question assignment discussed herein is a kind of semantic annotation which involves two tasks: delimiting verbs and arguments, and linking verbs to their arguments by question labels. As it is a new type of semantic annotation, there is no report about the requirements an annotation tool should meet to support it. For this reason, we decided to select the most appropriate tool in two phases. In the first phase, we executed the task with an annotation tool we had used before in another task. This phase helped us to test the task and enabled us to know which features were or were not desirable in an annotation tool for our purpose. In the second phase, guided by such requirements, we evaluated several tools and selected a tool for the real task. After concluding the corpus annotation, we report some of the annotation results and some comments on the improvements that should be made in an annotation tool to better support this kind of annotation task.

SIMPLIFICA: a tool for authoring simplified texts in Brazilian Portuguese guided by readability assessments
Carolina Scarton | Matheus Oliveira | Arnaldo Candido Jr. | Caroline Gasperin | Sandra Aluísio
Proceedings of the NAACL HLT 2010 Demonstration Session

Readability Assessment for Text Simplification
Sandra Aluisio | Lucia Specia | Caroline Gasperin | Carolina Scarton
Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications

Fostering Digital Inclusion and Accessibility: The PorSimples project for Simplification of Portuguese Texts
Sandra Aluísio | Caroline Gasperin
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

Building a Corpus-based Historical Portuguese Dictionary : Challenges and Opportunities
Arnaldo Junior Candido | Sandra Maria Aluísio
Traitement Automatique des Langues, Volume 50, Numéro 2 : Langues anciennes [Ancient Languages]

Supporting the Adaptation of Texts for Poor Literacy Readers: a Text Simplification Editor for Brazilian Portuguese
Arnaldo Candido | Erick Maziero | Lucia Specia | Caroline Gasperin | Thiago Pardo | Sandra Aluisio
Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications

2004

The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools
Sandra Aluisio | Gisele Montilha Pinheiro | Aline M. P. Manfrin | Leandro H. M. de Oliveira | Luiz C. Genoves, Jr. | Stella E. O. Tagnin
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

In this paper we discuss the five requirements for building large publicly available corpora which guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a friendly and didactic interface; 4) the need to serve as support for several types of research; 5) the need to offer an array of associated tools. Also, we present the features that make Lácio-Web corpora interesting and novel as well as the limitations of this project, such as corpora size and balance, and the non-inclusion of spoken texts in the project’s reference corpus.

What is my Style? Using Stylistic Features of Portuguese Web Texts to Classify Web Pages According to Users’ Needs
Rachel Aires | Aline Manfrin | Sandra Aluísio | Diana Santos
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)