Bryan K. S. Barbosa

2026

Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models
Ariani Di Felippo | Norton Trevisan Roman | Bryan K. S. Barbosa | Gabriela Pinheiro de Oliveira | Clarissa Lenina Scandarolli
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1

Twitter/X remains a key source of user-generated content, requiring Natural Language Processing tools capable of handling non-canonical language. This study presents a manual annotation of lexical and orthographic phenomena in DANTEStocks, a corpus of financial tweets in Brazilian Portuguese, using a hierarchical typology to capture both creative uses and deviations from the standard norm. Results show that orthographic variation is strongly influenced by creative forms, mainly driven by platform- and domain-specific innovations. Standard norm variation is systematic, mostly involving predictable omissions of diacritics and the cedilla, and most tokens exhibit only one phenomenon, reflecting stable and largely independent patterns of variation in this Twitter subgenre. The identified variant forms enabled the construction of a lexicon for evaluating embedding models. We assessed how BERTimbau, Word2Vec, and FastText handle lexical variation in raw, unnormalized data, showing that the lexicon reduces out-of-vocabulary rates and improves coverage. These results highlight model robustness and the value of curated lexical resources in complementing both fixed and data-driven vocabularies.

2023

Em Direção à Anotação Sintatica - UD de Tweets do Mercado Financeiro
Bryan K. S. Barbosa | Ariani Di-Felippo
Proceedings of the 2nd Edition of the Universal Dependencies Brazilian Festival


Venues

PROPOR (1)
UDFestBR (1)
