Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models

Ariani Di Felippo; Norton Trevisan Roman; Bryan K. S. Barbosa; Gabriela Pinheiro de Oliveira; Clarissa Lenina Scandarolli

Abstract

Twitter/X remains a key source of user-generated content, requiring Natural Language Processing tools capable of handling non-canonical language. This study presents a manual annotation of lexical and orthographic phenomena in DANTEStocks, a corpus of financial tweets in Brazilian Portuguese, using a hierarchical typology to capture both creative uses and deviations from the standard norm. Results show that orthographic variation is strongly influenced by creative forms, mainly driven by platform- and domain-specific innovations. Standard norm variation is systematic, mostly involving predictable omissions of diacritics and the cedilla, and most tokens exhibit only one phenomenon, reflecting stable and largely independent patterns of variation in this Twitter subgenre. The identified variant forms enabled the construction of a lexicon for evaluating embedding models. We assessed how BERTimbau, Word2Vec, and FastText handle lexical variation in raw, unnormalized data, showing that the lexicon reduces out-of-vocabulary rates and improves coverage. These results highlight model robustness and the value of curated lexical resources in complementing both fixed and data-driven vocabularies.
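The out-of-vocabulary comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the evaluation procedure, so everything below is an assumption made for illustration: the function name `oov_rate`, the toy vocabulary, the toy variant lexicon (diacritic and cedilla omissions mapped to canonical forms), and the sample tokens are all hypothetical and are not taken from DANTEStocks or from the models' actual vocabularies.

```python
def oov_rate(tokens, vocabulary, lexicon=None):
    """Return the fraction of tokens not covered by `vocabulary`.

    If `lexicon` is given (variant form -> canonical form), a token also
    counts as covered when its canonical form is in the vocabulary.
    This is an illustrative definition, not the paper's stated metric.
    """
    if not tokens:
        return 0.0
    lexicon = lexicon or {}
    missing = 0
    for token in tokens:
        form = token.lower()
        if form in vocabulary:
            continue  # covered directly by the model vocabulary
        if lexicon.get(form) in vocabulary:
            continue  # covered after mapping the variant to its canonical form
        missing += 1
    return missing / len(tokens)


# Hypothetical toy data: a tiny "model vocabulary" and a variant lexicon.
vocabulary = {"não", "ação", "comprar", "hoje"}
variant_lexicon = {"nao": "não", "acao": "ação"}
tokens = ["nao", "comprar", "acao", "hoje", "kkk"]

baseline = oov_rate(tokens, vocabulary)
with_lexicon = oov_rate(tokens, vocabulary, variant_lexicon)
print(f"OOV without lexicon: {baseline:.2f}")
print(f"OOV with lexicon:    {with_lexicon:.2f}")
```

In this toy setting the lexicon recovers the two diacritic-free variants, while the creative form `kkk` stays out of vocabulary. For a subword model such as BERTimbau, coverage would more plausibly be measured over word pieces than whole tokens, so this sketch is closest to the fixed-vocabulary Word2Vec case.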

Anthology ID: 2026.propor-1.62
Volume: Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Month: April
Year: 2026
Address: Salvador, Brazil
Editors: Marlo Souza, Iria de-Dios-Flores, Diana Santos, Larissa Freitas, Jackson Wilke da Cruz Souza, Eugénio Ribeiro
Venue: PROPOR
Publisher: Association for Computational Linguistics
Pages: 628–637
URL: https://aclanthology.org/2026.propor-1.62/
Cite (ACL): Ariani Di Felippo, Norton Trevisan Roman, Bryan K. S. Barbosa, Gabriela Pinheiro de Oliveira, and Clarissa Lenina Scandarolli. 2026. Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 628–637, Salvador, Brazil. Association for Computational Linguistics.
Cite (Informal): Lexical and Orthographic Variation in Portuguese Financial Tweets: Annotation, Analysis, and Implications for Embedding Models (Felippo et al., PROPOR 2026)
PDF: https://aclanthology.org/2026.propor-1.62.pdf
