Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese

Diego Alves


Abstract
We present a general analysis of the lexical and grammatical differences between Brazilian Portuguese (BP) and European Portuguese (EP) by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified the characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level, driven by word pairs unique to each variety and also by grammatical distinctions. Furthermore, the analysis of parts of speech (POS), dependency relations, and POS trigrams provided information about distinctive grammatical constructions. Finally, the word order entropy analysis revealed that, while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.
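As a rough illustration of the two measures named in the abstract, the sketch below computes a Kullback-Leibler divergence between two token distributions and a Shannon entropy over observed word orders. It is not the paper's implementation: the function names `kl_divergence` and `word_order_entropy`, the add-alpha smoothing, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def kl_divergence(tokens_p, tokens_q, alpha=1.0):
    """D_KL(P || Q) in bits over the joint vocabulary.

    Add-alpha smoothing (an assumption of this sketch) keeps Q non-zero
    for items that occur in only one of the two samples.
    """
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)
    kl = 0.0
    for item in vocab:
        p = (counts_p[item] + alpha) / total_p
        q = (counts_q[item] + alpha) / total_q
        kl += p * math.log2(p / q)
    return kl


def word_order_entropy(orders):
    """Shannon entropy in bits of a list of observed order labels,
    e.g. 'head-first' / 'head-last' for one dependency relation."""
    counts = Counter(orders)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy, hand-made sentences for illustration only (not corpus data).
bp_tokens = "o ônibus chegou no ponto".split()
ep_tokens = "o autocarro chegou à paragem".split()

print(kl_divergence(bp_tokens, ep_tokens))
print(word_order_entropy(["head-first", "head-last", "head-first", "head-last"]))
```

The same `kl_divergence` function can be applied to sequences of POS tags, dependency labels, or POS trigrams instead of word tokens. KL divergence is asymmetric, so the BP-to-EP and EP-to-BP directions generally give different values.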
Anthology ID:
2025.vardial-1.2
Volume:
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
9–19
URL:
https://aclanthology.org/2025.vardial-1.2/
Cite (ACL):
Diego Alves. 2025. Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 9–19, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese (Alves, VarDial 2025)
PDF:
https://aclanthology.org/2025.vardial-1.2.pdf