Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles

Vahide Tajalli, Mehrnoush Shamsfard, Fateme Kalantari


Abstract
Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails and text messages. In informal writing, the language undergoes lexical and/or syntactic changes that vary from one language to another. Persian is one of the languages with many differences between its formal and informal styles of writing, so developing informal language processing tools for it seems necessary. This paper describes the methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs. The observed differences between formal and informal writing are explained in detail.
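The abstract describes sentence pairs aligned at the word/phrase level and a dictionary of word and phrase pairs. The sketch below shows one way such data could be represented and how a dictionary might be derived from the alignments. It is an illustration only, not the authors' format or tooling: the names `AlignedPair` and `build_dictionary`, the span encoding, and the example sentence are all assumptions made here and are not taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """Hypothetical record for one informal/formal sentence pair.

    Each alignment links a span of informal tokens to a span of formal
    tokens. Spans are (start, end) token indices with an exclusive end,
    so a span can cover a single word or a multi-word phrase.
    """
    informal: tuple[str, ...]
    formal: tuple[str, ...]
    alignments: tuple[tuple[tuple[int, int], tuple[int, int]], ...]


def build_dictionary(pairs):
    """Count aligned (informal, formal) phrase pairs whose two sides differ."""
    counts = Counter()
    for pair in pairs:
        for (i0, i1), (f0, f1) in pair.alignments:
            src = " ".join(pair.informal[i0:i1])
            tgt = " ".join(pair.formal[f0:f1])
            if src != tgt:  # identical forms carry no style information
                counts[(src, tgt)] += 1
    return counts


# Illustrative sentence ("I want to go home"), written for this sketch.
example = AlignedPair(
    informal=("می‌خوام", "برم", "خونه"),
    formal=("می‌خواهم", "به", "خانه", "بروم"),
    alignments=(
        ((0, 1), (0, 1)),  # word-level: informal verb form -> formal form
        ((1, 2), (3, 4)),  # word-level, with the verb moved to the end
        ((2, 3), (1, 3)),  # phrase-level: one informal word -> two formal words
    ),
)

dictionary = build_dictionary([example])
for (src, tgt), n in dictionary.items():
    print(src, "->", tgt, n)
```

Storing spans rather than single token indices lets the same record hold both word and phrase alignments, which is the granularity the abstract mentions; a real pipeline would also need to settle tokenization and the handling of the zero-width non-joiner.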
Anthology ID:
2025.abjadnlp-1.6
Volume:
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editor:
Mo El-Haj
Venues:
AbjadNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
44–53
URL:
https://aclanthology.org/2025.abjadnlp-1.6/
Cite (ACL):
Vahide Tajalli, Mehrnoush Shamsfard, and Fateme Kalantari. 2025. Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles. In Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, pages 44–53, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles (Tajalli et al., AbjadNLP 2025)
PDF:
https://aclanthology.org/2025.abjadnlp-1.6.pdf