Fateme Kalantari
2025
Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles
Vahide Tajalli | Mehrnoush Shamsfard | Fateme Kalantari
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails, and text messages. In informal writing, the language undergoes lexical and/or syntactic changes that vary from language to language. Persian is one of the languages with many differences between its formal and informal writing styles, so developing informal language processing tools for it seems necessary. This paper describes the methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs. The observed differences between formal and informal writing are explained in detail.
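To make the relationship between the aligned sentence pairs and the derived dictionary more concrete, here is a minimal Python sketch of how such a dictionary might be collected from word/phrase alignments. It is not the authors' implementation: the `AlignedPair` class, the `build_dictionary` function, and the toy sentences (given in rough Latin transliteration rather than Persian script) are all hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class AlignedPair:
    """One informal/formal sentence pair with word/phrase alignments (hypothetical layout)."""
    informal: str
    formal: str
    # Each alignment links an informal word or phrase to its formal counterpart.
    alignments: Tuple[Tuple[str, str], ...]


def build_dictionary(pairs: Iterable[AlignedPair]) -> Dict[str, Dict[str, int]]:
    """Count how often each informal unit is aligned to each differing formal unit."""
    counts = defaultdict(Counter)
    for pair in pairs:
        for informal_unit, formal_unit in pair.alignments:
            # Units that are identical in both styles carry no conversion information.
            if informal_unit != formal_unit:
                counts[informal_unit][formal_unit] += 1
    return {unit: dict(targets) for unit, targets in counts.items()}


# Toy data in rough transliteration; illustrative only, not drawn from the corpus.
corpus = [
    AlignedPair(
        informal="man mikham beram khune",
        formal="man mikhaham be khane beravam",
        alignments=(
            ("man", "man"),
            ("mikham", "mikhaham"),
            ("beram", "beravam"),
            ("khune", "khane"),
        ),
    ),
    AlignedPair(
        informal="in khune bozorge",
        formal="in khane bozorg ast",
        alignments=(
            ("in", "in"),
            ("khune", "khane"),
            ("bozorge", "bozorg ast"),  # one informal word aligned to a formal phrase
        ),
    ),
]

dictionary = build_dictionary(corpus)
for informal_unit, formal_units in sorted(dictionary.items()):
    print(informal_unit, "->", formal_units)
```

Keeping a count per formal counterpart, rather than a single mapping, is one possible design choice: it leaves room for informal forms that correspond to more than one formal form.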