BibTeX
@inproceedings{masumi-etal-2025-fabert,
title = "{F}a{BERT}: Pre-training {BERT} on {P}ersian Blogs",
author = "Masumi, Mostafa and
Majd, Seyed Soroush and
Shamsfard, Mehrnoush and
Beigy, Hamid",
editor = "Bak, JinYeong and
Goot, Rob van der and
Jang, Hyeju and
Buaphet, Weerayut and
Ramponi, Alan and
Xu, Wei and
Ritter, Alan",
booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
month = may,
year = "2025",
address = "Albuquerque, New Mexico, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.wnut-1.10/",
doi = "10.18653/v1/2025.wnut-1.10",
pages = "85--96",
ISBN = "979-8-89176-232-9",
abstract = "We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="masumi-etal-2025-fabert">
    <titleInfo>
      <title>FaBERT: Pre-training BERT on Persian Blogs</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Mostafa</namePart>
      <namePart type="family">Masumi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Seyed</namePart>
      <namePart type="given">Soroush</namePart>
      <namePart type="family">Majd</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mehrnoush</namePart>
      <namePart type="family">Shamsfard</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Hamid</namePart>
      <namePart type="family">Beigy</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the Tenth Workshop on Noisy and User-generated Text</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">JinYeong</namePart>
        <namePart type="family">Bak</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Rob</namePart>
        <namePart type="given">van</namePart>
        <namePart type="given">der</namePart>
        <namePart type="family">Goot</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Hyeju</namePart>
        <namePart type="family">Jang</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Weerayut</namePart>
        <namePart type="family">Buaphet</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Alan</namePart>
        <namePart type="family">Ramponi</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Wei</namePart>
        <namePart type="family">Xu</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Alan</namePart>
        <namePart type="family">Ritter</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Albuquerque, New Mexico, USA</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-232-9</identifier>
    </relatedItem>
    <abstract>We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications.</abstract>
    <identifier type="citekey">masumi-etal-2025-fabert</identifier>
    <identifier type="doi">10.18653/v1/2025.wnut-1.10</identifier>
    <location>
      <url>https://aclanthology.org/2025.wnut-1.10/</url>
    </location>
    <part>
      <date>2025-05</date>
      <extent unit="page">
        <start>85</start>
        <end>96</end>
      </extent>
    </part>
  </mods>
</modsCollection>

Endnote
%0 Conference Proceedings
%T FaBERT: Pre-training BERT on Persian Blogs
%A Masumi, Mostafa
%A Majd, Seyed Soroush
%A Shamsfard, Mehrnoush
%A Beigy, Hamid
%Y Bak, JinYeong
%Y Goot, Rob van der
%Y Jang, Hyeju
%Y Buaphet, Weerayut
%Y Ramponi, Alan
%Y Xu, Wei
%Y Ritter, Alan
%S Proceedings of the Tenth Workshop on Noisy and User-generated Text
%D 2025
%8 May
%I Association for Computational Linguistics
%C Albuquerque, New Mexico, USA
%@ 979-8-89176-232-9
%F masumi-etal-2025-fabert
%X We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications.
%R 10.18653/v1/2025.wnut-1.10
%U https://aclanthology.org/2025.wnut-1.10/
%U https://doi.org/10.18653/v1/2025.wnut-1.10
%P 85-96

Markdown (Informal)
[FaBERT: Pre-training BERT on Persian Blogs](https://aclanthology.org/2025.wnut-1.10/) (Masumi et al., WNUT 2025)

ACL
Mostafa Masumi, Seyed Soroush Majd, Mehrnoush Shamsfard, and Hamid Beigy. 2025. FaBERT: Pre-training BERT on Persian Blogs. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 85–96, Albuquerque, New Mexico, USA. Association for Computational Linguistics.