BibTeX
@inproceedings{sido-etal-2021-czert,
title = "Czert {--} {C}zech {BERT}-like Model for Language Representation",
author = "Sido, Jakub and
Pra{\v{z}}{\'a}k, Ond{\v{r}}ej and
P{\v{r}}ib{\'a}{\v{n}}, Pavel and
Pa{\v{s}}ek, Jan and
Sej{\'a}k, Michal and
Konop{\'\i}k, Miloslav",
editor = "Mitkov, Ruslan and
Angelova, Galia",
booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
month = sep,
year = "2021",
address = "Held Online",
publisher = "INCOMA Ltd.",
url = "https://aclanthology.org/2021.ranlp-1.149",
pages = "1326--1338",
abstract = "This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures. We pre-train our models on more than 340K of sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish the new state-of-the-art results on nine datasets. At the end, we discuss properties of monolingual and multilingual models based upon our results. We publish all the pre-trained and fine-tuned models freely for the research community.",
}
MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sido-etal-2021-czert">
<titleInfo>
<title>Czert – Czech BERT-like Model for Language Representation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jakub</namePart>
<namePart type="family">Sido</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ondřej</namePart>
<namePart type="family">Pražák</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pavel</namePart>
<namePart type="family">Přibáň</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jan</namePart>
<namePart type="family">Pašek</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Michal</namePart>
<namePart type="family">Seják</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Miloslav</namePart>
<namePart type="family">Konopík</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2021-09</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ruslan</namePart>
<namePart type="family">Mitkov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Galia</namePart>
<namePart type="family">Angelova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>INCOMA Ltd.</publisher>
<place>
<placeTerm type="text">Held Online</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss the properties of monolingual and multilingual models based on our results. We publish all the pre-trained and fine-tuned models freely for the research community.</abstract>
<identifier type="citekey">sido-etal-2021-czert</identifier>
<location>
<url>https://aclanthology.org/2021.ranlp-1.149</url>
</location>
<part>
<date>2021-09</date>
<extent unit="page">
<start>1326</start>
<end>1338</end>
</extent>
</part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T Czert – Czech BERT-like Model for Language Representation
%A Sido, Jakub
%A Pražák, Ondřej
%A Přibáň, Pavel
%A Pašek, Jan
%A Seják, Michal
%A Konopík, Miloslav
%Y Mitkov, Ruslan
%Y Angelova, Galia
%S Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
%D 2021
%8 September
%I INCOMA Ltd.
%C Held Online
%F sido-etal-2021-czert
%X This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss the properties of monolingual and multilingual models based on our results. We publish all the pre-trained and fine-tuned models freely for the research community.
%U https://aclanthology.org/2021.ranlp-1.149
%P 1326-1338
Markdown (Informal)
[Czert – Czech BERT-like Model for Language Representation](https://aclanthology.org/2021.ranlp-1.149) (Sido et al., RANLP 2021)
ACL
Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, and Miloslav Konopík. 2021. Czert – Czech BERT-like Model for Language Representation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1326–1338, Held Online. INCOMA Ltd.
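The abstract states that all pre-trained and fine-tuned models are released freely. As a minimal sketch of how such a BERT-style checkpoint is typically loaded with the Hugging Face transformers library; the model identifier UWB-AIR/Czert-B-base-cased used below is an assumption, so check the authors' release for the published name:

```python
# Minimal sketch: loading a published BERT-style Czech checkpoint with
# Hugging Face transformers. The model identifier is an assumption;
# consult the authors' release for the actual published name.
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "UWB-AIR/Czert-B-base-cased"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Czech sentence and take the [CLS] vector as a sentence representation.
inputs = tokenizer("Czert je český jazykový model.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
print(cls_embedding.shape)
```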