BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

Nikola Ljubešić; Davor Lauc

Abstract

In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR - a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made available for free usage and further task-specific fine-tuning through HuggingFace.

Anthology ID: 2021.bsnlp-1.5
Volume: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Month: April
Year: 2021
Address: Kiyv, Ukraine
Editors: Bogdan Babych, Olga Kanishcheva, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Vasyl Starko, Josef Steinberger, Roman Yangarber, Michał Marcińczuk, Senja Pollak, Pavel Přibáň, Marko Robnik-Šikonja
Venue: BSNLP
SIG: SIGSLAV
Publisher: Association for Computational Linguistics
Pages: 37–42
URL: https://aclanthology.org/2021.bsnlp-1.5/
Bibkey: ljubesic-lauc-2021-bertic
Cite (ACL): Nikola Ljubešić and Davor Lauc. 2021. BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kiyv, Ukraine. Association for Computational Linguistics.
Cite (Informal): BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian (Ljubešić & Lauc, BSNLP 2021)
PDF: https://aclanthology.org/2021.bsnlp-1.5.pdf

Export citation

BibTeX

@inproceedings{ljubesic-lauc-2021-bertic,
    title = "{BERT}i{\'c} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and
      Lauc, Davor",
    editor = "Babych, Bogdan  and
      Kanishcheva, Olga  and
      Nakov, Preslav  and
      Piskorski, Jakub  and
      Pivovarova, Lidia  and
      Starko, Vasyl  and
      Steinberger, Josef  and
      Yangarber, Roman  and
      Marci{\'n}czuk, Micha{\l}  and
      Pollak, Senja  and
      P{\v{r}}ib{\'a}{\v{n}}, Pavel  and
      Robnik-{\v{S}}ikonja, Marko",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bsnlp-1.5/",
    pages = "37--42",
    abstract = "In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR - a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTi{\'c} model is made available for free usage and further task-specific fine-tuning through HuggingFace."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="ljubesic-lauc-2021-bertic">
    <titleInfo>
        <title>BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Nikola</namePart>
        <namePart type="family">Ljubešić</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Davor</namePart>
        <namePart type="family">Lauc</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2021-04</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Bogdan</namePart>
            <namePart type="family">Babych</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Olga</namePart>
            <namePart type="family">Kanishcheva</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Preslav</namePart>
            <namePart type="family">Nakov</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jakub</namePart>
            <namePart type="family">Piskorski</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Lidia</namePart>
            <namePart type="family">Pivovarova</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Vasyl</namePart>
            <namePart type="family">Starko</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Josef</namePart>
            <namePart type="family">Steinberger</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Roman</namePart>
            <namePart type="family">Yangarber</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Michał</namePart>
            <namePart type="family">Marcińczuk</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Senja</namePart>
            <namePart type="family">Pollak</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Pavel</namePart>
            <namePart type="family">Přibáň</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marko</namePart>
            <namePart type="family">Robnik-Šikonja</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Kiyv, Ukraine</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR - a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made available for free usage and further task-specific fine-tuning through HuggingFace.</abstract>
    <identifier type="citekey">ljubesic-lauc-2021-bertic</identifier>
    <location>
        <url>https://aclanthology.org/2021.bsnlp-1.5/</url>
    </location>
    <part>
        <date>2021-04</date>
        <extent unit="page">
            <start>37</start>
            <end>42</end>
        </extent>
    </part>
</mods>
</modsCollection>
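
A MODS record like the one above can be read with Python's standard library. The following is a minimal sketch, not part of the Anthology's tooling: the function name `read_mods` and the shortened `SAMPLE` record are illustrative assumptions. It relies on the fact that the paper's authors are direct children of `<mods>`, while the editors sit inside `<relatedItem>`.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <modsCollection> in the record above.
MODS_NS = {"m": "http://www.loc.gov/mods/v3"}

# Shortened copy of the record (hypothetical sample: one editor kept, abstract omitted).
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="ljubesic-lauc-2021-bertic">
    <titleInfo>
        <title>BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Nikola</namePart>
        <namePart type="family">Ljubešić</namePart>
        <role><roleTerm authority="marcrelator" type="text">author</roleTerm></role>
    </name>
    <name type="personal">
        <namePart type="given">Davor</namePart>
        <namePart type="family">Lauc</namePart>
        <role><roleTerm authority="marcrelator" type="text">author</roleTerm></role>
    </name>
    <relatedItem type="host">
        <name type="personal">
            <namePart type="given">Bogdan</namePart>
            <namePart type="family">Babych</namePart>
            <role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role>
        </name>
    </relatedItem>
</mods>
</modsCollection>"""


def read_mods(xml_text):
    """Return the cite key, title, and author names of the first <mods> record."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    record = root.find("m:mods", MODS_NS)
    title = record.findtext("m:titleInfo/m:title", namespaces=MODS_NS)
    authors = []
    # findall without a leading ".//" matches direct children only,
    # so the editors nested under <relatedItem> are not visited.
    for name in record.findall("m:name", MODS_NS):
        role = name.findtext("m:role/m:roleTerm", namespaces=MODS_NS)
        if role == "author":
            given = name.findtext("m:namePart[@type='given']", namespaces=MODS_NS)
            family = name.findtext("m:namePart[@type='family']", namespaces=MODS_NS)
            authors.append(f"{given} {family}")
    return {"citekey": record.get("ID"), "title": title, "authors": authors}


info = read_mods(SAMPLE)
print(info["citekey"], info["authors"])
```

The text is encoded to bytes before parsing so that the same function also accepts a full export that begins with an XML declaration naming UTF-8.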

Endnote

%0 Conference Proceedings
%T BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian
%A Ljubešić, Nikola
%A Lauc, Davor
%Y Babych, Bogdan
%Y Kanishcheva, Olga
%Y Nakov, Preslav
%Y Piskorski, Jakub
%Y Pivovarova, Lidia
%Y Starko, Vasyl
%Y Steinberger, Josef
%Y Yangarber, Roman
%Y Marcińczuk, Michał
%Y Pollak, Senja
%Y Přibáň, Pavel
%Y Robnik-Šikonja, Marko
%S Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
%D 2021
%8 April
%I Association for Computational Linguistics
%C Kiyv, Ukraine
%F ljubesic-lauc-2021-bertic
%X In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR - a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made available for free usage and further task-specific fine-tuning through HuggingFace.
%U https://aclanthology.org/2021.bsnlp-1.5/
%P 37-42
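
The tagged record above is line-oriented: each line starts with a two-character tag such as `%A` or `%T`, followed by a space and the value. A minimal sketch of collecting those fields in Python follows; the function name `parse_endnote` and the shortened `SAMPLE` are illustrative assumptions, and the sketch assumes every value fits on a single line, as it does in this record.

```python
from collections import defaultdict

# Shortened copy of the record above (hypothetical sample; editors and abstract omitted).
SAMPLE = """%0 Conference Proceedings
%T BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian
%A Ljubešić, Nikola
%A Lauc, Davor
%D 2021
%U https://aclanthology.org/2021.bsnlp-1.5/
%P 37-42"""


def parse_endnote(text):
    """Group the values of a tagged record by tag; repeated tags keep their order."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("%"):
            continue  # skip blank or untagged lines
        tag, _, value = line.partition(" ")
        fields[tag].append(value.strip())
    return dict(fields)


record = parse_endnote(SAMPLE)
print(record["%A"])
```

Values are kept in lists because tags such as `%A` repeat once per person.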

Markdown (Informal)

[BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian](https://aclanthology.org/2021.bsnlp-1.5/) (Ljubešić & Lauc, BSNLP 2021)