Pre-trained language models evaluating themselves - A comparative study

Philipp Koch, Matthias Aßenmacher, Christian Heumann


Abstract
Evaluating generated text has received renewed attention with the introduction of model-based metrics in recent years. These new metrics correlate more strongly with human judgments and seemingly overcome many issues of earlier n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and none of them was consistently sensitive to the other issues mentioned above.
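For illustration, the probing setup described in the abstract can be approximated in a few lines of Python. The sketch below is not the authors' implementation (their code is linked under "Code" as lazerlambda/metricscomparison); it assumes the bert-score package is installed and shows how one of the examined metrics, BERTScore, could be checked for sensitivity to a negation perturbation. The example sentences are invented for illustration.

    import bert_score  # assumed dependency: pip install bert-score

    reference = ["The proposed metric correlates well with human judgments."]
    original  = ["The proposed metric correlates well with human judgments."]
    negated   = ["The proposed metric does not correlate well with human judgments."]

    # Score the unperturbed and the negated candidate against the same reference.
    _, _, f1_orig = bert_score.score(original, reference, lang="en", verbose=False)
    _, _, f1_neg  = bert_score.score(negated, reference, lang="en", verbose=False)

    print(f"F1 (original): {f1_orig.item():.4f}")
    print(f"F1 (negated):  {f1_neg.item():.4f}")

    # A negation-sensitive metric should assign a clearly lower score to the
    # negated candidate; the paper reports that none of the examined metrics
    # behaves appropriately in this case.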
Anthology ID:
2022.insights-1.25
Volume:
Proceedings of the Third Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Shabnam Tafreshi, João Sedoc, Anna Rogers, Aleksandr Drozd, Anna Rumshisky, Arjun Akula
Venue:
insights
Publisher:
Association for Computational Linguistics
Pages:
180–187
URL:
https://aclanthology.org/2022.insights-1.25
DOI:
10.18653/v1/2022.insights-1.25
Cite (ACL):
Philipp Koch, Matthias Aßenmacher, and Christian Heumann. 2022. Pre-trained language models evaluating themselves - A comparative study. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 180–187, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Pre-trained language models evaluating themselves - A comparative study (Koch et al., insights 2022)
PDF:
https://aclanthology.org/2022.insights-1.25.pdf
Video:
https://aclanthology.org/2022.insights-1.25.mp4
Code:
lazerlambda/metricscomparison