Understanding the Impact of Experiment Design for Evaluating Dialogue System Output

Sashank Santhanam, Samira Shaikh


Abstract
Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as no prior experience of participating in similar studies of rating dialogue system output
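As an illustrative aside (not part of the paper), the kind of rating-consistency comparison described in the abstract is often quantified with an inter-rater agreement statistic such as Krippendorff's alpha. The sketch below uses hypothetical rating matrices and the open-source krippendorff Python package; the paper does not prescribe this particular metric or library.

# Illustrative only: compare inter-rater consistency of hypothetical
# Likert-scale vs. continuous-scale ratings using Krippendorff's alpha.
# Requires: pip install krippendorff numpy
import numpy as np
import krippendorff

# Rows = raters, columns = rated dialogue responses.
likert_ratings = np.array([      # hypothetical 1-5 Likert judgments
    [4, 2, 5, 3, 1],
    [5, 2, 4, 3, 2],
    [3, 1, 5, 4, 1],
], dtype=float)

continuous_ratings = np.array([  # hypothetical 0-100 slider judgments
    [78, 35, 92, 60, 15],
    [81, 30, 88, 62, 20],
    [75, 33, 90, 58, 18],
], dtype=float)

alpha_likert = krippendorff.alpha(reliability_data=likert_ratings,
                                  level_of_measurement="ordinal")
alpha_continuous = krippendorff.alpha(reliability_data=continuous_ratings,
                                      level_of_measurement="interval")

print(f"Krippendorff's alpha (Likert):     {alpha_likert:.3f}")
print(f"Krippendorff's alpha (continuous): {alpha_continuous:.3f}")

A higher alpha for one condition would indicate that raters in that condition agreed with each other more closely; the numbers above are placeholders, not results from the study.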
Anthology ID: 2020.winlp-1.33
Volume: Proceedings of the Fourth Widening Natural Language Processing Workshop
Month: July
Year: 2020
Address: Seattle, USA
Editors: Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue: WiNLP
Publisher: Association for Computational Linguistics
Pages: 124–127
URL: https://aclanthology.org/2020.winlp-1.33
DOI: 10.18653/v1/2020.winlp-1.33
Cite (ACL): Sashank Santhanam and Samira Shaikh. 2020. Understanding the Impact of Experiment Design for Evaluating Dialogue System Output. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 124–127, Seattle, USA. Association for Computational Linguistics.
Cite (Informal): Understanding the Impact of Experiment Design for Evaluating Dialogue System Output (Santhanam & Shaikh, WiNLP 2020)
Video: http://slideslive.com/38929573