SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

Bogdan Gliwa; Iwona Mochol; Maciej Biesek; Aleksander Wawer

doi:10.18653/v1/D19-5409

Abstract

This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators’ judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.

Anthology ID: D19-5409
Volume: Proceedings of the 2nd Workshop on New Frontiers in Summarization
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, Fei Liu
Venues: NewSum | WS
SIG: SIGSUMM
Publisher: Association for Computational Linguistics
Pages: 70–79
URL: https://aclanthology.org/D19-5409/
DOI: 10.18653/v1/D19-5409
Bibkey: gliwa-etal-2019-samsum
Cite (ACL): Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70–79, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization (Gliwa et al., NewSum 2019)
PDF: https://aclanthology.org/D19-5409.pdf

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@inproceedings{gliwa-etal-2019-samsum,
    title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization",
    author = "Gliwa, Bogdan  and
      Mochol, Iwona  and
      Biesek, Maciej  and
      Wawer, Aleksander",
    editor = "Wang, Lu  and
      Cheung, Jackie Chi Kit  and
      Carenini, Giuseppe  and
      Liu, Fei",
    booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D19-5409/",
    doi = "10.18653/v1/D19-5409",
    pages = "70--79",
    abstract = "This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news {--} in contrast with human evaluators' judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="gliwa-etal-2019-samsum">
    <titleInfo>
        <title>SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Bogdan</namePart>
        <namePart type="family">Gliwa</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Iwona</namePart>
        <namePart type="family">Mochol</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Maciej</namePart>
        <namePart type="family">Biesek</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Aleksander</namePart>
        <namePart type="family">Wawer</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2019-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2nd Workshop on New Frontiers in Summarization</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Lu</namePart>
            <namePart type="family">Wang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jackie</namePart>
            <namePart type="given">Chi</namePart>
            <namePart type="given">Kit</namePart>
            <namePart type="family">Cheung</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Giuseppe</namePart>
            <namePart type="family">Carenini</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Fei</namePart>
            <namePart type="family">Liu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Hong Kong, China</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators’ judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.</abstract>
    <identifier type="citekey">gliwa-etal-2019-samsum</identifier>
    <identifier type="doi">10.18653/v1/D19-5409</identifier>
    <location>
        <url>https://aclanthology.org/D19-5409/</url>
    </location>
    <part>
        <date>2019-11</date>
        <extent unit="page">
            <start>70</start>
            <end>79</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
%A Gliwa, Bogdan
%A Mochol, Iwona
%A Biesek, Maciej
%A Wawer, Aleksander
%Y Wang, Lu
%Y Cheung, Jackie Chi Kit
%Y Carenini, Giuseppe
%Y Liu, Fei
%S Proceedings of the 2nd Workshop on New Frontiers in Summarization
%D 2019
%8 November
%I Association for Computational Linguistics
%C Hong Kong, China
%F gliwa-etal-2019-samsum
%X This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators’ judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.
%R 10.18653/v1/D19-5409
%U https://aclanthology.org/D19-5409/
%U https://doi.org/10.18653/v1/D19-5409
%P 70-79

Markdown (Informal)

[SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization](https://aclanthology.org/D19-5409/) (Gliwa et al., NewSum 2019)