MLSUM: The Multilingual Summarization Corpus

Thomas Scialom; Paul-Alexis Dray; Sylvain Lamprier; Benjamin Piwowarski; Jacopo Staiano

doi:10.18653/v1/2020.emnlp-main.647

MLSUM: The Multilingual Summarization Corpus

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Correct Metadata for

Use this form to create a GitHub issue with structured data describing the correction. You will need a GitHub account. Once you create that issue, the correction will be reviewed by a staff member.

⚠️ Mobile Users: Submitting this form to create a new issue will only work with github.com, not the GitHub Mobile app.

Important: The Anthology treat PDFs as authoritative. Please use this form only to correct data that is out of line with the PDF. See our corrections guidelines if you need to change the PDF.

Title Adjust the title. Retain tags such as <fixed-case>.

Authors Adjust author names and order to match the PDF.

Abstract Correct abstract if needed. Retain XML formatting tags such as <tex-math>. You may use <b>...</b> for bold, <i>...</i> for italic, and <url>...</url> for URLs.

Verification against PDF Ensure that the new title/authors match the snapshot below. (If there is no snapshot or it is too small, consult the PDF.)

Authors concatenated from the text boxes above:

ALL author names match the snapshot above—including middle initials, hyphens, and accents.

Abstract

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.

Anthology ID:: 2020.emnlp-main.647
Volume:: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:: November
Year:: 2020
Address:: Online
Editors:: Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 8051–8067
Language:
URL:: https://aclanthology.org/2020.emnlp-main.647/
DOI:: 10.18653/v1/2020.emnlp-main.647
Bibkey:
Cite (ACL):: Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The Multilingual Summarization Corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8051–8067, Online. Association for Computational Linguistics.
Cite (Informal):: MLSUM: The Multilingual Summarization Corpus (Scialom et al., EMNLP 2020)
Copy Citation:
PDF:: https://aclanthology.org/2020.emnlp-main.647.pdf
Video:: https://slideslive.com/38938723

PDF Cite Search Video Fix data

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = "Scialom, Thomas  and
      Dray, Paul-Alexis  and
      Lamprier, Sylvain  and
      Piwowarski, Benjamin  and
      Staiano, Jacopo",
    editor = "Webber, Bonnie  and
      Cohn, Trevor  and
      He, Yulan  and
      Liu, Yang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.647/",
    doi = "10.18653/v1/2020.emnlp-main.647",
    pages = "8051--8067",
    abstract = "We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages {--} namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset."
}

Download as File

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="scialom-etal-2020-mlsum">
    <titleInfo>
        <title>MLSUM: The Multilingual Summarization Corpus</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Thomas</namePart>
        <namePart type="family">Scialom</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Paul-Alexis</namePart>
        <namePart type="family">Dray</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Sylvain</namePart>
        <namePart type="family">Lamprier</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Benjamin</namePart>
        <namePart type="family">Piwowarski</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jacopo</namePart>
        <namePart type="family">Staiano</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Bonnie</namePart>
            <namePart type="family">Webber</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Trevor</namePart>
            <namePart type="family">Cohn</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yulan</namePart>
            <namePart type="family">He</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yang</namePart>
            <namePart type="family">Liu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Online</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.</abstract>
    <identifier type="citekey">scialom-etal-2020-mlsum</identifier>
    <identifier type="doi">10.18653/v1/2020.emnlp-main.647</identifier>
    <location>
        <url>https://aclanthology.org/2020.emnlp-main.647/</url>
    </location>
    <part>
        <date>2020-11</date>
        <extent unit="page">
            <start>8051</start>
            <end>8067</end>
        </extent>
    </part>
</mods>
</modsCollection>

Download as File

%0 Conference Proceedings
%T MLSUM: The Multilingual Summarization Corpus
%A Scialom, Thomas
%A Dray, Paul-Alexis
%A Lamprier, Sylvain
%A Piwowarski, Benjamin
%A Staiano, Jacopo
%Y Webber, Bonnie
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F scialom-etal-2020-mlsum
%X We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages – namely, French, German, Spanish, Russian, Turkish. Together with English news articles from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
%R 10.18653/v1/2020.emnlp-main.647
%U https://aclanthology.org/2020.emnlp-main.647/
%U https://doi.org/10.18653/v1/2020.emnlp-main.647
%P 8051-8067

Download as File

Markdown (Informal)

[MLSUM: The Multilingual Summarization Corpus](https://aclanthology.org/2020.emnlp-main.647/) (Scialom et al., EMNLP 2020)

MLSUM: The Multilingual Summarization Corpus (Scialom et al., EMNLP 2020)

ACL

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The Multilingual Summarization Corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8051–8067, Online. Association for Computational Linguistics.