DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer; Milan Dojchinovski; Sebastian Hellmann

DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer, Milan Dojchinovski, Sebastian Hellmann

Correct Metadata for

Use this form to create a GitHub issue with structured data describing the correction. You will need a GitHub account. Once you create that issue, the correction will be reviewed by a staff member.

⚠️ Mobile Users: Submitting this form to create a new issue will only work with github.com, not the GitHub Mobile app.

Important: The Anthology treat PDFs as authoritative. Please use this form only to correct data that is out of line with the PDF. See our corrections guidelines if you need to change the PDF.

Title Adjust the title. Retain tags such as <fixed-case>.

Authors Adjust author names and order to match the PDF.

Abstract Correct abstract if needed. Retain XML formatting tags such as <tex-math>. You may use <b>...</b> for bold, <i>...</i> for italic, and <url>...</url> for URLs.

Verification against PDF Ensure that the new title/authors match the snapshot below. (If there is no snapshot or it is too small, consult the PDF.)

Authors concatenated from the text boxes above:

ALL author names match the snapshot above—including middle initials, hyphens, and accents.

Abstract

The ever increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need in large-scale training and evaluation corpora. Due to its size, its openness and relative quality, the Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are being described, as well as the corpus creation process, its format and interesting use-cases, like Named Entity Linking training and evaluation.

Anthology ID:: L16-1532
Volume:: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:: May
Year:: 2016
Address:: Portorož, Slovenia
Editors:: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:: 3339–3343
Language:
URL:: https://aclanthology.org/L16-1532/
DOI:
Bibkey:
Cite (ACL):: Martin Brümmer, Milan Dojchinovski, and Sebastian Hellmann. 2016. DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3339–3343, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):: DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus (Brümmer et al., LREC 2016)
Copy Citation:
PDF:: https://aclanthology.org/L16-1532.pdf

PDF Cite Search Fix data

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@inproceedings{brummer-etal-2016-dbpedia,
    title = "{DB}pedia Abstracts: A Large-Scale, Open, Multilingual {NLP} Training Corpus",
    author = {Br{\"u}mmer, Martin  and
      Dojchinovski, Milan  and
      Hellmann, Sebastian},
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Grobelnik, Marko  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, Helene  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1532/",
    pages = "3339--3343",
    abstract = "The ever increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need in large-scale training and evaluation corpora. Due to its size, its openness and relative quality, the Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are being described, as well as the corpus creation process, its format and interesting use-cases, like Named Entity Linking training and evaluation."
}

Download as File

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="brummer-etal-2016-dbpedia">
    <titleInfo>
        <title>DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Martin</namePart>
        <namePart type="family">Brümmer</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Milan</namePart>
        <namePart type="family">Dojchinovski</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Sebastian</namePart>
        <namePart type="family">Hellmann</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2016-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Nicoletta</namePart>
            <namePart type="family">Calzolari</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Khalid</namePart>
            <namePart type="family">Choukri</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Thierry</namePart>
            <namePart type="family">Declerck</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sara</namePart>
            <namePart type="family">Goggi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marko</namePart>
            <namePart type="family">Grobelnik</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Bente</namePart>
            <namePart type="family">Maegaard</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Joseph</namePart>
            <namePart type="family">Mariani</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Helene</namePart>
            <namePart type="family">Mazo</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Asuncion</namePart>
            <namePart type="family">Moreno</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jan</namePart>
            <namePart type="family">Odijk</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Stelios</namePart>
            <namePart type="family">Piperidis</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>European Language Resources Association (ELRA)</publisher>
            <place>
                <placeTerm type="text">Portorož, Slovenia</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>The ever increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need in large-scale training and evaluation corpora. Due to its size, its openness and relative quality, the Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are being described, as well as the corpus creation process, its format and interesting use-cases, like Named Entity Linking training and evaluation.</abstract>
    <identifier type="citekey">brummer-etal-2016-dbpedia</identifier>
    <location>
        <url>https://aclanthology.org/L16-1532/</url>
    </location>
    <part>
        <date>2016-05</date>
        <extent unit="page">
            <start>3339</start>
            <end>3343</end>
        </extent>
    </part>
</mods>
</modsCollection>

Download as File

%0 Conference Proceedings
%T DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus
%A Brümmer, Martin
%A Dojchinovski, Milan
%A Hellmann, Sebastian
%Y Calzolari, Nicoletta
%Y Choukri, Khalid
%Y Declerck, Thierry
%Y Goggi, Sara
%Y Grobelnik, Marko
%Y Maegaard, Bente
%Y Mariani, Joseph
%Y Mazo, Helene
%Y Moreno, Asuncion
%Y Odijk, Jan
%Y Piperidis, Stelios
%S Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16)
%D 2016
%8 May
%I European Language Resources Association (ELRA)
%C Portorož, Slovenia
%F brummer-etal-2016-dbpedia
%X The ever increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need in large-scale training and evaluation corpora. Due to its size, its openness and relative quality, the Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are being described, as well as the corpus creation process, its format and interesting use-cases, like Named Entity Linking training and evaluation.
%U https://aclanthology.org/L16-1532/
%P 3339-3343

Download as File

Markdown (Informal)

[DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus](https://aclanthology.org/L16-1532/) (Brümmer et al., LREC 2016)

DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus (Brümmer et al., LREC 2016)

ACL

Martin Brümmer, Milan Dojchinovski, and Sebastian Hellmann. 2016. DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3339–3343, Portorož, Slovenia. European Language Resources Association (ELRA).