Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing

Isaac Johnson; Lucie-Aimée Kaffee; Miriam Redi

doi:10.18653/v1/2024.wikinlp-1.14

Abstract

Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.

Anthology ID: 2024.wikinlp-1.14
Volume: Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Lucie-Aimée Kaffee, Angela Fan, Tajuddeen Gwadabe, Isaac Johnson, Fabio Petroni, Daniel van Strien
Venues: WikiNLP | WS
Publisher: Association for Computational Linguistics
Pages: 91–101
URL: https://aclanthology.org/2024.wikinlp-1.14/
DOI: 10.18653/v1/2024.wikinlp-1.14
Bibkey: johnson-etal-2024-wikimedia
Cite (ACL): Isaac Johnson, Lucie-Aimée Kaffee, and Miriam Redi. 2024. Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing. In Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia, pages 91–101, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing (Johnson et al., WikiNLP 2024)
PDF: https://aclanthology.org/2024.wikinlp-1.14.pdf

Export citation

@inproceedings{johnson-etal-2024-wikimedia,
    title = "Wikimedia data for {AI}: a review of Wikimedia datasets for {NLP} tasks and {AI}-assisted editing",
    author = "Johnson, Isaac  and
      Kaffee, Lucie-Aim{\'e}e  and
      Redi, Miriam",
    editor = "Kaffee, Lucie-Aim{\'e}e  and
      Fan, Angela  and
      Gwadabe, Tajuddeen  and
      Johnson, Isaac  and
      Petroni, Fabio  and
      van Strien, Daniel",
    booktitle = "Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wikinlp-1.14/",
    doi = "10.18653/v1/2024.wikinlp-1.14",
    pages = "91--101",
    abstract = "Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="johnson-etal-2024-wikimedia">
    <titleInfo>
        <title>Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Isaac</namePart>
        <namePart type="family">Johnson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Lucie-Aimée</namePart>
        <namePart type="family">Kaffee</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Miriam</namePart>
        <namePart type="family">Redi</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2024-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Lucie-Aimée</namePart>
            <namePart type="family">Kaffee</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Angela</namePart>
            <namePart type="family">Fan</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Tajuddeen</namePart>
            <namePart type="family">Gwadabe</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Isaac</namePart>
            <namePart type="family">Johnson</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Fabio</namePart>
            <namePart type="family">Petroni</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Daniel</namePart>
            <namePart type="family">van Strien</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Miami, Florida, USA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.</abstract>
    <identifier type="citekey">johnson-etal-2024-wikimedia</identifier>
    <identifier type="doi">10.18653/v1/2024.wikinlp-1.14</identifier>
    <location>
        <url>https://aclanthology.org/2024.wikinlp-1.14/</url>
    </location>
    <part>
        <date>2024-11</date>
        <extent unit="page">
            <start>91</start>
            <end>101</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
%A Johnson, Isaac
%A Kaffee, Lucie-Aimée
%A Redi, Miriam
%Y Kaffee, Lucie-Aimée
%Y Fan, Angela
%Y Gwadabe, Tajuddeen
%Y Johnson, Isaac
%Y Petroni, Fabio
%Y van Strien, Daniel
%S Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F johnson-etal-2024-wikimedia
%X Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.
%R 10.18653/v1/2024.wikinlp-1.14
%U https://aclanthology.org/2024.wikinlp-1.14/
%U https://doi.org/10.18653/v1/2024.wikinlp-1.14
%P 91-101

Markdown (Informal)

[Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing](https://aclanthology.org/2024.wikinlp-1.14/) (Johnson et al., WikiNLP 2024)