CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

Patrick Huber; Armen Aghajanyan; Barlas Oguz; Dmytro Okhonko; Wen-tau Yih; Sonal Gupta; Xilun Chen

doi:10.18653/v1/2022.findings-naacl.184

Abstract

We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data-points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question-answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks.

Anthology ID: 2022.findings-naacl.184
Volume: Findings of the Association for Computational Linguistics: NAACL 2022
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2402–2420
URL: https://aclanthology.org/2022.findings-naacl.184/
DOI: 10.18653/v1/2022.findings-naacl.184
Bibkey: huber-etal-2022-ccqa
Cite (ACL): Patrick Huber, Armen Aghajanyan, Barlas Oguz, Dmytro Okhonko, Scott Yih, Sonal Gupta, and Xilun Chen. 2022. CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2402–2420, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training (Huber et al., Findings 2022)
PDF: https://aclanthology.org/2022.findings-naacl.184.pdf
Video: https://aclanthology.org/2022.findings-naacl.184.mp4

Export citation

BibTeX

@inproceedings{huber-etal-2022-ccqa,
    title = "{CCQA}: A New Web-Scale Question Answering Dataset for Model Pre-Training",
    author = "Huber, Patrick  and
      Aghajanyan, Armen  and
      Oguz, Barlas  and
      Okhonko, Dmytro  and
      Yih, Scott  and
      Gupta, Sonal  and
      Chen, Xilun",
    editor = "Carpuat, Marine  and
      de Marneffe, Marie-Catherine  and
      Meza Ruiz, Ivan Vladimir",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.184/",
    doi = "10.18653/v1/2022.findings-naacl.184",
    pages = "2402--2420",
    abstract = "We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data-points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question-answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="huber-etal-2022-ccqa">
    <titleInfo>
        <title>CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Patrick</namePart>
        <namePart type="family">Huber</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Armen</namePart>
        <namePart type="family">Aghajanyan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Barlas</namePart>
        <namePart type="family">Oguz</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Dmytro</namePart>
        <namePart type="family">Okhonko</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Scott</namePart>
        <namePart type="family">Yih</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Sonal</namePart>
        <namePart type="family">Gupta</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Xilun</namePart>
        <namePart type="family">Chen</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2022-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: NAACL 2022</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Marine</namePart>
            <namePart type="family">Carpuat</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marie-Catherine</namePart>
            <namePart type="family">de Marneffe</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ivan</namePart>
            <namePart type="given">Vladimir</namePart>
            <namePart type="family">Meza Ruiz</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Seattle, United States</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data-points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question-answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks.</abstract>
    <identifier type="citekey">huber-etal-2022-ccqa</identifier>
    <identifier type="doi">10.18653/v1/2022.findings-naacl.184</identifier>
    <location>
        <url>https://aclanthology.org/2022.findings-naacl.184/</url>
    </location>
    <part>
        <date>2022-07</date>
        <extent unit="page">
            <start>2402</start>
            <end>2420</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
%A Huber, Patrick
%A Aghajanyan, Armen
%A Oguz, Barlas
%A Okhonko, Dmytro
%A Yih, Scott
%A Gupta, Sonal
%A Chen, Xilun
%Y Carpuat, Marine
%Y de Marneffe, Marie-Catherine
%Y Meza Ruiz, Ivan Vladimir
%S Findings of the Association for Computational Linguistics: NAACL 2022
%D 2022
%8 July
%I Association for Computational Linguistics
%C Seattle, United States
%F huber-etal-2022-ccqa
%X We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data-points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question-answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks.
%R 10.18653/v1/2022.findings-naacl.184
%U https://aclanthology.org/2022.findings-naacl.184/
%U https://doi.org/10.18653/v1/2022.findings-naacl.184
%P 2402-2420

Markdown (Informal)

[CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training](https://aclanthology.org/2022.findings-naacl.184/) (Huber et al., Findings 2022)
