Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

J. Edward Hu; Abhinav Singh; Nils Holzenberger; Matt Post; Benjamin Van Durme

doi:10.18653/v1/K19-1005

Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, Benjamin Van Durme

Correct Metadata for

Use this form to create a GitHub issue with structured data describing the correction. You will need a GitHub account. Once you create that issue, the correction will be reviewed by a staff member.

⚠️ Mobile Users: Submitting this form to create a new issue will only work with github.com, not the GitHub Mobile app.

Important: The Anthology treat PDFs as authoritative. Please use this form only to correct data that is out of line with the PDF. See our corrections guidelines if you need to change the PDF.

Title Adjust the title. Retain tags such as <fixed-case>.

Authors Adjust author names and order to match the PDF.

Abstract Correct abstract if needed. Retain XML formatting tags such as <tex-math>. You may use ... for bold, ... for italic, ... for underline, <sc>...</sc> for small-caps, <tt>...<tt> for typewriter text, <url>...</url> for URLs, <a href=...> for hyperlinks, and <par/> for paragraph breaks.

Verification against PDF Ensure that the new title/authors match the snapshot below. (If there is no snapshot or it is too small, consult the PDF.)

Authors concatenated from the text boxes above:

ALL author names match the snapshot above—including middle initials, hyphens, and accents.

Abstract

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

Anthology ID:: K19-1005
Volume:: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:: November
Year:: 2019
Address:: Hong Kong, China
Editors:: Mohit Bansal, Aline Villavicencio
Venue:: CoNLL
SIG:: SIGNLL
Publisher:: Association for Computational Linguistics
Note:
Pages:: 44–54
Language:
URL:: https://aclanthology.org/K19-1005/
DOI:: 10.18653/v1/K19-1005
Bibkey:
Cite (ACL):: J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, and Benjamin Van Durme. 2019. Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 44–54, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):: Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering (Hu et al., CoNLL 2019)
Copy Citation:
PDF:: https://aclanthology.org/K19-1005.pdf

PDF Cite Search Fix data

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@inproceedings{hu-etal-2019-large,
    title = "Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering",
    author = "Hu, J. Edward  and
      Singh, Abhinav  and
      Holzenberger, Nils  and
      Post, Matt  and
      Van Durme, Benjamin",
    editor = "Bansal, Mohit  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/K19-1005/",
    doi = "10.18653/v1/K19-1005",
    pages = "44--54",
    abstract = "Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks."
}

Download as File

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hu-etal-2019-large">
    <titleInfo>
        <title>Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">J</namePart>
        <namePart type="given">Edward</namePart>
        <namePart type="family">Hu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Abhinav</namePart>
        <namePart type="family">Singh</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Nils</namePart>
        <namePart type="family">Holzenberger</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Matt</namePart>
        <namePart type="family">Post</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Benjamin</namePart>
        <namePart type="family">Van Durme</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2019-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Mohit</namePart>
            <namePart type="family">Bansal</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Aline</namePart>
            <namePart type="family">Villavicencio</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Hong Kong, China</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.</abstract>
    <identifier type="citekey">hu-etal-2019-large</identifier>
    <identifier type="doi">10.18653/v1/K19-1005</identifier>
    <location>
        <url>https://aclanthology.org/K19-1005/</url>
    </location>
    <part>
        <date>2019-11</date>
        <extent unit="page">
            <start>44</start>
            <end>54</end>
        </extent>
    </part>
</mods>
</modsCollection>

Download as File

%0 Conference Proceedings
%T Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering
%A Hu, J. Edward
%A Singh, Abhinav
%A Holzenberger, Nils
%A Post, Matt
%A Van Durme, Benjamin
%Y Bansal, Mohit
%Y Villavicencio, Aline
%S Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
%D 2019
%8 November
%I Association for Computational Linguistics
%C Hong Kong, China
%F hu-etal-2019-large
%X Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.
%R 10.18653/v1/K19-1005
%U https://aclanthology.org/K19-1005/
%U https://doi.org/10.18653/v1/K19-1005
%P 44-54

Download as File

Markdown (Informal)

[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/) (Hu et al., CoNLL 2019)

Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering (Hu et al., CoNLL 2019)

ACL

J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, and Benjamin Van Durme. 2019. Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 44–54, Hong Kong, China. Association for Computational Linguistics.