Pretraining Without Attention

Junxiong Wang; Jing Nathan Yan; Albert Gu; Alexander M. Rush

doi:10.18653/v1/2023.findings-emnlp.5

Abstract

Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT and scales more efficiently to longer sequences.
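As a rough illustration of the idea in the abstract, the sketch below runs a fixed linear state-space recurrence over a sequence in both directions and multiplies the result by an input-dependent gate. It is a toy scalar version under stated assumptions, not the paper's BiGS layer: the function names, the scalar parameters `a`, `b`, `c`, the sigmoid gate, and `gate_w` are all hypothetical choices made for illustration. Note that the sequence mixing uses the same parameters for every input (a "static" layer), with no pairwise token-to-token scores as in attention.

```python
import math


def ssm_scan(xs, a, b, c):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_gated_ssm(xs, gate_w, a=0.5, b=1.0, c=1.0):
    """Toy bidirectional SSM with multiplicative gating (illustrative only).

    The gate (a sigmoid of gate_w * x_t) is an assumption made for this
    sketch; the paper's actual gating function is not given in the abstract.
    """
    fwd = ssm_scan(xs, a, b, c)              # left-to-right pass
    bwd = ssm_scan(xs[::-1], a, b, c)[::-1]  # right-to-left pass, re-aligned
    mixed = [f + r for f, r in zip(fwd, bwd)]
    gates = [1.0 / (1.0 + math.exp(-gate_w * x)) for x in xs]
    # Elementwise product: the gate modulates the mixed sequence per position.
    return [g * m for g, m in zip(gates, mixed)]


if __name__ == "__main__":
    print(bidirectional_gated_ssm([1.0, 0.0, 0.0], gate_w=0.0))
```

With `gate_w=0.0` every gate equals 0.5, so the output is simply half the sum of the forward and backward recurrences, which makes the effect of each component easy to check by hand.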

Anthology ID: 2023.findings-emnlp.5
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 58–69
URL: https://aclanthology.org/2023.findings-emnlp.5/
DOI: 10.18653/v1/2023.findings-emnlp.5
Bibkey: wang-etal-2023-pretraining
Cite (ACL): Junxiong Wang, Jing Nathan Yan, Albert Gu, and Alexander Rush. 2023. Pretraining Without Attention. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 58–69, Singapore. Association for Computational Linguistics.
Cite (Informal): Pretraining Without Attention (Wang et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.5.pdf

Export citation

BibTeX

@inproceedings{wang-etal-2023-pretraining,
    title = "Pretraining Without Attention",
    author = "Wang, Junxiong  and
      Yan, Jing Nathan  and
      Gu, Albert  and
      Rush, Alexander",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.5/",
    doi = "10.18653/v1/2023.findings-emnlp.5",
    pages = "58--69",
    abstract = "Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT and scales more efficiently to longer sequences."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="wang-etal-2023-pretraining">
    <titleInfo>
        <title>Pretraining Without Attention</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Junxiong</namePart>
        <namePart type="family">Wang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jing</namePart>
        <namePart type="given">Nathan</namePart>
        <namePart type="family">Yan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Albert</namePart>
        <namePart type="family">Gu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Alexander</namePart>
        <namePart type="family">Rush</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: EMNLP 2023</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Houda</namePart>
            <namePart type="family">Bouamor</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Juan</namePart>
            <namePart type="family">Pino</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Kalika</namePart>
            <namePart type="family">Bali</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Singapore</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT and scales more efficiently to longer sequences.</abstract>
    <identifier type="citekey">wang-etal-2023-pretraining</identifier>
    <identifier type="doi">10.18653/v1/2023.findings-emnlp.5</identifier>
    <location>
        <url>https://aclanthology.org/2023.findings-emnlp.5/</url>
    </location>
    <part>
        <date>2023-12</date>
        <extent unit="page">
            <start>58</start>
            <end>69</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Pretraining Without Attention
%A Wang, Junxiong
%A Yan, Jing Nathan
%A Gu, Albert
%A Rush, Alexander
%Y Bouamor, Houda
%Y Pino, Juan
%Y Bali, Kalika
%S Findings of the Association for Computational Linguistics: EMNLP 2023
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F wang-etal-2023-pretraining
%X Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT and scales more efficiently to longer sequences.
%R 10.18653/v1/2023.findings-emnlp.5
%U https://aclanthology.org/2023.findings-emnlp.5/
%U https://doi.org/10.18653/v1/2023.findings-emnlp.5
%P 58-69

Markdown (Informal)

[Pretraining Without Attention](https://aclanthology.org/2023.findings-emnlp.5/) (Wang et al., Findings 2023)

ACL

Junxiong Wang, Jing Nathan Yan, Albert Gu, and Alexander Rush. 2023. Pretraining Without Attention. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 58–69, Singapore. Association for Computational Linguistics.