@inproceedings{salhan-etal-2025-best,
title = "What is the Best Sequence Length for {B}aby{LM}?",
author = "Salhan, Suchir and
Diehl Martinez, Richard and
Goriely, Zebulon and
Buttery, Paula",
editor = "Charpentier, Lucas and
Choshen, Leshem and
Cotterell, Ryan and
Gul, Mustafa Omer and
Hu, Michael Y. and
Liu, Jing and
Jumelet, Jaap and
Linzen, Tal and
Mueller, Aaron and
Ross, Candace and
Shah, Raj Sanjay and
Warstadt, Alex and
Wilcox, Ethan Gotlieb and
Williams, Adina",
booktitle = "Proceedings of the First BabyLM Workshop",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.babylm-main.10/",
pages = "130--146",
ISBN = "TODO",
abstract = "Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="salhan-etal-2025-best">
<titleInfo>
<title>What is the Best Sequence Length for BabyLM?</title>
</titleInfo>
<name type="personal">
<namePart type="given">Suchir</namePart>
<namePart type="family">Salhan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Richard</namePart>
<namePart type="family">Diehl Martinez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zebulon</namePart>
<namePart type="family">Goriely</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Paula</namePart>
<namePart type="family">Buttery</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the First BabyLM Workshop</title>
</titleInfo>
<name type="personal">
<namePart type="given">Lucas</namePart>
<namePart type="family">Charpentier</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Leshem</namePart>
<namePart type="family">Choshen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ryan</namePart>
<namePart type="family">Cotterell</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mustafa</namePart>
<namePart type="given">Omer</namePart>
<namePart type="family">Gul</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Michael</namePart>
<namePart type="given">Y</namePart>
<namePart type="family">Hu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jing</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jaap</namePart>
<namePart type="family">Jumelet</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tal</namePart>
<namePart type="family">Linzen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Aaron</namePart>
<namePart type="family">Mueller</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Candace</namePart>
<namePart type="family">Ross</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Raj</namePart>
<namePart type="given">Sanjay</namePart>
<namePart type="family">Shah</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alex</namePart>
<namePart type="family">Warstadt</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ethan</namePart>
<namePart type="given">Gotlieb</namePart>
<namePart type="family">Wilcox</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Adina</namePart>
<namePart type="family">Williams</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">TODO</identifier>
</relatedItem>
<abstract>Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.</abstract>
<identifier type="citekey">salhan-etal-2025-best</identifier>
<location>
<url>https://aclanthology.org/2025.babylm-main.10/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>130</start>
<end>146</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T What is the Best Sequence Length for BabyLM?
%A Salhan, Suchir
%A Diehl Martinez, Richard
%A Goriely, Zebulon
%A Buttery, Paula
%Y Charpentier, Lucas
%Y Choshen, Leshem
%Y Cotterell, Ryan
%Y Gul, Mustafa Omer
%Y Hu, Michael Y.
%Y Liu, Jing
%Y Jumelet, Jaap
%Y Linzen, Tal
%Y Mueller, Aaron
%Y Ross, Candace
%Y Shah, Raj Sanjay
%Y Warstadt, Alex
%Y Wilcox, Ethan Gotlieb
%Y Williams, Adina
%S Proceedings of the First BabyLM Workshop
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ TODO
%F salhan-etal-2025-best
%X Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
%U https://aclanthology.org/2025.babylm-main.10/
%P 130-146
Markdown (Informal)
[What is the Best Sequence Length for BabyLM?](https://aclanthology.org/2025.babylm-main.10/) (Salhan et al., BabyLM 2025)
ACL
Suchir Salhan, Richard Diehl Martinez, Zebulon Goriely, and Paula Buttery. 2025. What is the Best Sequence Length for BabyLM? In Proceedings of the First BabyLM Workshop, pages 130–146, Suzhou, China. Association for Computational Linguistics.