Identifying and analyzing ‘noisy’ spelling errors in a second language corpus

Alan Juffs; Ben Naismith

doi:10.18653/v1/2025.wnut-1.4

Abstract

This paper addresses the problem of identifying and analyzing ‘noisy’ spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of 66 words or more (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros.
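
The abstract does not say how the Python step was implemented. As a rough illustration only, the sketch below shows one way a length filter and a dictionary lookup could produce a per-text count of candidate spelling errors. The word list, the tokenizer, and all function names are hypothetical and are not taken from the paper; only the 66-word threshold comes from the abstract.

```python
import re

# Hypothetical toy word list; a real analysis would need a full dictionary.
KNOWN_WORDS = {"the", "student", "writes", "a", "short", "text", "every", "day"}
MIN_WORDS = 66  # length threshold stated in the abstract


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_unknown(tokens, vocabulary):
    """Count tokens absent from the vocabulary (candidate spelling errors)."""
    return sum(1 for tok in tokens if tok not in vocabulary)


def error_counts(texts, vocabulary, min_words=MIN_WORDS):
    """Return one candidate-error count per text that meets the length threshold."""
    counts = []
    for text in texts:
        tokens = tokenize(text)
        if len(tokens) >= min_words:  # shorter texts are excluded entirely
            counts.append(count_unknown(tokens, vocabulary))
    return counts
```

Counts produced this way are non-negative integers, and texts with no flagged tokens contribute zeros, which is the kind of outcome variable the abstract says the hurdle() models were applied to.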

Anthology ID: 2025.wnut-1.4
Volume: Proceedings of the Tenth Workshop on Noisy and User-generated Text
Month: May
Year: 2025
Address: Albuquerque, New Mexico, USA
Editors: JinYeong Bak, Rob van der Goot, Hyeju Jang, Weerayut Buaphet, Alan Ramponi, Wei Xu, Alan Ritter
Venues: WNUT | WS
Publisher: Association for Computational Linguistics
Pages: 26–37
URL: https://aclanthology.org/2025.wnut-1.4/
DOI: 10.18653/v1/2025.wnut-1.4
Bibkey: juffs-naismith-2025-identifying
Cite (ACL): Alan Juffs and Ben Naismith. 2025. Identifying and analyzing ‘noisy’ spelling errors in a second language corpus. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 26–37, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Identifying and analyzing ‘noisy’ spelling errors in a second language corpus (Juffs & Naismith, WNUT 2025)
PDF: https://aclanthology.org/2025.wnut-1.4.pdf

Export citation

BibTeX

@inproceedings{juffs-naismith-2025-identifying,
    title = "Identifying and analyzing `noisy' spelling errors in a second language corpus",
    author = "Juffs, Alan  and
      Naismith, Ben",
    editor = "Bak, JinYeong  and
      Goot, Rob van der  and
      Jang, Hyeju  and
      Buaphet, Weerayut  and
      Ramponi, Alan  and
      Xu, Wei  and
      Ritter, Alan",
    booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.wnut-1.4/",
    doi = "10.18653/v1/2025.wnut-1.4",
    pages = "26--37",
    ISBN = "979-8-89176-232-9",
    abstract = "This paper addresses the problem of identifying and analyzing `noisy' spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of 66 words or more (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="juffs-naismith-2025-identifying">
    <titleInfo>
        <title>Identifying and analyzing ‘noisy’ spelling errors in a second language corpus</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Alan</namePart>
        <namePart type="family">Juffs</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ben</namePart>
        <namePart type="family">Naismith</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Tenth Workshop on Noisy and User-generated Text</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">JinYeong</namePart>
            <namePart type="family">Bak</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Rob</namePart>
            <namePart type="given">van</namePart>
            <namePart type="given">der</namePart>
            <namePart type="family">Goot</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Hyeju</namePart>
            <namePart type="family">Jang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Weerayut</namePart>
            <namePart type="family">Buaphet</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Alan</namePart>
            <namePart type="family">Ramponi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Wei</namePart>
            <namePart type="family">Xu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Alan</namePart>
            <namePart type="family">Ritter</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Albuquerque, New Mexico, USA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-232-9</identifier>
    </relatedItem>
    <abstract>This paper addresses the problem of identifying and analyzing ‘noisy’ spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of 66 words or more (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros.</abstract>
    <identifier type="citekey">juffs-naismith-2025-identifying</identifier>
    <identifier type="doi">10.18653/v1/2025.wnut-1.4</identifier>
    <location>
        <url>https://aclanthology.org/2025.wnut-1.4/</url>
    </location>
    <part>
        <date>2025-05</date>
        <extent unit="page">
            <start>26</start>
            <end>37</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Identifying and analyzing ‘noisy’ spelling errors in a second language corpus
%A Juffs, Alan
%A Naismith, Ben
%Y Bak, JinYeong
%Y Goot, Rob van der
%Y Jang, Hyeju
%Y Buaphet, Weerayut
%Y Ramponi, Alan
%Y Xu, Wei
%Y Ritter, Alan
%S Proceedings of the Tenth Workshop on Noisy and User-generated Text
%D 2025
%8 May
%I Association for Computational Linguistics
%C Albuquerque, New Mexico, USA
%@ 979-8-89176-232-9
%F juffs-naismith-2025-identifying
%X This paper addresses the problem of identifying and analyzing ‘noisy’ spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of 66 words or more (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros.
%R 10.18653/v1/2025.wnut-1.4
%U https://aclanthology.org/2025.wnut-1.4/
%U https://doi.org/10.18653/v1/2025.wnut-1.4
%P 26-37

Markdown (Informal)

[Identifying and analyzing ‘noisy’ spelling errors in a second language corpus](https://aclanthology.org/2025.wnut-1.4/) (Juffs & Naismith, WNUT 2025)

ACL

Alan Juffs and Ben Naismith. 2025. Identifying and analyzing ‘noisy’ spelling errors in a second language corpus. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 26–37, Albuquerque, New Mexico, USA. Association for Computational Linguistics.