Generating Errors: OCR Post-Processing for Icelandic

Atli Jasonarson; Steinþór Steingrímsson; Einar Sigurðsson; Árni Magnússon; Finnur Ingimundarson

Abstract

We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
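
The abstract does not say how the artificial errors were produced. As a purely illustrative sketch of what populating clean text with OCR-like errors can look like, the snippet below applies character confusions at a configurable rate to build (noisy, clean) training pairs. The confusion table, the function names `inject_errors` and `make_pair`, and the `rate` parameter are all assumptions made for illustration; they are not taken from the paper.

```python
import random

# Hypothetical confusion table (illustrative only, not from the paper):
# each key is a clean span, each value lists plausible OCR misreadings.
CONFUSIONS = {
    "ð": ["d", "o"],
    "þ": ["p", "b"],
    "í": ["i", "l"],
    "rn": ["m"],
    "m": ["rn"],
    "l": ["1", "i"],
}


def inject_errors(text, rate, rng):
    """Return a noisy copy of `text`.

    Each confusable span is replaced with probability `rate`,
    using the random generator `rng` so results are reproducible.
    """
    out = []
    i = 0
    while i < len(text):
        # Look for the longest confusable span starting at position i.
        for width in (2, 1):
            span = text[i:i + width]
            if len(span) == width and span in CONFUSIONS:
                break
        else:
            span = None

        if span is not None and rng.random() < rate:
            out.append(rng.choice(CONFUSIONS[span]))
            i += len(span)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def make_pair(clean, rate=0.1, seed=0):
    """Build one (noisy, clean) pair for a post-correction model."""
    rng = random.Random(seed)
    return inject_errors(clean, rate, rng), clean


if __name__ == "__main__":
    print(make_pair("það er barn í húsinu", rate=0.3, seed=1))
```

With `rate=0.0` the text comes back unchanged, and with `rate=1.0` every confusable span is replaced, so `inject_errors("barn", 1.0, random.Random(1))` yields `"bam"`. A realistic setup would presumably estimate the confusion table and rate from real OCR output rather than hard-coding them.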

Anthology ID: 2023.nodalida-1.29
Volume: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month: May
Year: 2023
Address: Tórshavn, Faroe Islands
Editors: Tanel Alumäe, Mark Fishel
Venue: NoDaLiDa
Publisher: University of Tartu Library
Pages: 286–291
URL: https://aclanthology.org/2023.nodalida-1.29/
Cite (ACL): Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, and Finnur Ingimundarson. 2023. Generating Errors: OCR Post-Processing for Icelandic. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 286–291, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal): Generating Errors: OCR Post-Processing for Icelandic (Jasonarson et al., NoDaLiDa 2023)
PDF: https://aclanthology.org/2023.nodalida-1.29.pdf

BibTeX

@inproceedings{jasonarson-etal-2023-generating,
    title = "Generating Errors: {OCR} Post-Processing for {I}celandic",
    author = "Jasonarson, Atli  and
      Steingr{\'i}msson, Stein{\th}{\'o}r  and
      Sigur{\dh}sson, Einar  and
      Magn{\'u}sson, {\'A}rni  and
      Ingimundarson, Finnur",
    editor = {Alum{\"a}e, Tanel  and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.29/",
    pages = "286--291",
    abstract = "We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google{'}s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="jasonarson-etal-2023-generating">
    <titleInfo>
        <title>Generating Errors: OCR Post-Processing for Icelandic</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Atli</namePart>
        <namePart type="family">Jasonarson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Steinþór</namePart>
        <namePart type="family">Steingrímsson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Einar</namePart>
        <namePart type="family">Sigurðsson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Árni</namePart>
        <namePart type="family">Magnússon</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Finnur</namePart>
        <namePart type="family">Ingimundarson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Tanel</namePart>
            <namePart type="family">Alumäe</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Mark</namePart>
            <namePart type="family">Fishel</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>University of Tartu Library</publisher>
            <place>
                <placeTerm type="text">Tórshavn, Faroe Islands</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.</abstract>
    <identifier type="citekey">jasonarson-etal-2023-generating</identifier>
    <location>
        <url>https://aclanthology.org/2023.nodalida-1.29/</url>
    </location>
    <part>
        <date>2023-05</date>
        <extent unit="page">
            <start>286</start>
            <end>291</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Generating Errors: OCR Post-Processing for Icelandic
%A Jasonarson, Atli
%A Steingrímsson, Steinþór
%A Sigurðsson, Einar
%A Magnússon, Árni
%A Ingimundarson, Finnur
%Y Alumäe, Tanel
%Y Fishel, Mark
%S Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
%D 2023
%8 May
%I University of Tartu Library
%C Tórshavn, Faroe Islands
%F jasonarson-etal-2023-generating
%X We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
%U https://aclanthology.org/2023.nodalida-1.29/
%P 286-291

Markdown (Informal)

[Generating Errors: OCR Post-Processing for Icelandic](https://aclanthology.org/2023.nodalida-1.29/) (Jasonarson et al., NoDaLiDa 2023)

ACL

Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, and Finnur Ingimundarson. 2023. Generating Errors: OCR Post-Processing for Icelandic. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 286–291, Tórshavn, Faroe Islands. University of Tartu Library.