Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features

Christian Gold, Ronja Laarmann-Quante, Torsten Zesch


Abstract
Handwritten texts produced by young learners often contain orthographic features such as spelling, capitalization, and punctuation errors, as well as impurities such as strikethroughs, insertions, and smudges. Existing transcriptions typically normalize or ignore all of these. For applications such as handwriting recognition that aim to automatically analyze a learner’s language performance, however, these features need to be retained. To address this, we present transcription guidelines that retain the features mentioned above. Our guidelines were developed iteratively and include numerous example images to illustrate the various issues. On a subset of about 90 double-transcribed texts, we compute inter-annotator agreement and show that our guidelines can be applied with a high percentage agreement of about .98. Overall, we transcribed 1,350 learner texts, which is about the same size as the widely adopted handwriting recognition datasets IAM (1,500 pages) and CVL (1,600 pages). Our final corpus can be used to train a handwriting recognition system whose transcriptions stay close to what young learners actually wrote. Such a system is a prerequisite for applying automatic orthography feedback systems to handwritten texts in the future.
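As a rough illustration of the agreement figure mentioned in the abstract, the following minimal sketch computes a character-level percentage agreement between two transcripts of the same text. The abstract does not state the unit of comparison or the alignment method, so the character-level alignment via Python's `difflib`, the function name `percentage_agreement`, and the example strings are all assumptions made for illustration, not the authors' procedure.

```python
from difflib import SequenceMatcher


def percentage_agreement(transcript_a: str, transcript_b: str) -> float:
    """Fraction of characters on which two transcripts agree.

    Assumption: agreement is measured per character after aligning the
    two strings; the paper may use a different unit or alignment.
    """
    if not transcript_a and not transcript_b:
        return 1.0  # two empty transcripts agree trivially
    # autojunk=False keeps frequent characters (e.g. spaces) in the alignment.
    matcher = SequenceMatcher(None, transcript_a, transcript_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Normalize by the longer transcript so insertions and deletions count.
    return matched / max(len(transcript_a), len(transcript_b))


# Hypothetical example: annotator A keeps the learner's misspelling,
# annotator B normalizes it by mistake.
annotator_a = "Der Hunt bellt"
annotator_b = "Der Hund bellt"
print(round(percentage_agreement(annotator_a, annotator_b), 3))
```

Normalizing by the longer transcript is one possible design choice; it penalizes a missed strikethrough or insertion in the same way as a substituted character.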
Anthology ID:
2023.cawl-1.3
Volume:
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Kyle Gorman, Richard Sproat, Brian Roark
Venue:
CAWL
Publisher:
Association for Computational Linguistics
Pages:
14–21
URL:
https://aclanthology.org/2023.cawl-1.3
DOI:
10.18653/v1/2023.cawl-1.3
Cite (ACL):
Christian Gold, Ronja Laarmann-Quante, and Torsten Zesch. 2023. Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features. In Proceedings of the Workshop on Computation and Written Language (CAWL 2023), pages 14–21, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Preserving the Authenticity of Handwritten Learner Language: Annotation Guidelines for Creating Transcripts Retaining Orthographic Features (Gold et al., CAWL 2023)
PDF:
https://aclanthology.org/2023.cawl-1.3.pdf