GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Masato Hagiwara, Masato Mita


Abstract
The lack of large-scale datasets has been a major hindrance to progress on NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors along with their corrections, harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits using classifiers trained on a small annotated subset, and demonstrate that typo edits can be identified with an F1 score of 0.9 using a very simple classifier with only three features. Detailed analyses of the dataset show that existing spelling correctors achieve an F-measure of only approximately 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complements existing datasets.
Anthology ID:
2020.lrec-1.835
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6761–6768
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.835
Cite (ACL):
Masato Hagiwara and Masato Mita. 2020. GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6761–6768, Marseille, France. European Language Resources Association.
Cite (Informal):
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors (Hagiwara & Mita, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.835.pdf
Code
mhagiwara/github-typo-corpus
Data
GitHub Typo Corpus