A Large List of Confusion Sets for Spellchecking Assessed Against a Corpus of Real-word Errors

Jennifer Pedler; Roger Mitton

Abstract

One of the methods that has been proposed for dealing with real-word errors (errors that occur when a correctly spelled word is substituted for the one intended) is the "confusion-set" approach - a confusion set being a small group of words that are likely to be confused with one another. Using a list of confusion sets drawn up in advance, a spellchecker, on finding one of these words in a text, can assess whether one of the other members of its set would be a better fit and, if it appears to be so, propose that word as a correction. Much of the research using this approach has suffered from two weaknesses. The first is the small number of confusion sets used. The second is that systems have largely been tested on artificial errors. In this paper we address these two weaknesses. We describe the creation of a realistically sized list of confusion sets, then the assembling of a corpus of real-word errors, and then we assess the potential of that list in relation to that corpus.
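
The confusion-set approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the two confusion sets, the toy bigram counts, and the names `CONFUSION_SETS`, `BIGRAM_COUNTS`, `context_score`, and `suggest_corrections` are all hypothetical, chosen only to show the general idea of checking whether another member of a word's set fits the context better.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical confusion sets for illustration; NOT the list built in the paper.
CONFUSION_SETS: List[FrozenSet[str]] = [
    frozenset({"their", "there"}),
    frozenset({"quiet", "quite"}),
]

# Toy bigram counts standing in for a real contextual model (an assumption).
BIGRAM_COUNTS: Dict[Tuple[str, str], int] = {
    ("over", "there"): 5,
    ("their", "house"): 4,
    ("was", "quite"): 3,
    ("quite", "good"): 3,
    ("very", "quiet"): 2,
}


def context_score(prev: str, word: str, nxt: str) -> int:
    """Score how well `word` fits between its left and right neighbours."""
    return BIGRAM_COUNTS.get((prev, word), 0) + BIGRAM_COUNTS.get((word, nxt), 0)


def suggest_corrections(tokens: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, word found, proposed word) for each suspected error.

    Tokens are assumed to be lowercased already.
    """
    suggestions = []
    for i, word in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        for cset in CONFUSION_SETS:
            if word not in cset:
                continue
            # Compare the word found against the other members of its set.
            current = context_score(prev, word, nxt)
            best = max(sorted(cset - {word}),
                       key=lambda w: context_score(prev, w, nxt))
            # Propose a correction only if the alternative fits strictly better.
            if context_score(prev, best, nxt) > current:
                suggestions.append((i, word, best))
    return suggestions


print(suggest_corrections("it was quiet good over their".split()))
```

With these toy counts, the sketch flags "quiet" and "their" in the example sentence and proposes "quite" and "there". A practical checker would replace the bigram table with a far richer model of context and would need a much larger list of confusion sets, which is the gap the paper sets out to address.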

Anthology ID: L10-1077
Volume: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month: May
Year: 2010
Address: Valletta, Malta
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/122_Paper.pdf
Cite (ACL): Jennifer Pedler and Roger Mitton. 2010. A Large List of Confusion Sets for Spellchecking Assessed Against a Corpus of Real-word Errors. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal): A Large List of Confusion Sets for Spellchecking Assessed Against a Corpus of Real-word Errors (Pedler & Mitton, LREC 2010)
PDF: http://www.lrec-conf.org/proceedings/lrec2010/pdf/122_Paper.pdf
