Corpus-Induced Corpus Clean-up

Martin Reynaert

Abstract

We explore the feasibility of using only unsupervised means to identify non-words, i.e. typos, in a frequency list derived from a large corpus of Dutch, and to distinguish these non-words from real words in the language. We call the system we built and evaluate in this paper ciccl, which stands for Corpus-Induced Corpus Clean-up. ciccl is primarily based on the anagram-key hashing algorithm introduced by Reynaert (2004). The core correction mechanism is a simple and effective method that translates the characters making up a word into a large natural number in such a way that all anagrams, i.e. all words composed of precisely the same characters in any order, are allocated the same natural number. In effect, this constitutes a novel approximate string matching algorithm for indexed text search: by simple addition, subtraction or a combination of both, all variants within reach of the range of numerical values defined in the alphabet are retrieved by iterating over the alphabet. ciccl's input consists primarily of corpus-derived frequency lists, from which it derives valuable morphological information by performing frequency counts over the substrings of the words. These counts are then used to perform decompounding, as well as to distinguish between words that are most likely correctly spelled and typos.
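The anagram-key idea described in the abstract can be illustrated with a minimal Python sketch. It is not the ciccl implementation. The abstract does not state the hashing function, so the sketch assumes that each character contributes its code point raised to the fifth power; the names `POWER`, `anagram_key`, `build_index` and `variants`, as well as the toy lexicon, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Assumption: each character contributes its code point raised to this power.
# The abstract only says that a word is mapped to "a large natural number".
POWER = 5


def anagram_key(word: str) -> int:
    """Sum of per-character values; identical for all anagrams of a word."""
    return sum(ord(ch) ** POWER for ch in word)


def build_index(words):
    """Map each anagram key to the set of indexed words sharing that key."""
    index = defaultdict(set)
    for w in words:
        index[anagram_key(w)].add(w)
    return index


def variants(word, index, alphabet):
    """Look up indexed words whose key differs from the key of `word` by
    adding one alphabet value, subtracting one of the word's own character
    values, or both."""
    key = anagram_key(word)
    alpha_values = [ord(c) ** POWER for c in alphabet]
    candidate_keys = set()
    for a in alpha_values:
        candidate_keys.add(key + a)          # one character inserted
    for ch in set(word):
        v = ord(ch) ** POWER
        candidate_keys.add(key - v)          # one character deleted
        for a in alpha_values:
            # One character substituted; when a == v this is the word's own
            # key, which retrieves its anagrams (e.g. transpositions).
            candidate_keys.add(key - v + a)
    found = set()
    for k in candidate_keys:
        found |= index.get(k, set())
    found.discard(word)
    return found


lexicon = ["huis", "huist", "hius", "muis", "hui", "boom"]
index = build_index(lexicon)
print(sorted(variants("huis", index, "abcdefghijklmnopqrstuvwxyz")))
# ['hius', 'hui', 'huist', 'muis']
```

Because the key ignores character order, a lookup returns every indexed word with the matching bag of characters, so a real system would presumably still need to filter and rank the retrieved candidates, for instance by edit distance and frequency.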

Anthology ID: L06-1126
Volume: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month: May
Year: 2006
Address: Genoa, Italy
Editors: Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/229_pdf.pdf
Cite (ACL): Martin Reynaert. 2006. Corpus-Induced Corpus Clean-up. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal): Corpus-Induced Corpus Clean-up (Reynaert, LREC 2006)
PDF: http://www.lrec-conf.org/proceedings/lrec2006/pdf/229_pdf.pdf
