Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance

Siim Orasmaa; Reina Käärik; Jaak Vilo; Tiit Hennoste

Abstract

An important feature of spoken language corpora is the existence of different spelling variants of words in transcription. This poses a problem for linguists who work with large spoken corpora: how can all variants of a word be found without annotating them manually? Our work describes a search engine that finds different spelling variants (true positives) in a corpus of spoken language and efficiently reduces the number of false positives returned during the search. The search engine uses a generalized variant of the edit distance algorithm that allows text-specific string-to-string transformations to be defined in addition to the default edit operations of edit distance. We have extended the algorithm with the capability to block transformations in specific substrings of search words: the user can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu, which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.
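The idea of a generalized edit distance with blocked regions can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `generalized_edit_distance`, the unit cost of the default operations, the representation of blocked regions as half-open index ranges, and the rule that an insertion is forbidden only strictly inside a blocked region are all assumptions made for illustration. The example strings are toy data, not items from the corpus.

```python
import math


def generalized_edit_distance(query, candidate, transformations=None, blocked=()):
    """Cost of turning `query` into `candidate`.

    transformations: dict mapping (src, dst) string pairs to a cost
        (hypothetical format; both strings must be non-empty).
    blocked: iterable of half-open (start, end) index ranges in `query`
        where no edit operation may be applied (assumed representation).
    Returns math.inf when the blocked regions make a match impossible.
    """
    transformations = transformations or {}
    n, m = len(query), len(candidate)

    # Mark every blocked character position of the query.
    is_blocked = [False] * n
    for start, end in blocked:
        for k in range(max(start, 0), min(end, n)):
            is_blocked[k] = True

    def insertion_allowed(i):
        # Gap i lies between query[i-1] and query[i]; an insertion is
        # forbidden only when both neighbours are blocked (assumption).
        return not (0 < i < n and is_blocked[i - 1] and is_blocked[i])

    # d[i][j] = cheapest way to turn query[:i] into candidate[:j].
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0 and not is_blocked[i - 1]:
                best = min(best, d[i - 1][j] + 1)          # deletion
            if j > 0 and insertion_allowed(i):
                best = min(best, d[i][j - 1] + 1)          # insertion
            if i > 0 and j > 0:
                if query[i - 1] == candidate[j - 1]:
                    best = min(best, d[i - 1][j - 1])      # exact match
                elif not is_blocked[i - 1]:
                    best = min(best, d[i - 1][j - 1] + 1)  # substitution
            # User-defined string-to-string transformations.
            for (src, dst), cost in transformations.items():
                ls, ld = len(src), len(dst)
                if ls == 0 or ld == 0 or ls > i or ld > j:
                    continue
                if (query[i - ls:i] == src
                        and candidate[j - ld:j] == dst
                        and not any(is_blocked[i - ls:i])):
                    best = min(best, d[i - ls][j - ld] + cost)
            d[i][j] = best
    return d[n][m]


# Toy usage: a cheap "ks" -> "x" transformation, then a blocked region.
rules = {("ks", "x"): 0.5}
plain = generalized_edit_distance("taksi", "taxi")
with_rule = generalized_edit_distance("taksi", "taxi", rules)
with_block = generalized_edit_distance("taksi", "taxi", rules, blocked=[(2, 4)])
```

In this sketch a search would accept a candidate when its distance falls below some threshold. Blocking the region that covers "ks" makes the candidate unreachable, which is how blocked regions can cut down false positives.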

Anthology ID: L10-1410
Volume: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month: May
Year: 2010
Address: Valletta, Malta
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue: LREC
SIG:
Publisher: European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/600_Paper.pdf
DOI:
Bibkey:
Cite (ACL): Siim Orasmaa, Reina Käärik, Jaak Vilo, and Tiit Hennoste. 2010. Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal): Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance (Orasmaa et al., LREC 2010)
PDF: http://www.lrec-conf.org/proceedings/lrec2010/pdf/600_Paper.pdf
