Jaak Vilo


2010

Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Siim Orasmaa | Reina Käärik | Jaak Vilo | Tiit Hennoste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An important feature of spoken language corpora is the existence of different spelling variants of words in transcriptions. This poses a problem for linguists who work with large spoken corpora: how to find all variants of a word without annotating them manually? Our work describes a search engine that finds different spelling variants (true positives) in a corpus of spoken language and efficiently reduces the number of false positives returned during the search. The search engine uses a generalized variant of the edit distance algorithm that allows defining text-specific string-to-string transformations in addition to the default edit operations. We have extended the algorithm with the ability to block transformations in specific substrings of search words: the user can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu, which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.
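
The idea of a generalized edit distance with blocked regions can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the cost values, the way transformations are passed in (a dictionary from a (source, target) string pair to a cost), the representation of blocked regions as a set of character positions in the search word, and the example strings are all assumptions made for illustration.

```python
from math import inf


def generalized_edit_distance(pattern, text, transformations=None, blocked=None):
    """Hypothetical sketch: edit distance from `pattern` (search word) to `text`.

    transformations: dict {(src, dst): cost} of extra string-to-string rewrites
                     (assumed format; costs must be non-negative).
    blocked:         set of 0-based positions in `pattern` that may not be edited.
    """
    transformations = transformations or {}
    blocked = blocked or set()
    n, m = len(pattern), len(text)

    def editable(start, end):
        # True if no pattern position in [start, end) is blocked.
        return all(k not in blocked for k in range(start, end))

    def insertion_allowed(i):
        # Assumption: an insertion is forbidden only strictly inside a blocked run.
        return not ((i - 1) in blocked and i in blocked)

    # d[i][j] = cheapest cost of turning pattern[:i] into text[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n and j < m:
                if pattern[i] == text[j]:          # exact match, always allowed
                    d[i + 1][j + 1] = min(d[i + 1][j + 1], cur)
                elif i not in blocked:             # substitution
                    d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + 1)
            if i < n and i not in blocked:         # deletion from the pattern
                d[i + 1][j] = min(d[i + 1][j], cur + 1)
            if j < m and insertion_allowed(i):     # insertion of a text character
                d[i][j + 1] = min(d[i][j + 1], cur + 1)
            # User-defined string-to-string transformations.
            for (src, dst), cost in transformations.items():
                if (pattern.startswith(src, i) and text.startswith(dst, j)
                        and editable(i, i + len(src))):
                    ni, nj = i + len(src), j + len(dst)
                    d[ni][nj] = min(d[ni][nj], cur + cost)
    return d[n][m]


# Made-up example strings and costs, for illustration only.
rules = {("ks", "x"): 0.2}
print(generalized_edit_distance("taksi", "taxi"))                         # 2.0
print(generalized_edit_distance("taksi", "taxi", rules))                  # 0.2
print(generalized_edit_distance("taksi", "taxi", rules, blocked={2, 3}))  # inf
```

In this sketch a cheap transformation lets a likely spelling variant score close to the search word, while blocking the positions of "ks" rules that variant out entirely, which is how blocked regions could cut down false positives.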

2008

Strengthening the Estonian Language Technology
Einar Meister | Jaak Vilo
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper gives an overview of developments in Estonia in the field of Human Language Technologies (HLT). Despite the fact that Estonian is one of the smallest official languages in the EU, and is therefore in a less favourable position in the HLT market, national initiatives are being undertaken to promote HLT development in Estonia. The paper introduces recent activities in Estonia, including the National Programme for Estonian Language Technology (2006-2010).