Michael P. Oakes
The use of approximate string matching techniques in the alignment of sentences in parallel corpora
Anthony M. McEnery | Michael P. Oakes | Roger G. Garside
Proceedings of the Second International Conference on Machine Translation: Ten years on
Parallel corpora such as the Canadian Hansard corpus and the International Telecommunications Union (ITU) corpus each provide the same text in two or more languages, and have been aptly described as the "Rosetta Stone" of modern corpus linguistics. Their use within MT is burgeoning: they now permeate all levels of the discipline, and even serve as the basis of full-blown statistically based MT systems. This paper concerns the task of automatic bilingual lexicon construction, one of the major goals of the CRATER project (“Corpus Resources and Terminology Extraction”, funded under the MLAP initiative of the CEC, grant number MLAP-93/20). The approach to bilingual lexicon construction taken here entails first aligning the corpora and then searching the aligned corpus in detail for lexical cognates. Consequently, the paper begins with a brief discussion of the alignment procedures used on the project to date, and then moves to a discussion of the various similarity metrics used to evaluate lexical similarity.
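To make the idea of a lexical similarity metric concrete, the sketch below scores a candidate cognate pair with Dice's coefficient over character bigrams. This is an illustrative assumption: the summary above does not say which metrics the paper evaluates, and the function names `char_bigrams` and `dice_similarity` are hypothetical.

```python
from collections import Counter


def char_bigrams(word):
    """Return the overlapping character bigrams of a lower-cased word."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


def dice_similarity(a, b):
    """Dice's coefficient over character bigrams (assumed metric).

    Computes 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|),
    counting repeated bigrams with multiplicity. Returns 0.0 when
    neither word is long enough to contain a bigram.
    """
    bigrams_a = Counter(char_bigrams(a))
    bigrams_b = Counter(char_bigrams(b))
    total = sum(bigrams_a.values()) + sum(bigrams_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared bigram.
    shared = sum((bigrams_a & bigrams_b).values())
    return 2.0 * shared / total


# Candidate cognates from an English-French sentence pair.
score = dice_similarity("government", "gouvernement")
```

A score near 1 suggests that two word forms are likely cognates, while a score near 0 suggests they are unrelated. The threshold for accepting a pair would have to be tuned on aligned data.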