Kornél Markó

Also published as: Kornel Marko, Kornel Markó


2006

Language Specific and Topic Focused Web Crawling
Olena Medelyan | Stefan Schulz | Jan Paetzold | Michael Poprat | Kornél Markó
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe an experiment on collecting large language and topic specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the crawler builds a new large collection consisting only of documents that satisfy both the language and the topic model. The manual analysis of acquired English and German medicine corpora reveals the high accuracy of the crawler. However, there are significant differences between the two languages.

Semantic Atomicity and Multilinguality in the Medical Domain: Design Considerations for the MorphoSaurus Subword Lexicon
Stefan Schulz | Kornél Markó | Philipp Daumke | Udo Hahn | Susanne Hanser | Percy Nohama | Roosewelt Leite de Andrade | Edson Pacheco | Martin Romacker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present the lexico-semantic foundations underlying a multilingual lexicon the entries of which are constituted by so-called subwords. These subwords reflect semantic atomicity constraints in the medical domain which diverge from canonical lexicological understanding in NLP. We focus here on criteria to identify and delimit reasonable subword units, to group them into functionally adequate synonymy classes and relate them by two types of lexical relations. The lexicon we implemented on the basis of these considerations forms the lexical backbone for MorphoSaurus, a cross-language document retrieval engine for the medical domain.

2005

Subword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval
Udo Hahn | Kornel Marko | Stefan Schulz
Proceedings of Machine Translation Summit X: Papers

We introduce a light-weight interlingua for a cross-language document retrieval system in the medical domain. It is composed of equivalence classes of semantically primitive, language-specific subwords which are clustered by interlingual and intralingual synonymy. Each subword cluster represents a basic conceptual entity of the language-independent interlingua. Documents, as well as queries, are mapped to this interlingua level on which retrieval operations are performed. Evaluation experiments reveal that this interlingua-based retrieval model outperforms a direct translation approach.

2004

Cognate Mapping - A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon
Stefan Schulz | Kornel Markó | Eduardo Sbrissia | Percy Nohama | Udo Hahn
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics