2020
Current Challenges in Web Corpus Building
Miloš Jakubíček | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the 12th Web as Corpus Workshop
In this paper we discuss some of the current challenges in web corpus building that we have faced in recent years when expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and stimulate discussion of possible solutions, rather than to present ready-made ones. For every issue we try to assess its severity and briefly discuss possible mitigation options.
2016
English-French Document Alignment Based on Keywords and Statistical Translation
Marek Medveď | Miloš Jakubíček | Vojtěch Kovář
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Finding Definitions in Large Corpora with Sketch Engine
Vojtěch Kovář | Monika Močiariková | Pavel Rychlý
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The paper describes automatic definition finding implemented within the leading corpus query and management tool, Sketch Engine. The implementation exploits complex pattern-matching queries in the corpus query language (CQL) and the indexing mechanism of word sketches for finding and storing definition candidates throughout the corpus. The approach is evaluated on Czech and English corpora, showing that the results are usable in practice: the precision of the tool ranges between 30 and 75 percent (depending on the major corpus text types), and we were able to extract nearly 2 million definition candidates from an English corpus of 1.4 billion words. The feature is embedded into the interface as a concordance filter, so that users can search for definitions of any corpus query, including very specific multi-word queries. The results also indicate that ordinary texts (unlike explanatory texts) contain a rather low number of definitions, which is perhaps the most important problem with automatic definition finding in general.
2014
Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
2010
Through Low-Cost Annotation to Reliable Parsing Evaluation
Marek Grác | Miloš Jakubíček | Vojtěch Kovář
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation