Jozef Juhár

Also published as: Jozef Juhar


Building of children speech corpus for improving automatic subtitling services
Matus Pleva | Stanislav Ondas | Daniel Hládek | Jozef Juhar | Ján Staš | Yuan-Fu Liao
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)


Evaluation Set for Slovak News Information Retrieval
Daniel Hládek | Jan Staš | Jozef Juhár
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work proposes an information retrieval evaluation set for the Slovak language. A set of 80 queries written in natural language is given together with the set of relevant documents. The document set contains 3980 newspaper articles sorted into 6 categories. Each document in the result set is manually annotated for relevance to its corresponding query. The evaluation set is mostly compatible with the Cranfield test collection, using the same methodology for queries and relevance annotation. In addition, it provides annotations for document title, author, publication date and category that can be used to evaluate automatic document clustering and categorization.

An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
Peter Viszlay | Ján Staš | Tomáš Koctúr | Martin Lojka | Jozef Juhár
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.


The Slovak Categorized News Corpus
Daniel Hladek | Jan Stas | Jozef Juhar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The presented corpus aims to be the first attempt to create a representative sample of the contemporary Slovak language from various domains that supports easy searching and automated processing. This first version of the corpus contains words, automatic morphological and named entity annotations, and transcriptions of abbreviations and numerals. An integral part of the paper is a word boundary and sentence boundary detection algorithm that utilizes characteristic features of the language.

TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation
Matúš Pleva | Jozef Juhár
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents an overview of the existing acoustic corpora suitable for the automatic transcription of broadcast news in the Slovak language. The TUKE-BNews-SK database, created in our department, was built to support application development for automatic broadcast news processing and spontaneous speech recognition in Slovak. The audio corpus is composed of 479 Slovak TV broadcast news shows from the public Slovak television channel STV1, also called “Jednotka”, containing 265 hours of material and 186 hours of clean transcribed speech (a 4-hour subset was extracted for testing purposes). The recordings were manually transcribed using the Transcriber tool, modified for Slovak annotators and automatic Slovak spell checking. The corpus design, acquisition, annotation scheme and pronunciation transcription are described together with corpus statistics and the tools used. Finally, the evaluation procedure using automatic speech recognition is presented on the broadcast news and parliamentary speeches test sets.


The COST 278 MASPER Initiative - Crosslingual Speech Recognition with Large Telephone Databases
Andrej Žgank | Zdravko Kačič | Frank Diehl | Klara Vicsi | Gyorgy Szaszak | Jozef Juhar | Slavomir Lihan
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)