2012
pdf
bib
abs
Creating a Data Collection for Evaluating Rich Speech Retrieval
Maria Eskevich
|
Gareth J.F. Jones
|
Martha Larson
|
Roeland Ordelman
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We describe the development of a test collection for the investigation of speech retrieval beyond identification of relevant content. This collection focuses on satisfying user information needs for queries associated with specific types of speech acts. The collection is based on an archive of the Internet video from Internet video sharing platform (blip.tv), and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach was used to identify segments in the video data which contain speech acts, to create a description of the video containing the act and to generate search queries designed to refind this speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. We highlight the challenges of constructing this dataset, including the selection of the data source, design of the crowdsouring task and the specification of queries and relevant items.
2009
pdf
bib
abs
Relevance of ASR for the Automatic Generation of Keywords Suggestions for TV programs
Véronique Malaisé
|
Luit Gazendam
|
Willemijn Heeren
|
Roeland Ordelman
|
Hennie Brugman
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
Semantic access to multimedia content in audiovisual archives is to a large extent dependent on quantity and quality of the metadata, and particularly the content descriptions that are attached to the individual items. However, the manual annotation of collections puts heavy demands on resources. A large number of archives are introducing (semi) automatic annotation techniques for generating and/or enhancing metadata. The NWO funded CATCH-CHOICE project has investigated the extraction of keywords from textual resources related to TV programs to be archived (context documents), in collaboration with the Dutch audiovisual archives, Sound and Vision. This paper investigates the suitability of Automatic Speech Recognition transcripts produced in the CATCH-CHoral project for generating such keywords, which we evaluate against manual annotations of the documents, and against keywords automatically generated from context documents describing the TV programs’ content.
2008
pdf
bib
abs
Evaluation of Spoken Document Retrieval for Historic Speech Collections
Willemijn Heeren
|
Franciska de Jong
|
Laurens van der Werff
|
Marijn Huijbregts
|
Roeland Ordelman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. In this paper the evaluation efforts undertaken so far to assess and improve various aspects of the framework are presented. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.
pdf
bib
abs
From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk
|
Martin Reynaert
|
Paola Monachesi
|
Gertjan Van Noord
|
Roeland Ordelman
|
Ineke Schuurman
|
Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.
2006
pdf
bib
abs
Annotating Emotions in Meetings
Dennis Reidsma
|
Dirk Heylen
|
Roeland Ordelman
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
We present the results of two trials testing procedures for the annotation of emotion and mental state of the AMI corpus. The first procedure is an adaptation of the FeelTrace method, focusing on a continuous labelling of emotion dimensions. The second method is centered around more discrete labeling of segments using categorical labels. The results reported are promising for this hard task.