Roeland van Hout

2024

pdf bib abs
A Joint Approach for Automatic Analysis of Reading and Writing Errors
Wieke Harmsen | Catia Cucchiarini | Roeland van Hout | Helmer Strik
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

Analyzing the errors that children make on their ways to becoming fluent readers and writers can provide invaluable scientific insights into the processes that underlie literacy acquisition. To this end, we present in this paper an extension of an earlier developed spelling error detection and classification algorithm for Dutch, so that reading errors can also be automatically detected from their phonetic transcription. The strength of this algorithm lies in its ability to detect errors at Phoneme-Corresponding Unit (PCU) level, where a PCU is a sequence of letters corresponding to one phoneme. We validated this algorithm and found good agreement between manual and automatic reading error classifications. We also used the algorithm to analyze written words by second graders and phonetic transcriptions of read words by first graders. With respect to the writing data, we found that the PCUs ‘ei’, ‘eu’, ‘g’, ‘ij’ and ‘ch’ were most frequently written incorrectly, for the reading data, these were the PCUs ‘v’, ‘ui’, ‘ng’, ‘a’ and ‘g’. This study presents a first attempt at developing a joint method for detecting reading and writing errors. In future research this algorithm can be used to analyze corpora containing reading and writing data from the same children.

2022

pdf bib abs
The Effect of eHealth Training on Dysarthric Speech
Chiara Pesenti | Loes Van Bemmel | Roeland van Hout | Helmer Strik
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

In the current study on dysarthric speech, we investigate the effect of web-based treatment, and whether there is a difference between content and function words. Since the goal of the treatment is to speak louder, without raising pitch, we focus on acoustic-phonetic features related to loudness, intensity, and pitch. We analyse dysarthric read speech from eight speakers at word level. We also investigate whether there are differences between content words and function words, and whether the treatment has a different impact on these two classes of words. Linear Mixed-Effects models show that there are differences before and after treatment, that for some speakers the treatment has the desired effect, but not for all speakers, and that the effect of the treatment on words for the two categories does not seem to be different. To a large extent, our results are in line with the results of a previous study in which the same data were analyzed in a different way, i.e. by studying intelligibility scores.

2018

pdf bib
A Fast and Flexible Webinterface for Dialect Research in the Low Countries
Roeland van Hout | Nicoline van der Sijs | Erwin Komen | Henk van den Heuvel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs
Palabras: Crowdsourcing Transcriptions of L2 Speech
Eric Sanders | Pepi Burgos | Catia Cucchiarini | Roeland van Hout
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We developed a web application for crowdsourcing transcriptions of Dutch words spoken by Spanish L2 learners. In this paper we discuss the design of the application and the influence of metadata and various forms of feedback. Useful data were obtained from 159 participants, with an average of over 20 transcriptions per item, which seems a satisfactory result for this type of research. Informing participants about how many items they still had to complete, and not how many they had already completed, turned to be an incentive to do more items. Assigning participants a score for their performance made it more attractive for them to carry out the transcription task, but this seemed to influence their performance. We discuss possible advantages and disadvantages in connection with the aim of the research and consider possible lessons for designing future experiments.

2014

pdf bib abs
ASR-based CALL systems and learner speech data: new resources and opportunities for research and development in second language learning
Catia Cucchiarini | Steve Bodnar | Bart Penning de Vries | Roeland van Hout | Helmer Strik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we describe the language resources developed within the project Feedback and the Acquisition of Syntax in Oral Proficiency (FASOP), which is aimed at investigating the effectiveness of various forms of practice and feedback on the acquisition of syntax in second language (L2) oral proficiency, as well as their interplay with learner characteristics such as education level, learner motivation and confidence. For this purpose, use is made of a Computer Assisted Language Learning (CALL) system that employs Automatic Speech Recognition (ASR) technology to allow spoken interaction and to create an experimental environment that guarantees as much control over the language learning setting as possible. The focus of the present paper is on the resources that are being produced in FASOP. In line with the theme of this conference, we present the different types of resources developed within this project and the way in which these could be used to pursue innovative research in second language acquisition and to develop and improve ASR-based language learning applications.

The VALID Data Archive is an open multimedia data archive (under construction) with data from speakers suffering from language impairments. We report on a pilot project in the CLARIN-NL framework in which five data resources were curated. For all data sets concerned, written informed consent from the participants or their caretakers has been obtained. All materials were anonymized. The audio files were converted into wav (linear PCM) files and the transcriptions into CHAT or ELAN format. Research data that consisted of test, SPSS and Excel files were documented and converted into CSV files. All data sets obtained appropriate CMDI metadata files. A new CMDI metadata profile for this type of data resources was established and care was taken that ISOcat metadata categories were used to optimize interoperability. After curation all data are deposited at the Max Planck Institute for Psycholinguistics Nijmegen where persistent identifiers are linked to all resources. The content of the transcriptions in CHAT and plain text format can be searched with the TROVA search engine.

2008

pdf bib abs
Evaluating the Relationship between Linguistic and Geographic Distances using a 3D Visualization
Folkert de Vriend | Jan Pieter Kunst | Louis ten Bosch | Charlotte Giesbers | Roeland van Hout
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we discuss how linguistic and geographic distances can be related using a 3D visualization. We will convert linguistic data for locations along the German-Dutch border to linguistic distances that can be compared directly to geographic distances. This enables us to visualize linguistic distances as real distances with the use of the third dimension available in 3D modelling software. With such a visualization we will test if descriptive dialect data support the hypothesis that the German-Dutch state border became a linguistic border between the German and Dutch dialects. Our visualization is implemented in the 3D modelling software SketchUp.

2006

pdf bib abs
A Unified Structure for Dutch Dialect Dictionary Data
Folkert de Vriend | Lou Boves | Henk van den Heuvel | Roeland van Hout | Joep Kruijsen | Jos Swanenberg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The traditional dialect vocabulary of the Netherlands and Flanders is recorded and researched in several Dutch and Belgian research institutes and universities. Most of these distributed dictionary creation and research projects collaborate in the Permanent Overlegorgaan Regionale Woordenboeken (ReWo). In the project digital databases and digital tools for WBD and WLD (D-square) the dialect data published by two of these dictionary projects (Woordenboek van de Brabantse Dialecten and Woordenboek van de Limburgse Dialecten) is being digitised. One of the additional goals of the D-square project is the development of an infrastructure for electronic access to all dialect dictionaries collaborating in the ReWo. In this paper we will firstly reconsider the nature of the core data types - form, sense and location - present in the different dialect dictionaries and the ways these data types are further classified. Next we will focus on the problems encountered when trying to unify this dictionary data and their classifications and suggest solutions. Finally we will look at several implementation issues regarding a specific encoding for the dictionaries.