The paper reports on the first steps in developing a time-stamped multimodal dataset of reading data from Bulgarian children. Data are being collected, structured and analysed by means of ReadLet, an innovative infrastructure for multimodal language data collection that uses a tablet as a reader’s front-end. The overall goal of the project is to quantitatively analyse the reading skills of a sample of early Bulgarian readers, using data collected over a two-year period, and to compare them with those of early readers of Italian, whose data were collected using the same protocol. We illustrate design issues of the experimental protocol, as well as the data acquisition process and the post-processing phase of data annotation/augmentation. To evaluate the potential and usefulness of the Bulgarian dataset for reading research, we present some preliminary statistical analyses of our recently collected data. These analyses show robust convergence between the early stages of reading development in Bulgarian and Italian.
Eye-tracking data collected during reading provide significant insights into the cognitive processes underlying language comprehension. Metrics such as fixation duration allow for the estimation of lexical, contextual, and higher-level structural effects on word identification. Although psycholinguistic experiments have elucidated these effects, the extent to which computational models can predict gaze patterns remains unclear. Recent developments in computational modeling, particularly the use of pre-trained transformer language models, have shown promising results in mirroring human reading behaviors. However, previous studies have not adequately compared these models to alternative architectures, nor have they comprehensively considered a range of input features. This paper addresses these gaps by replicating prior findings on English data, critically evaluating performance metrics, and proposing a stricter accuracy measurement method. Furthermore, it compares different computational models, demonstrating that simpler architectures can achieve results comparable to or better than those of transformers. The study also emphasizes the significance of individual differences in reading behavior, which pose challenges for simulating natural reading tasks.
The paper presents the design and construction of a time-stamped multimodal dataset for reading research, including multiple time-aligned temporal signals elicited in four experimental trials of connected text reading by both child and adult readers. We present the experimental protocols, as well as the data acquisition process and the post-processing phase of data annotation/augmentation. To evaluate the potential and usefulness of a time-aligned multimodal dataset for reading research, we present a few statistical analyses showing the correlation and complementarity of multimodal time series of reading data, as well as some results of modelling adults’ reading data by integrating different modalities. The total dataset size amounts to about 2.5 GB in compressed format.