Martijn Goudbeek


How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’
Emiel van Miltenburg | Anouck Braggaar | Nadine Braun | Debby Damen | Martijn Goudbeek | Chris van der Lee | Frédéric Tomas | Emiel Krahmer
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

This paper is part of the larger ReproHum project, where different teams of researchers aim to reproduce published experiments from the NLP literature. Specifically, ReproHum focuses on the reproducibility of human evaluation studies, where participants indicate the quality of different outputs of Natural Language Generation (NLG) systems. This is necessary because without reproduction studies, we do not know how reliable earlier results are. This paper aims to reproduce the second human evaluation study of Puduppully & Lapata (2021), while another lab is attempting to do the same. This experiment uses best-worst scaling to determine the relative performance of different NLG systems. We found that the worst performing system in the original study is now in fact the best performing system across the board. This means that we cannot fully reproduce the original results. We also carry out alternative analyses of the data, and discuss how our results may be combined with the other reproduction study that is carried out in parallel with this paper.


A reproduction study of methods for evaluating dialogue system output: Replicating Santhanam and Shaikh (2019)
Anouck Braggaar | Frédéric Tomas | Peter Blomsma | Saar Hommes | Nadine Braun | Emiel van Miltenburg | Chris van der Lee | Martijn Goudbeek | Emiel Krahmer
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

In this paper, we describe our reproduction effort of the paper: Towards Best Experiment Design for Evaluating Dialogue System Output by Santhanam and Shaikh (2019) for the 2022 ReproGen shared task. We aim to produce the same results, using different human evaluators, and a different implementation of the automatic metrics used in the original paper. Although overall the study posed some challenges to reproduce (e.g. difficulties with reproduction of automatic metrics and statistics), in the end we did find that the results generally replicate the findings of Santhanam and Shaikh (2019) and seem to follow similar trends.


On task effects in NLG corpus elicitation: a replication study using mixed effects modeling
Emiel van Miltenburg | Merel van de Kerkhof | Ruud Koolen | Martijn Goudbeek | Emiel Krahmer
Proceedings of the 12th International Conference on Natural Language Generation

Task effects in NLG corpus elicitation recently started to receive more attention, but are usually not modeled statistically. We present a controlled replication of the study by Van Miltenburg et al. (2018b), contrasting spoken with written descriptions. We collected additional written Dutch descriptions to supplement the spoken data from the DIDEC corpus, and analyzed the descriptions using mixed effects modeling to account for variation between participants and items. Our results show that the effects of modality largely disappear in a controlled setting.


Proceedings of the 11th International Conference on Natural Language Generation
Emiel Krahmer | Albert Gatt | Martijn Goudbeek
Proceedings of the 11th International Conference on Natural Language Generation


The Multilingual Affective Soccer Corpus (MASC): Compiling a biased parallel corpus on soccer reportage in English, German and Dutch
Nadine Braun | Martijn Goudbeek | Emiel Krahmer
Proceedings of the 9th International Natural Language Generation conference


The Demo / Kemo Corpus: A Principled Approach to the Study of Cross-cultural Differences in the Vocal Expression and Perception of Emotion
Martijn Goudbeek | Mirjam Broersma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the Demo / Kemo corpus of Dutch and Korean emotional speech. The corpus has been specifically developed for the purpose of cross-linguistic comparison, and is more balanced than any similar corpus available so far: a) it contains expressions by both Dutch and Korean actors as well as judgments by both Dutch and Korean listeners; b) the same elicitation technique and recording procedure was used for recordings of both languages; c) the same nonsense sentence, which was constructed to be permissible in both languages, was used for recordings of both languages; and d) the emotions present in the corpus are balanced in terms of valence, arousal, and dominance. The corpus contains a comparatively large number of emotions (eight) uttered by a large number of speakers (eight Dutch and eight Korean). The counterbalanced nature of the corpus will enable a stricter investigation of language-specific versus universal aspects of emotional expression than was possible so far. Furthermore, given the carefully controlled phonetic content of the expressions, it allows for analysis of the role of specific phonetic features in emotional expression in Dutch and Korean.

Preferences versus Adaptation during Referring Expression Generation
Martijn Goudbeek | Emiel Krahmer
Proceedings of the ACL 2010 Conference Short Papers