AbstractAs in any field of inquiry that depends on experiments, the verifiability of experimental studies is important in computational linguistics. Despite increased attention to verification of empirical results, the practices in the field are unclear. Furthermore, we argue, certain traditions and practices that are seemingly useful for verification may in fact be counterproductive. We demonstrate this through a set of multi-lingual experiments on parsing Universal Dependencies treebanks. In particular, we show that emphasis on exact replication leads to practices (some of which are now well established) that hide the variation in experimental results, effectively hindering verifiability with a false sense of certainty. The purpose of the present paper is to highlight the magnitude of the issues resulting from these common practices with the hope of instigating further discussion. Once we, as a community, are convinced about the importance of the problems, the solutions are rather obvious, although not necessarily easy to implement.