%0 Conference Proceedings
%T Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments
%A Granroth-Wilding, Mark
%Y Park, Eunjeong L.
%Y Hagiwara, Masato
%Y Milajevs, Dmitrijs
%Y Liu, Nelson F.
%Y Chauhan, Geeticka
%Y Tan, Liling
%S Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F granroth-wilding-2020-pimlico
%X We present Pimlico, an open source toolkit for building pipelines for processing large corpora. It is especially focused on processing linguistic corpora and provides wrappers around existing, widely used NLP tools. A particular goal is to ease distribution of reproducible and extensible experiments by making it easy to document and re-run all steps involved, including data loading, pre-processing, model training and evaluation. Once a pipeline is released, it is easy to adapt, for example, to run on a new dataset, or to re-run an experiment with different parameters. The toolkit takes care of many common challenges in writing and distributing corpus-processing code, such as managing data between the steps of a pipeline, installing required software and combining existing toolkits with new, task-specific code.
%R 10.18653/v1/2020.nlposs-1.14
%U https://aclanthology.org/2020.nlposs-1.14
%U https://doi.org/10.18653/v1/2020.nlposs-1.14
%P 101-109