AbstractWe present Pimlico, an open source toolkit for building pipelines for processing large corpora. It is especially focused on processing linguistic corpora and provides wrappers around existing, widely used NLP tools. A particular goal is to ease distribution of reproducible and extensible experiments by making it easy to document and re-run all steps involved, including data loading, pre-processing, model training and evaluation. Once a pipeline is released, it is easy to adapt, for example, to run on a new dataset, or to re-run an experiment with different parameters. The toolkit takes care of many common challenges in writing and distributing corpus-processing code, such as managing data between the steps of a pipeline, installing required software and combining existing toolkits with new, task-specific code.