%0 Conference Proceedings
%T Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments
%A Granroth-Wilding, Mark
%Y Park, Eunjeong L.
%Y Hagiwara, Masato
%Y Milajevs, Dmitrijs
%Y Liu, Nelson F.
%Y Chauhan, Geeticka
%Y Tan, Liling
%S Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F granroth-wilding-2020-pimlico
%X We present Pimlico, an open source toolkit for building pipelines for processing large corpora. It is especially focused on processing linguistic corpora and provides wrappers around existing, widely used NLP tools. A particular goal is to ease distribution of reproducible and extensible experiments by making it easy to document and re-run all steps involved, including data loading, pre-processing, model training and evaluation. Once a pipeline is released, it is easy to adapt, for example, to run on a new dataset, or to re-run an experiment with different parameters. The toolkit takes care of many common challenges in writing and distributing corpus-processing code, such as managing data between the steps of a pipeline, installing required software and combining existing toolkits with new, task-specific code.
%R 10.18653/v1/2020.nlposs-1.14
%U https://aclanthology.org/2020.nlposs-1.14
%U https://doi.org/10.18653/v1/2020.nlposs-1.14
%P 101-109