SPLIT: Smart Preprocessing (Quasi) Language Independent Tool

Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum, Ramy Eskander


Abstract
Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as far as feasible, are among the goals of our scientific enterprise; building systems that ensure consistency across our various pipelines would therefore contribute significantly to those goals. The problem has become quite pronounced as more and more NLP tools become available, each with a different level of specification. In this paper, we present SPLIT, a dynamic, unified preprocessing framework and tool that is highly configurable based on user requirements and serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by providing a unified API that can be shared among researchers to ensure complete transparency in replication. The user can select the required preprocessing tasks from a long list of preprocessing steps. The user can also specify the order of execution, which in turn affects the final preprocessing output.
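The claim that execution order affects the final output can be illustrated with a minimal Python sketch. This is not SPLIT's actual API: the step names (`replace_urls`, `separate_punctuation`), the `STEPS` registry, and the `run_pipeline` function are all hypothetical, chosen only to show a user-selected, user-ordered list of steps.

```python
import re
from typing import Callable, Dict, List

# Hypothetical steps for illustration only; these are not SPLIT's real step names.
Step = Callable[[str], str]


def replace_urls(text: str) -> str:
    # Replace anything that looks like an http(s) URL with a placeholder token.
    return re.sub(r"https?://\S+", "URL", text)


def separate_punctuation(text: str) -> str:
    # Put spaces around common punctuation marks, then collapse extra whitespace.
    spaced = re.sub(r"([.,!?;:])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()


# Assumed registry mapping a step name to its implementation.
STEPS: Dict[str, Step] = {
    "replace_urls": replace_urls,
    "separate_punctuation": separate_punctuation,
}


def run_pipeline(text: str, order: List[str]) -> str:
    # Apply the selected steps in exactly the order the user specified.
    for name in order:
        text = STEPS[name](text)
    return text


sample = "Visit http://example.com/a.b now!"
urls_first = run_pipeline(sample, ["replace_urls", "separate_punctuation"])
punct_first = run_pipeline(sample, ["separate_punctuation", "replace_urls"])

print(urls_first)   # Visit URL now !
print(punct_first)  # Visit http : //example . com/a . b now !
```

In this sketch, separating punctuation first breaks the URL apart, so the URL step no longer matches it. Two pipelines built from the same steps therefore produce different output, which is why an explicit, shareable ordering matters for replication.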
Anthology ID:
L16-1640
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4055–4060
URL:
https://aclanthology.org/L16-1640
Cite (ACL):
Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum, and Ramy Eskander. 2016. SPLIT: Smart Preprocessing (Quasi) Language Independent Tool. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4055–4060, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
SPLIT: Smart Preprocessing (Quasi) Language Independent Tool (Al-Badrashiny et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1640.pdf