%0 Conference Proceedings
%T Improving Document Clustering by Removing Unnatural Language
%A Jang, Myungha
%A Choi, Jinho D.
%A Allan, James
%Y Derczynski, Leon
%Y Xu, Wei
%Y Ritter, Alan
%Y Baldwin, Tim
%S Proceedings of the 3rd Workshop on Noisy User-generated Text
%D 2017
%8 September
%I Association for Computational Linguistics
%C Copenhagen, Denmark
%F jang-etal-2017-improving
%X Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important factor of confusing existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model identifying unnatural language components into four categories. First, we create a new annotated corpus by collecting slides and papers in various formats, PPT, PDF, and HTML, where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering by up to 15%. Our corpus and tool are publicly available.
%R 10.18653/v1/W17-4416
%U https://aclanthology.org/W17-4416
%U https://doi.org/10.18653/v1/W17-4416
%P 122-130