%0 Conference Proceedings
%T A Dataset and Classifier for Recognizing Social Media English
%A Blodgett, Su Lin
%A Wei, Johnny
%A O’Connor, Brendan
%Y Derczynski, Leon
%Y Xu, Wei
%Y Ritter, Alan
%Y Baldwin, Tim
%S Proceedings of the 3rd Workshop on Noisy User-generated Text
%D 2017
%8 September
%I Association for Computational Linguistics
%C Copenhagen, Denmark
%F blodgett-etal-2017-dataset
%X While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
%R 10.18653/v1/W17-4408
%U https://aclanthology.org/W17-4408
%U https://doi.org/10.18653/v1/W17-4408
%P 56-61