A Dataset and Classifier for Recognizing Social Media English

Su Lin Blodgett, Johnny Wei, Brendan O’Connor


Abstract
While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
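The abstract does not state how the demographic language model and the supervised identifier are combined. One reading consistent with "increases recall with almost no loss of precision" is a fallback rule: accept a message as English if the supervised identifier accepts it, and otherwise accept it only when the demographic model scores it highly. The sketch below illustrates that reading only and is not the authors' released ensemble. The function names, the toy scorers, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable


def ensemble_is_english(
    text: str,
    supervised_is_english: Callable[[str], bool],
    demographic_score: Callable[[str], float],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Label text as English if either component accepts it.

    The supervised identifier is trusted first; the demographic model
    can only add positives, which is why recall can rise without the
    supervised identifier's decisions being overturned.
    """
    if supervised_is_english(text):
        return True
    return demographic_score(text) >= threshold


# Toy stand-ins for the two components (assumptions for illustration only).
def toy_supervised(text: str) -> bool:
    # Pretend the supervised identifier only recognizes "standard" English.
    return " the " in f" {text.lower()} "


def toy_demographic(text: str) -> float:
    # Pretend the demographic model scores the fraction of tokens that
    # appear in a small social media vocabulary.
    known = {"finna", "tryna", "lol", "u", "gonna"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in known for t in tokens) / len(tokens)


if __name__ == "__main__":
    for msg in ["the cat sat", "u finna lol", "bonjour le monde"]:
        print(msg, "->", ensemble_is_english(msg, toy_supervised, toy_demographic))
```

In a real setting the two callables would wrap an actual supervised language identifier and the demographic model's per-message score, and the threshold would be tuned on annotated data such as the dataset described here.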
Anthology ID:
W17-4408
Volume:
Proceedings of the 3rd Workshop on Noisy User-generated Text
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
56–61
URL:
https://aclanthology.org/W17-4408
DOI:
10.18653/v1/W17-4408
Cite (ACL):
Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2017. A Dataset and Classifier for Recognizing Social Media English. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 56–61, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
A Dataset and Classifier for Recognizing Social Media English (Blodgett et al., WNUT 2017)
PDF:
https://aclanthology.org/W17-4408.pdf