Evaluating hypotheses in geolocation on a very large sample of Twitter

Bahar Salehi, Anders Søgaard


Abstract
Recent work in geolocation has made several hypotheses about what linguistic markers are relevant to detect where people write from. In this paper, we examine six hypotheses against a corpus consisting of all geo-tagged tweets from the US, or whose geo-tags could be inferred, in a 19% sample of Twitter history. Our experiments lend support to all six hypotheses, including that spelling variants and hashtags are strong predictors of location. We also study what kinds of common nouns are predictive of location after controlling for named entities such as dolphins or sharks.
Anthology ID:
W17-4409
Volume:
Proceedings of the 3rd Workshop on Noisy User-generated Text
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Leon Derczynski, Wei Xu, Alan Ritter, Tim Baldwin
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
62–67
URL:
https://aclanthology.org/W17-4409/
DOI:
10.18653/v1/W17-4409
Cite (ACL):
Bahar Salehi and Anders Søgaard. 2017. Evaluating hypotheses in geolocation on a very large sample of Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 62–67, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Evaluating hypotheses in geolocation on a very large sample of Twitter (Salehi & Søgaard, WNUT 2017)
PDF:
https://aclanthology.org/W17-4409.pdf