Clustering tweets using Wikipedia concepts

Guoyu Tang, Yunqing Xia, Weizhi Wang, Raymond Lau, Fang Zheng


Abstract
Two challenging issues are notable in tweet clustering. First, data sparsity is severe because no tweet can exceed 140 characters. Second, synonymy and polysemy are common because users tend to express the same meaning in many different ways. Inspired by recent research indicating that Wikipedia is promising for text representation, we use Wikipedia concepts to represent tweets as concept vectors. We address polysemy with a Bayesian model and synonymy by exploiting Wikipedia redirects. To further alleviate data sparsity, we make use of three types of out-links in Wikipedia. Evaluation on a Twitter dataset shows that the concept model outperforms the traditional vector space model (VSM) in tweet clustering.
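
The synonymy step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the redirect table, the substring matching, and the function names `concept_vector` and `cosine` are all assumptions made for illustration, and the Bayesian disambiguation and out-link expansion mentioned in the abstract are omitted. The sketch only shows how mapping surface forms to a shared concept lets two tweets with no words in common still match.

```python
from collections import Counter
from math import sqrt

# Hypothetical redirect table: surface form -> canonical Wikipedia concept.
# Real redirect data would come from a Wikipedia dump; these entries are
# illustrative assumptions, not data from the paper.
REDIRECTS = {
    "nyc": "New York City",
    "big apple": "New York City",
    "new york city": "New York City",
    "soccer": "Association football",
}


def concept_vector(tweet: str) -> Counter:
    """Count the canonical concepts whose surface forms occur in the tweet.

    Simplification: plain substring matching, with no disambiguation.
    """
    text = tweet.lower()
    vec = Counter()
    for surface, concept in REDIRECTS.items():
        if surface in text:
            vec[concept] += 1
    return vec


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


v1 = concept_vector("Flying to NYC tomorrow")
v2 = concept_vector("I love the Big Apple")
v3 = concept_vector("Watching soccer tonight")

print(cosine(v1, v2))  # the two tweets share a concept despite sharing no words
print(cosine(v1, v3))  # no shared concept
```

In this toy setup, a word-level vector space model would score the first two tweets as unrelated, while the concept vectors coincide; a clustering algorithm would then operate on these similarities.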
Anthology ID:
L14-1640
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2262–2267
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/83_Paper.pdf
Cite (ACL):
Guoyu Tang, Yunqing Xia, Weizhi Wang, Raymond Lau, and Fang Zheng. 2014. Clustering tweets using Wikipedia concepts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2262–2267, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Clustering tweets using Wikipedia concepts (Tang et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/83_Paper.pdf