Segmenting Hashtags using Automatically Created Training Data

Arda Çelebi, Arzucan Özgür


Abstract
Hashtags, which are commonly composed of multiple words, are increasingly used to convey the actual messages in tweets. Understanding what tweets say therefore depends more and more on understanding hashtags. Identifying the individual words that constitute a hashtag is thus an important yet challenging task, owing to the abrupt nature of the language used in tweets. In this study, we introduce a feature-rich approach that uses supervised machine learning methods to segment hashtags. Our approach is unsupervised in the sense that, instead of using manually segmented hashtags to train the machine learning classifiers, we create our training data automatically from tweets and by extracting hashtag segmentations from a large corpus. We achieve promising results with such automatically created, noisy training data.
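The abstract does not spell out how segmentations are extracted from a corpus, so the following is only a minimal sketch of one plausible reading: a hashtag is paired with a word sequence when the words, concatenated, spell the hashtag. The function name `mine_segmentations`, the `max_words` limit, and the whitespace tokenization are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def mine_segmentations(hashtags, corpus_sentences, max_words=4):
    """Collect candidate segmentations for hashtags from plain-text sentences.

    Hypothetical helper: a word n-gram counts as a candidate segmentation
    when its words, joined without spaces, equal the hashtag body.
    """
    # Normalize hashtags: drop the leading '#' and lowercase.
    targets = {h.lstrip("#").lower() for h in hashtags}
    found = {}
    for sentence in corpus_sentences:
        # Naive whitespace tokenization (an assumption; no punctuation handling).
        tokens = sentence.lower().split()
        for i in range(len(tokens)):
            for n in range(2, max_words + 1):
                ngram = tokens[i:i + n]
                if len(ngram) < n:
                    break  # ran past the end of the sentence
                joined = "".join(ngram)
                if joined in targets:
                    counts = found.setdefault(joined, Counter())
                    counts[" ".join(ngram)] += 1
    return found


hashtags = ["#ThrowbackThursday", "#nofilter"]
corpus = [
    "Another throwback Thursday photo",
    "no filter needed today",
    "throwback thursday again",
]
candidates = mine_segmentations(hashtags, corpus)
for tag, counts in candidates.items():
    # The most frequent candidate serves as a (noisy) training label.
    print(tag, "->", counts.most_common(1)[0][0])
```

Pairs mined this way are noisy, since an accidental concatenation match yields a wrong label, which is consistent with the abstract's description of the training data as automatically created and noisy.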
Anthology ID:
L16-1476
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2981–2985
URL:
https://aclanthology.org/L16-1476
Cite (ACL):
Arda Çelebi and Arzucan Özgür. 2016. Segmenting Hashtags using Automatically Created Training Data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2981–2985, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Segmenting Hashtags using Automatically Created Training Data (Çelebi & Özgür, LREC 2016)
PDF:
https://aclanthology.org/L16-1476.pdf