HashSet - A Dataset For Hashtag Segmentation

Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru


Abstract
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information such as topic and sentiment, which is useful in downstream tasks. Hashtags prioritize brevity and are written in distinctive ways: transliterating and mixing languages, varying spellings, and coining creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small and extracted from a single set of tweets. However, datasets should reflect the variation in hashtag writing styles and account for domain and language specificity; otherwise, the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and that datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; b) a 3.3M loosely supervised dataset. HashSet is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags for building and validating hashtag segmentation models. We analyze the performance of state-of-the-art (SOTA) models for hashtag segmentation and show that the proposed dataset provides an alternate set of hashtags to train and assess models.
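To make the task definition concrete, the following is a minimal sketch of hashtag segmentation using a dictionary-based word-break search. It is not the method evaluated in the paper and does not use the HashSet data; the toy vocabulary `VOCAB` and the function name `segment_hashtag` are assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical toy vocabulary (an assumption for this sketch); a practical
# system would rely on a large lexicon or a learned model instead.
VOCAB = {
    "hash", "tag", "hashtag", "segmentation",
    "black", "lives", "matter",
    "we", "need", "a", "break",
}


def segment_hashtag(hashtag: str, vocab: set[str] = VOCAB) -> Optional[list[str]]:
    """Split a hashtag into vocabulary words, preferring the fewest tokens.

    Returns None when no full segmentation exists under the vocabulary.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] is the shortest token list covering text[:i], or None if
    # text[:i] cannot be segmented with the given vocabulary.
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocab:
                candidate = prefix + [word]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[n]
```

Under this toy vocabulary, `segment_hashtag("#BlackLivesMatter")` yields `["black", "lives", "matter"]`, while a hashtag containing a word outside the vocabulary yields `None`. That failure case mirrors the abstract's point: transliterated, code-mixed, or creatively spelled hashtags fall outside fixed lexicons, which is why evaluation on a wider variety of hashtags matters.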
Anthology ID:
2022.lrec-1.782
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7215–7219
URL:
https://aclanthology.org/2022.lrec-1.782
Cite (ACL):
Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, and Ponnurangam Kumaraguru. 2022. HashSet - A Dataset For Hashtag Segmentation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7215–7219, Marseille, France. European Language Resources Association.
Cite (Informal):
HashSet - A Dataset For Hashtag Segmentation (Kodali et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.782.pdf
Code:
prashantkodali/hashset