A Turkish Dataset for Gender Identification of Twitter Users

Erhan Sezerer, Ozan Polatbilek, Selma Tekir


Abstract
Author profiling is the identification of an author’s gender, age, and language from his/her texts. With the increasing use of Twitter as a means of expressing thoughts, profiling the gender of an author from his/her tweets has become a challenge. Although several datasets in different languages have been released for this problem, there is still a need for multilingualism. In this work, we propose a dataset of tweets from Turkish Twitter users, labeled with the users’ gender information. The dataset has 3368 users in the training set and 1924 users in the test set, where each user has 100 tweets. The dataset is publicly available.
Anthology ID:
W19-4023
Original:
W19-4023v1
Version 2:
W19-4023v2
Volume:
Proceedings of the 13th Linguistic Annotation Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Annemarie Friedrich, Deniz Zeyrek, Jet Hoek
Venue:
LAW
SIG:
SIGANN
Publisher:
Association for Computational Linguistics
Pages:
203–207
URL:
https://aclanthology.org/W19-4023/
DOI:
10.18653/v1/W19-4023
Cite (ACL):
Erhan Sezerer, Ozan Polatbilek, and Selma Tekir. 2019. A Turkish Dataset for Gender Identification of Twitter Users. In Proceedings of the 13th Linguistic Annotation Workshop, pages 203–207, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Turkish Dataset for Gender Identification of Twitter Users (Sezerer et al., LAW 2019)
PDF:
https://aclanthology.org/W19-4023.pdf