“He Said She Said” ― a Male/Female Corpus of Polish

Filip Graliński; Łukasz Borchmann; Piotr Wierzchoń

Abstract

Gender differences in language use have long been of interest in linguistics, and the task of automatic gender attribution has been considered in computational linguistics as well. Most research of this type uses texts (usually English) with authorship metadata. In this paper, we propose a new method of creating a male/female corpus based on gender-specific first-person expressions. The method was applied to the CommonCrawl Web corpus for Polish (a language in which gender-revealing first-person expressions are particularly frequent) to yield a large (780M words) and varied collection of men’s and women’s texts. The whole procedure for building the corpus and filtering out unwanted texts is described. A quality check on a random sample of the corpus showed that the majority (84%) of texts are correctly attributed, natural texts. Some preliminary (socio)linguistic insights (websites and words frequently occurring in male/female fragments) are given as well.
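To illustrate the idea of gender-specific first-person expressions, here is a minimal Python sketch. It relies on one fact about Polish: first-person singular past-tense verbs end in "-łem" for male speakers (e.g. "byłem") and "-łam" for female speakers (e.g. "byłam"). The regular expressions, the function name `guess_gender`, and the majority-vote rule are assumptions made for this illustration only; they are not the procedure used in the paper, which is not detailed in this abstract.

```python
import re

# Assumption: only first-person singular past-tense endings are considered,
# "-łem" (masculine) and "-łam" (feminine). This is a crude heuristic:
# nouns such as "pomysłem" or the imperative "złam" also match, so a real
# pipeline would need further filtering.
MASC_PATTERN = re.compile(r"\b\w+łem\b")
FEM_PATTERN = re.compile(r"\b\w+łam\b")


def guess_gender(text):
    """Return "male", "female", or None based on a simple majority vote."""
    lowered = text.lower()
    masc_hits = len(MASC_PATTERN.findall(lowered))
    fem_hits = len(FEM_PATTERN.findall(lowered))
    if masc_hits > fem_hits:
        return "male"
    if fem_hits > masc_hits:
        return "female"
    return None  # no evidence, or a tie


if __name__ == "__main__":
    print(guess_gender("Wczoraj byłam w kinie i kupiłam bilet."))
    print(guess_gender("Byłem tam i zrobiłem zdjęcie."))
    print(guess_gender("To jest dom."))
```

A heuristic this simple produces false positives, which is one reason a filtering step and a manual quality check, as mentioned in the abstract, matter for a corpus built this way.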

Anthology ID: L16-1648
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4105–4110
URL: https://aclanthology.org/L16-1648/
Cite (ACL): Filip Graliński, Łukasz Borchmann, and Piotr Wierzchoń. 2016. “He Said She Said” ― a Male/Female Corpus of Polish. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4105–4110, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): “He Said She Said” ― a Male/Female Corpus of Polish (Graliński et al., LREC 2016)
PDF: https://aclanthology.org/L16-1648.pdf
