When POS data sets don’t add up: Combatting sample bias

Dirk Hovy, Barbara Plank, Anders Søgaard


Abstract
Several works in Natural Language Processing have recently looked into part-of-speech annotation of Twitter data and typically used their own data sets. Since conventions on Twitter change rapidly, models often show sample bias. Training on a combination of the existing data sets should help overcome this bias and produce more robust models than any trained on the individual corpora. Unfortunately, combining the existing corpora proves difficult: many of the corpora use proprietary tag sets that have little or no overlap. Even when mapped to a common tag set, the different corpora systematically differ in their treatment of various tags and tokens. These differences include both pre-processing decisions and default labels for frequent tokens, which introduce data bias and label bias, respectively. Only if we address these biases can we combine the existing data sets to also overcome sample bias. We present a systematic study of several Twitter POS data sets and the problems of label and data bias, discuss their effects on model performance, and show how to overcome them to learn models that perform well on various test sets, achieving a relative error reduction of up to 21%.
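For readers unfamiliar with the metric, the "relative error reduction" mentioned in the abstract is conventionally the fraction of a baseline model's errors that a new model removes. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Return the fraction of the baseline's errors removed by the new model.

    Both arguments are accuracies in [0, 1]; error rate is 1 - accuracy.
    """
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    if baseline_err <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - new_err) / baseline_err


# Hypothetical accuracies for illustration only (not results from the paper).
rer = relative_error_reduction(0.80, 0.85)
print(f"relative error reduction: {rer:.1%}")
```

Note that a relative reduction is computed against the baseline error rate, so the same absolute gain in accuracy yields a larger relative reduction when the baseline is already strong.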
Anthology ID:
L14-1402
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4472–4475
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/476_Paper.pdf
Cite (ACL):
Dirk Hovy, Barbara Plank, and Anders Søgaard. 2014. When POS data sets don’t add up: Combatting sample bias. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4472–4475, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
When POS data sets don’t add up: Combatting sample bias (Hovy et al., LREC 2014)