A Quantitative Study of Data in the NLP community

Margot Mieskes


Abstract
We present the results of a quantitative analysis of publications in the NLP domain with respect to the collection, publication and availability of research data. We find that a wide range of publications rely on data crawled from the web, but few give details on how potentially sensitive data was treated. Additionally, we find that while links to data repositories are given, they often stop working even a short time after publication. We offer several suggestions on how to improve this situation, based on publications from the NLP domain as well as other research areas.
Anthology ID:
W17-1603
Volume:
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Dirk Hovy, Shannon Spruit, Margaret Mitchell, Emily M. Bender, Michael Strube, Hanna Wallach
Venue:
EthNLP
Publisher:
Association for Computational Linguistics
Pages:
23–29
URL:
https://aclanthology.org/W17-1603/
DOI:
10.18653/v1/W17-1603
Cite (ACL):
Margot Mieskes. 2017. A Quantitative Study of Data in the NLP community. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 23–29, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
A Quantitative Study of Data in the NLP community (Mieskes, EthNLP 2017)
PDF:
https://aclanthology.org/W17-1603.pdf