Publishing the Trove Newspaper Corpus

Steve Cassidy


Abstract
The Trove Newspaper Corpus is derived from the National Library of Australia’s (NLA) digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 and made available for language research as part of the Alveo Virtual Laboratory; it contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large-scale processing of the newspaper text in a cloud-based environment.
Anthology ID:
L16-1715
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4520–4525
URL:
https://aclanthology.org/L16-1715
Cite (ACL):
Steve Cassidy. 2016. Publishing the Trove Newspaper Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4520–4525, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Publishing the Trove Newspaper Corpus (Cassidy, LREC 2016)
PDF:
https://aclanthology.org/L16-1715.pdf