Document Attrition in Web Corpora: an Exploration

Stephen Wattam, Paul Rayson, Damon Berridge


Abstract
Increases in the use of web data for corpus-building, coupled with the use of specialist, single-use corpora, make for an increasing reliance on language that changes quickly, affecting the long-term validity of studies based on these methods. This ‘drift’ through time affects both users of open-source corpora and those attempting to interpret the results of studies based on web data. The attrition of documents online, also called link rot or document half-life, has been studied many times for the purposes of optimising search engine web crawlers, producing robust and reliable archival systems, and ensuring the integrity of distributed information stores. However, the effect that attrition has upon corpora of varying construction remains largely unknown. This paper presents a preliminary investigation into the differences in attrition rate between corpora selected using different corpus construction methods. It represents the first step in a larger longitudinal analysis, and as such presents URI-based content clues, chosen to relate to studies from other areas. The ultimate goal of this larger study is to produce a detailed enumeration of the primary biases online, and to identify sampling strategies that control and minimise unwanted effects of document attrition.
Anthology ID:
L12-1475
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1486–1489
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/806_Paper.pdf
Cite (ACL):
Stephen Wattam, Paul Rayson, and Damon Berridge. 2012. Document Attrition in Web Corpora: an Exploration. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1486–1489, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Document Attrition in Web Corpora: an Exploration (Wattam et al., LREC 2012)