The Influence of Corpus Quality on Statistical Measurements on Language Resources

Thomas Eckart, Uwe Quasthoff, Dirk Goldhahn


Abstract
The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation, the influence of different definitions for the concepts "word" and "sentence" is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which require a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and therefore such measurements can help to improve corpus quality.
Anthology ID:
L12-1257
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2318–2321
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/476_Paper.pdf
Cite (ACL):
Thomas Eckart, Uwe Quasthoff, and Dirk Goldhahn. 2012. The Influence of Corpus Quality on Statistical Measurements on Language Resources. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2318–2321, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
The Influence of Corpus Quality on Statistical Measurements on Language Resources (Eckart et al., LREC 2012)