Textual Characteristics for Language Engineering

Mathias Bank, Robert Remus, Martin Schierle


Abstract
Language statistics are widely used to characterize and better understand language. In parallel, the number of text mining and information retrieval methods has grown rapidly over the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, there have so far been almost no attempts to link the areas of natural language processing and language statistics in order to properly characterize those evaluation corpora and to help others pick the most appropriate algorithms for their particular corpus. We believe no results in the field of natural language processing should be published without a quantitative description of the corpora used. Only then can the real value of proposed methods be determined and their transferability to corpora from different genres or domains be estimated. We lay the groundwork for a language engineering process by gathering and defining a set of textual characteristics we consider valuable for building natural language processing systems. We carry out a case study on the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and help establish a good practice of corpus-aware evaluation.
Anthology ID:
L12-1046
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
515–519
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/182_Paper.pdf
Cite (ACL):
Mathias Bank, Robert Remus, and Martin Schierle. 2012. Textual Characteristics for Language Engineering. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 515–519, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Textual Characteristics for Language Engineering (Bank et al., LREC 2012)