Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies.

Hanno Biber, Evelyn Breiteneder


Abstract
The "AAC - Austrian Academy Corpus" is a diachronic German-language digital text corpus of more than 500 million tokens. The corpus comprises several thousand texts representing a wide range of text types. The primary research aim is to develop language resources for the study of texts. For corpus linguistics and corpus-based language research, large text corpora need to be structured in a systematic way. For this structural purpose, the AAC makes use of the notion of a container. By container, in the context of corpus research, we understand a flexible system for the pragmatic representation, manipulation, modification and structured storage of annotated items of text. The issue of representing a large corpus in formats that offer only limited space is paradigmatic for the general task of representing a language by just a small collection of texts or a small sample of the language. Methods based upon structural normalization and standardization have to be developed in order to provide useful instruments for text studies.
Anthology ID:
L12-1510
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1067–1070
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/857_Paper.pdf
Cite (ACL):
Hanno Biber and Evelyn Breiteneder. 2012. Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1067–1070, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies. (Biber & Breiteneder, LREC 2012)