SuperCAT: The (New and Improved) Corpus Analysis Toolkit

K. Bretonnel Cohen, William A. Baumgartner Jr., Irina Temnikova


Abstract
This paper presents SuperCAT, a corpus analysis toolkit. It is a radical extension of SubCAT, the Sublanguage Corpus Analysis Toolkit, from sublanguage analysis to corpus analysis in general. The idea behind SuperCAT is that representative corpora have no tendency towards closure: that is, they tend towards infinity. In contrast, non-representative corpora have a tendency towards closure, or, roughly, finiteness. SuperCAT focuses on general techniques for the quantitative description of the characteristics of any corpus (or other language sample), particularly the characteristics of its lexical distributions. Additionally, SuperCAT features a complete re-engineering of the previous SubCAT architecture.
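The notion of closure in the abstract can be illustrated with a type growth curve: the number of distinct word types seen as more tokens are read. The following is a minimal sketch in Python, not SuperCAT's own code or API. The function names `type_growth_curve` and `tail_growth_rate`, the `step` parameter, and the lowercasing of tokens are all assumptions made for illustration. A curve that flattens suggests a tendency towards closure, while one that keeps rising suggests none.

```python
from typing import List, Tuple


def type_growth_curve(tokens: List[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step`
    tokens and once more at the end of the sample.

    Hypothetical helper for illustration; not part of SuperCAT.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())  # assumption: case-insensitive types
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def tail_growth_rate(curve: List[Tuple[int, int]]) -> float:
    """New types per token over the last sampled interval of the curve.

    A value near 0 is consistent with closure; a value that stays well
    above 0 is consistent with an open-ended lexicon.
    """
    if len(curve) < 2:
        raise ValueError("need at least two sample points")
    (n0, v0), (n1, v1) = curve[-2], curve[-1]
    return (v1 - v0) / (n1 - n0)
```

For example, a sample that only repeats two words yields a tail growth rate of 0, whereas a sample in which every token is new yields a rate of 1. Real corpora fall between these extremes, and the choice of `step` affects how smooth the estimate is.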
Anthology ID:
L16-1442
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2784–2788
URL:
https://aclanthology.org/L16-1442
Cite (ACL):
K. Bretonnel Cohen, William A. Baumgartner Jr., and Irina Temnikova. 2016. SuperCAT: The (New and Improved) Corpus Analysis Toolkit. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2784–2788, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
SuperCAT: The (New and Improved) Corpus Analysis Toolkit (Cohen et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1442.pdf