Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese

Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama, Masaya Yamaguchi, Hideki Ogura, Wakako Kashino, Toshinobu Ogiso, Hanae Koiso, Yasuharu Den


Abstract
Compilation of a 100 million words balanced corpus called the Balanced Corpus of Contemporary Written Japanese (or BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin board and blogs) and so forth, and when possible, samples are drawn from the rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of ‘word.’ Currently, more than 90 million words have been sampled and XML annotated with respect to text-structure and lexical and character information. A preliminary linear discriminant analysis of text genres using the data of POS frequencies and sentence length revealed it was possible to classify the text genres with a correct identification rate of 88% as far as the samples of books, newspapers, whitepapers, and internet bulletin boards are concerned. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting the considerable variance of the blog texts in terms of the textual register and style.
Anthology ID:
L10-1059
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/99_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama, Masaya Yamaguchi, Hideki Ogura, Wakako Kashino, Toshinobu Ogiso, Hanae Koiso, and Yasuharu Den. 2010. Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese (Maekawa et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/99_Paper.pdf