Text Readability and Word Distribution in Japanese

Satoshi Sato


Abstract
This paper reports the relation between text readability and word distribution in the Japanese language. No similar study existed previously because of three major obstacles: (1) the unclear definition of a Japanese “word”, (2) the lack of a balanced corpus, and (3) the lack of a readability measure. The compilation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and the development of a readability predictor remove these three obstacles and enable this study. First, we counted the frequency of each word in each text in the corpus. Then we calculated the frequency rank of words both in the whole corpus and in each of three readability bands. The three major findings are: (1) the proportion of high-frequency words to tokens in Japanese is lower than that in English; (2) the type-coverage curve of words in the difficult band has an unexpected shape; (3) the intersection between the high-frequency words in the easy band and those in the difficult band is unexpectedly small.
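The counting and ranking steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the toy token lists, and the band labels are hypothetical, and it assumes each text has already been segmented into word tokens and assigned to a readability band. "Coverage" is read here as the share of all tokens accounted for by the top-n most frequent word types, which is one plausible interpretation of the curve mentioned in the abstract.

```python
from collections import Counter


def rank_words(texts):
    """Count tokens over a list of tokenized texts and return
    (word, count) pairs sorted by descending frequency.
    Ties are broken alphabetically so the ranking is deterministic."""
    counts = Counter()
    for tokens in texts:
        counts.update(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def token_coverage(ranked, top_n):
    """Fraction of all tokens covered by the top_n most frequent types."""
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    covered = sum(count for _, count in ranked[:top_n])
    return covered / total


def top_overlap(ranked_a, ranked_b, top_n):
    """Number of word types shared by the top_n lists of two rankings."""
    top_a = {word for word, _ in ranked_a[:top_n]}
    top_b = {word for word, _ in ranked_b[:top_n]}
    return len(top_a & top_b)


# Hypothetical toy data: two pre-tokenized texts per readability band.
bands = {
    "easy": [["a", "b", "a", "c"], ["a", "b"]],
    "difficult": [["x", "y", "a", "x"], ["z", "x"]],
}

ranked = {band: rank_words(texts) for band, texts in bands.items()}
easy_cov = token_coverage(ranked["easy"], top_n=2)
shared = top_overlap(ranked["easy"], ranked["difficult"], top_n=2)
```

Evaluating `token_coverage` over a range of `top_n` values gives the points of a coverage curve for one band, and `top_overlap` corresponds to the size of the intersection between the high-frequency words of two bands.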
Anthology ID:
L14-1505
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2811–2815
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/633_Paper.pdf
Cite (ACL):
Satoshi Sato. 2014. Text Readability and Word Distribution in Japanese. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2811–2815, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Text Readability and Word Distribution in Japanese (Sato, LREC 2014)