Database of Mandarin Neighborhood Statistics

Karl Neergaard, Hongzhi Xu, Chu-Ren Huang


Abstract
When designing controlled experiments with language stimuli, researchers in psycholinguistics, neurolinguistics, and related fields require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word-level statistics for words and nonwords of Mandarin Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ASCII phonetic transcription (SAMPA), lexical tone, syllable structure, dominant part of speech (PoS), and syllable, segment, and pinyin lengths for each phonological word. The database is designed for researchers concerned particularly with the processing of isolated words, and it accommodates multiple existing hypotheses concerning the structure of the Mandarin syllable. It is divided into multiple files according to two search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.
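As a rough illustration of two of the statistics named in the abstract, the sketch below computes phonological neighborhood density and a homophone count over a toy lexicon. It assumes the common one-segment definition of a neighbor (a form that differs by exactly one substitution, addition, or deletion), which may not match the database's exact procedure. The entries, the transcriptions, the choice to treat tone as a final segment, and the function names are all hypothetical and are not taken from the database files.

```python
from collections import Counter

def is_neighbor(a, b):
    """True if segment tuples a and b differ by exactly one
    substitution, addition, or deletion (assumed definition)."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Lengths differ by one: deleting one segment of the longer form
    # must yield the shorter form (addition or deletion).
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def neighborhood_density(forms):
    """Map each distinct phonological form to its number of neighbors."""
    unique = set(forms)
    return {f: sum(is_neighbor(f, g) for g in unique) for f in unique}

def homophone_counts(entries):
    """Count how many written words share each phonological form."""
    return Counter(form for _, form in entries)

# Hypothetical toy lexicon: (written word, segments with tone as last segment).
# This is only one possible segmentation schema, chosen for illustration.
entries = [
    ("妈", ("m", "a", "1")),
    ("马", ("m", "a", "3")),
    ("码", ("m", "a", "3")),
    ("八", ("p", "a", "1")),
    ("啊", ("a", "1")),
]

density = neighborhood_density(form for _, form in entries)
homophones = homophone_counts(entries)

for form in sorted(density):
    print(form, "neighbors:", density[form], "words sharing form:", homophones[form])
```

Changing how a syllable is split into segments changes which forms count as neighbors, which is presumably why the database provides separate files per segmentation schema.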
Anthology ID:
L16-1636
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4032–4036
URL:
https://aclanthology.org/L16-1636
Cite (ACL):
Karl Neergaard, Hongzhi Xu, and Chu-Ren Huang. 2016. Database of Mandarin Neighborhood Statistics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4032–4036, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Database of Mandarin Neighborhood Statistics (Neergaard et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1636.pdf
Code:
https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics