A Japanese Word Dependency Corpus

Shinsuke Mori, Hideki Ogura, Tetsuro Sasada


Abstract
In this paper, we present a Japanese corpus annotated with dependency relationships. It contains about 30 thousand sentences from various domains. Six domains of the Balanced Corpus of Contemporary Written Japanese also have part-of-speech and pronunciation annotation. Dictionary example sentences have pronunciation annotation and cover the basic vocabulary of Japanese, with English sentence equivalents. Economic newspaper articles also have pronunciation annotation, and their topics are similar to those of the Penn Treebank. Invention disclosures have no other annotation, but they have a clear application: machine translation. Unlike existing Japanese corpora, whose unit is a phrase called a bunsetsu, the unit of our corpus is the word, as in corpora for other languages. Each sentence is manually segmented into words. We first present the specification of our corpus. Then we give a detailed explanation of our standard for word dependency. We also report some preliminary results of an MST-based dependency parser on our corpus.
Anthology ID:
L14-1360
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
753–758
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/42_Paper.pdf
Cite (ACL):
Shinsuke Mori, Hideki Ogura, and Tetsuro Sasada. 2014. A Japanese Word Dependency Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 753–758, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
A Japanese Word Dependency Corpus (Mori et al., LREC 2014)