Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties

Yan Song; Fei Xia

Abstract

Languages change over time, and ancient languages have been studied in linguistics and other related fields. A main challenge in this research area is the lack of empirical data; for instance, ancient spoken languages often leave little trace of their linguistic properties. From the perspective of natural language processing (NLP), while the NLP community has created dozens of annotated corpora, very few of them are on ancient languages. As an effort toward bridging the gap, we have created a word-segmented and POS-tagged corpus for Archaic Chinese using articles from Huainanzi, a book written during China's Western Han Dynasty (206 BC–9 AD). We then compare this corpus with the Chinese Penn Treebank (CTB), a well-known corpus for Modern Chinese, and report several interesting differences and similarities between the two corpora. Finally, we demonstrate that the CTB can be used to improve the performance of word segmenters and POS taggers for Archaic Chinese, but only through features that behave similarly in the two corpora.

Anthology ID: L14-1163
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3129–3136
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/138_Paper.pdf
Cite (ACL): Yan Song and Fei Xia. 2014. Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3129–3136, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties (Song & Xia, LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/138_Paper.pdf
