Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components

Jinxing Yu, Xun Jian, Hao Xin, Yangqiu Song


Abstract
Word embeddings have attracted much attention recently. Unlike words in alphabetic writing systems, Chinese characters are often composed of subcharacter components that are themselves semantically informative. In this work, we propose an approach to jointly embed Chinese words together with their characters and fine-grained subcharacter components. We use three likelihoods to evaluate whether the context words, characters, and components can predict the current target word. We also collected 13,253 subcharacter components to demonstrate that existing approaches to decomposing Chinese characters are not sufficient. Evaluation on both word similarity and word analogy tasks demonstrates the superior performance of our model.
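
The three-likelihood idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact objective, so everything below is an assumption made for illustration: each context (words, characters, components) is summarized by averaging its embeddings, each likelihood is a full softmax over an output word table, and the three log-likelihoods are simply added. All function and variable names (`joint_log_likelihood`, `word_in`, `char_in`, `comp_in`, `word_out`) and the toy tokens are hypothetical and do not come from the paper or its attachment.

```python
import math

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def log_softmax_prob(target, hidden, output_vectors):
    """log P(target | hidden) under a softmax over all output word vectors."""
    scores = {w: sum(a * b for a, b in zip(hidden, vec))
              for w, vec in output_vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values()))
    return scores[target] - log_norm

def joint_log_likelihood(target, context_words, context_chars, context_comps,
                         word_in, char_in, comp_in, word_out):
    """Sum of three log-likelihoods, one per granularity of context.

    Assumption: each context is reduced to the average of its embeddings.
    """
    hiddens = [
        average([word_in[w] for w in context_words]),
        average([char_in[c] for c in context_chars]),
        average([comp_in[s] for s in context_comps]),
    ]
    return sum(log_softmax_prob(target, h, word_out) for h in hiddens)

# Toy setup: zero-initialized 2-dimensional embeddings (illustrative only).
zeros = [0.0, 0.0]
word_in = {"智能": zeros, "时代": zeros}
char_in = {c: zeros for c in "智能时代"}
comp_in = {s: zeros for s in ["矢", "口", "日"]}  # illustrative components
word_out = {"到来": zeros, "智能": zeros, "时代": zeros}

ll = joint_log_likelihood("到来", ["智能", "时代"], list("智能时代"),
                          ["矢", "口", "日"],
                          word_in, char_in, comp_in, word_out)
```

With all embeddings at zero, each softmax is uniform over the three output words, so `ll` equals three times the log of one third. A practical implementation would more likely replace the full softmax with negative sampling or a similar approximation; that choice is not stated in the abstract either.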
Anthology ID:
D17-1027
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
286–291
URL:
https://aclanthology.org/D17-1027
DOI:
10.18653/v1/D17-1027
Cite (ACL):
Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 286–291, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components (Yu et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1027.pdf
Attachment:
 D17-1027.Attachment.zip