2024
CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
Le Qiu | Shanyue Guo | Tak-Sum Wong | Emmanuele Chersoni | John Lee | Chu-Ren Huang
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language models-based features.
2018
Register-sensitive Translation: a Case Study of Mandarin and Cantonese (Non-archival Extended Abstract)
Tak-sum Wong | John Lee
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
2017
Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank
Tak-sum Wong | Kim Gerdes | Herman Leung | John Lee
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
2016
Developing Universal Dependencies for Mandarin Chinese
Herman Leung | Rafaël Poiret | Tak-sum Wong | Xinying Chen | Kim Gerdes | John Lee
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)
This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify cases of idiosyncrasy of Mandarin Chinese that are difficult to fit into the current schema which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Dependencies and the Chinese Dependency Treebank.
A Dependency Treebank of the Chinese Buddhist Canon
Tak-sum Wong | John Lee
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present a dependency treebank of the Chinese Buddhist Canon, which contains 1,514 texts with about 50 million Chinese characters. The treebank was created by an automatic parser trained on a smaller treebank, containing four manually annotated sutras (Lee and Kong, 2014). We report results on word segmentation, part-of-speech tagging and dependency parsing, and discuss challenges posed by the processing of medieval Chinese. In a case study, we exploit the treebank to examine verbs frequently associated with Buddha, and to analyze usage patterns of quotative verbs in direct speech. Our results suggest that certain quotative verbs imply status differences between the speaker and the listener.
2012
Glimpses of Ancient China from Classical Chinese Poems
John Lee | Tak-sum Wong
Proceedings of COLING 2012: Posters