Hongzhi Xu


PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions (VMWEs). Since the previous version, new languages have joined the undertaking, some existing corpora have been enriched with new annotated texts, and others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.


Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch | Agata Savary | Bruno Guillaume | Jakub Waszczuk | Marie Candito | Ashwini Vaidya | Verginica Barbu Mititelu | Archna Bhatia | Uxoa Iñurrieta | Voula Giouli | Tunga Güngör | Menghan Jiang | Timm Lichte | Chaya Liebeskind | Johanna Monti | Renata Ramisch | Sara Stymne | Abigail Walsh | Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

Modeling Morphological Typology for Unsupervised Learning of Language Morphology
Hongzhi Xu | Jordan Kodner | Mitchell Marcus | Charles Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper describes a language-independent model for fully unsupervised morphological analysis that exploits a universal framework leveraging morphological typology. By modeling morphological processes including suffixation, prefixation, infixation, and full and partial reduplication with constrained stem change rules, our system effectively constrains the search space and offers a wide coverage in terms of morphological typology. The system is tested on nine typologically and genetically diverse languages, and shows superior performance over leading systems. We also investigate the effect of an oracle that provides only a handful of bits per language to signal morphological type.

Morphological Segmentation for Low Resource Languages
Justin Mott | Ann Bies | Stephanie Strassel | Jordan Kodner | Caitlin Richter | Hongzhi Xu | Mitchell Marcus
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes a new morphology resource created by the Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low-resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.


Annotating Chinese Light Verb Constructions according to PARSEME guidelines
Menghan Jiang | Natalia Klyueva | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Unsupervised Morphology Learning with Statistical Paradigms
Hongzhi Xu | Mitchell Marcus | Charles Yang | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes an unsupervised model for morphological segmentation that exploits the notion of paradigms, which are sets of morphological categories (e.g., suffixes) that can be applied to a homogeneous set of words (e.g., nouns or verbs). Our algorithm identifies statistically reliable paradigms from the morphological segmentation result of a probabilistic model, and chooses reliable suffixes from them. The new suffixes can be fed back iteratively to improve the accuracy of the probabilistic model. Finally, the unreliable paradigms are subjected to pruning to eliminate unreliable morphological relations between words. The paradigm-based algorithm significantly improves segmentation accuracy. Our method achieves state-of-the-art results on experiments using the Morpho-Challenge data, including English, Turkish, and Finnish.


Case Studies in the Automatic Characterization of Grammars from Small Wordlists
Jordan Kodner | Spencer Caplan | Hongzhi Xu | Mitchell P. Marcus | Charles Yang
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages


Database of Mandarin Neighborhood Statistics
Karl Neergaard | Hongzhi Xu | Chu-Ren Huang
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word-level statistics for words and nonwords of Mandarin Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ASCII phonetic transcription (SAMPA), lexical tone, syllable structure, dominant PoS, and syllable, segment, and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.


LLT-PolyU: Identifying Sentiment Intensity in Ironic Tweets
Hongzhi Xu | Enrico Santus | Anna Laszlo | Chu-Ren Huang
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Sentiment Analyzer with Rich Features for Ironic and Sarcastic Tweets
Piyoros Tungthamthiti | Enrico Santus | Hongzhi Xu | Chu-Ren Huang | Kiyoaki Shirai
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Auditory Synaesthesia and Near Synonyms: A Corpus-Based Analysis of sheng1 and yin1 in Mandarin Chinese
Qingqing Zhao | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation


Corpus-based Study and Identification of Mandarin Chinese Light Verb Variations
Chu-Ren Huang | Jingxia Lin | Menghan Jiang | Hongzhi Xu
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese
Jingxia Lin | Hongzhi Xu | Menghan Jiang | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Annotate and Identify Modalities, Speech Acts and Finer-Grained Event Types in Chinese Text
Hongzhi Xu | Chu-Ren Huang
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing


A Rule System for Chinese Time Entity Recognition by Comprehensive Linguistic Study
Hongzhi Xu | Chu-Ren Huang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Primitives of Events and the Semantic Representation
Hongzhi Xu | Chu-Ren Huang
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013)


Compositionality of NN Compounds: A Case Study on [N1+Artifactual-Type Event Nouns]
Shan Wang | Chu-Ren Huang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

The Headedness of Mandarin Chinese Serial Verb Constructions: A Corpus-Based Study
Jingxia Lin | Chu-Ren Huang | Huarui Zhang | Hongzhi Xu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

A Grammar-informed Corpus-based Sentence Database for Linguistic and Computational Studies
Hongzhi Xu | Helen Kaiyun Chen | Chu-Ren Huang | Qin Lu | Dingxu Shi | Tin-Shing Chiu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We adopt a corpus-informed approach to example sentence selection for the construction of a reference grammar. In the process, we construct a database of sentences carefully selected by linguistic experts to cover the full range of linguistic facts in an authoritative Chinese reference grammar, and structure it according to that grammar. A search engine is developed to help users find the most typical examples they need to study a linguistic problem or test their hypotheses. The database can also be used by computational linguists as a training corpus for Chinese word segmentation, POS tagging, and sentence parsing models.


Expanding Chinese Sentiment Dictionaries from Large Scale Unlabeled Corpus
Hongzhi Xu | Kai Zhao | Likun Qiu | Changjian Hu
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation


Discovery of Dependency Tree Patterns for Relation Extraction
Hongzhi Xu | Changjian Hu | Guoyang Shen
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2


Combining Context Features by Canonical Belief Network for Chinese Part-Of-Speech Tagging
Hongzhi Xu | Chunping Li
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II