Xue Zhao

2025

Ancient Chinese books have great values in history and cultural studies. Named en-tities like person, location, time are cru-cial elements, thus automatic Named En-tity Recognition (NER) is considered a ba-sic task in ancient Chinese text processing. This paper introduces EvaHan2025, the first international ancient Chinese Named Entity Recognition bake-off. The evalua-tion introduces a rigorous benchmark for assessing NER performance across histori-cal and medical texts, covering 12 named entity types. A total of 13 teams par-ticipated in the competition, submitting 77 system runs. In the closed modality, where participants were restricted to us-ing only the training data, the highest F1 scores reached 85.04% on TestA and 90.28% on TestB, both derived from his-torical texts, while performance on medi-cal texts (TestC) reached 84.49%. The re-sults indicate that text genre significantly impacts model performance, with histori-cal texts generally yielding higher scores. Additionally, the intrinsic characteristics of named entities also influence recogni-tion performance. These findings high-light the challenges and opportunities in ancient Chinese NER and underscore the importance of domain adaptation and en-tity type diversity in future research.

2023

This paper present the results of the First International Ancient Chinese Transalation Bakeoff (EvaHan), which is a shared task of the Ancient Language Translation Workshop (ALT2023) and a co-located event of the 19th Edition of the Machine Translation Summit 2023 (MTS 2023). We described the motivation for having an international shared contest, as well as the datasets and tracks. The contest consists of two modalities, closed and open. In the closed modality, the participants are only allowed to use the training data, the partic-ipating teams achieved the highest BLEU scores of 27.3315 and 1.1102 in the tasks of translating Ancient Chinese to Modern Chinese and translating Ancient Chinese to English, respectively. In the open mode, contestants can only use any available data and models. The participating teams achieved the highest BLEU scores of 29.6832 and 6.5493 in the ancient Chinese to modern and ancient Chinese to English tasks, respectively.

2022

Controllable paraphrase generation (CPG) incorporates various external conditions to obtain desirable paraphrases. However, existing works only highlight a special condition under two indispensable aspects of CPG (i.e., lexically and syntactically CPG) individually, lacking a unified circumstance to explore and analyze their effectiveness. In this paper, we propose a general controllable paraphrase generation framework (GCPG), which represents both lexical and syntactical conditions as text sequences and uniformly processes them in an encoder-decoder paradigm. Under GCPG, we reconstruct commonly adopted lexical condition (i.e., Keywords) and syntactical conditions (i.e., Part-Of-Speech sequence, Constituent Tree, Masked Template and Sentential Exemplar) and study the combination of the two types. In particular, for Sentential Exemplar condition, we propose a novel exemplar construction method — Syntax-Similarity based Exemplar (SSE). SSE retrieves a syntactically similar but lexically different sentence as the exemplar for each target sentence, avoiding exemplar-side words copying problem. Extensive experiments demonstrate that GCPG with SSE achieves state-of-the-art performance on two popular benchmarks. In addition, the combination of lexical and syntactical conditions shows the significant controllable ability of paraphrase generation, and these empirical results could provide novel insight to user-oriented paraphrasing.

pdf bib abs
Self-supervised Product Title Rewrite for Product Listing Ads
Xue Zhao | Dayiheng Liu | Junwei Ding | Liang Yao | Mahone Yan | Huibo Wang | Wenqing Yao
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

Product Listing Ads (PLAs) are primary online advertisements merchants pay to attract more customers. However, merchants prefer to stack various attributes to the title and neglect the fluency and information priority. These seller-created titles are not suitable for PLAs as they fail to highlight the core information in the visible part in PLAs titles. In this work, we present a title rewrite solution. Specifically, we train a self-supervised language model to generate high-quality titles in terms of fluency and information priority. Extensive offline test and real-world online test have demonstrated that our solution is effective in reducing the cost and gaining more profit as it lowers our CPC, CPB while improving CTR in the online test by a large amount.

2021

Machine reading comprehension (MRC) is a typical natural language processing (NLP)task and has developed rapidly in the last few years. Various reading comprehension datasets have been built to support MRC studies. However large-scale and high-quality datasets are rare due to the high complexity and huge workforce cost of making sucha dataset. Besides most reading comprehension datasets are in English and Chinesedatasets are insufficient. In this paper we propose an automatic method for MRCdataset generation and build the largest Chinese medical reading comprehension dataset presently named CMedRC. Our dataset contains 17k questions generated by our auto-matic method and some seed questions. We obtain the corresponding answers from amedical knowledge graph and manually check all of them. Finally we test BiLSTM andBERT-based pre-trained language models (PLMs) on our dataset and propose a base-line for the following studies. Results show that the automatic MRC dataset generation method is considerable for future model improvements.