Tianyuan Cai


2022

Robustness of Hybrid Models in Cross-domain Readability Assessment
Ho Hung Lim | Tianyuan Cai | John S. Y. Lee | Meichun Liu
Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association

2020

Using Bilingual Patents for Translation Training
John Lee | Benjamin Tsou | Tianyuan Cai
Proceedings of the 28th International Conference on Computational Linguistics

While bilingual corpora have been instrumental for machine translation, their utility for training translators has been less explored. We investigate the use of bilingual corpora as pedagogical tools for translation in the technical domain. In a user study, novice translators revised Chinese translations of English patents through bilingual concordancing. Results show that concordancing with an in-domain bilingual corpus can yield greater improvement in the translation quality of technical terms than concordancing with a general-domain bilingual corpus.

Using Verb Frames for Text Difficulty Assessment
John Lee | Meichun Liu | Tianyuan Cai
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

This paper presents the first investigation into using semantic frames to assess text difficulty. Based on Mandarin VerbNet, a verbal semantic database that adopts a frame-based approach, we examine usage patterns of ten verbs in a corpus of graded Chinese texts. We identify a number of characteristics in texts at advanced grades: more frequent use of non-core frame elements; more frequent omission of some core frame elements; increased preference for noun phrases rather than clauses as verb arguments; and more frequent metaphoric usage. These characteristics can potentially be useful for automatic prediction of text readability.

A Counselling Corpus in Cantonese
John Lee | Tianyuan Cai | Wenxiu Xie | Lam Xing
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Virtual agents are increasingly used for delivering health information in general, and mental health assistance in particular. This paper presents a corpus designed for training a virtual counsellor in Cantonese, a variety of Chinese. The corpus consists of a domain-independent subcorpus that supports small talk for rapport building with users, and a domain-specific subcorpus that provides material for a particular area of counselling. The former consists of ELIZA-style responses, chitchat expressions, and a dataset of general dialog, all of which are reusable across counselling domains. The latter consists of example user inputs and appropriate chatbot replies relevant to the specific domain. In a case study, we created a chatbot with a domain-specific subcorpus that addressed 25 issues in test anxiety, with 436 inputs solicited from native speakers of Cantonese and 150 chatbot replies harvested from mental health websites. Preliminary evaluations show that Word Mover’s Distance achieved 56% accuracy in identifying the issue in user input, outperforming a number of baselines.