Chuan-Jie Lin


2020

This paper introduced TOCP, a larger dataset of Chinese profanity. This dataset contains natural sentences collected from social media sites, the profane expressions appearing in the sentences, and their rephrasing suggestions which preserve their meanings in a less offensive way. We proposed several baseline systems using neural network models to test this benchmark. We trained embedding models on a profanity-related dataset and proposed several profanity-related features. Our baseline systems achieved an F1-score of 86.37% in profanity detection and an accuracy of 77.32% in profanity rephrasing.

2019

2018

The main goal of the Chinese grammatical error diagnosis task is to detect word errors in sentences written by Chinese-learning students. Our previous system generated error-corrected sentences as candidates, and their sentence likelihoods were measured based on a large-scale Chinese n-gram dataset. This year we further tried to identify long, frequently seen subsentences and label them as correct in order to avoid proposing too many error candidates. Two new methods for suggesting missing and selection errors were also tested.

2017

This paper describes the approaches to sentiment score prediction in the NTOU DSA system participating in DSAP this year. The modules that predict scores for words are adapted from our system last year. The approach to predicting scores for phrases is a keyword-based machine learning method. Our system performs well in predicting scores of phrases.
This paper proposes a system that can detect and rephrase profanity in Chinese text. Rather than just masking detected profanity, we want to revise the input sentence by using inoffensive words while keeping its original meaning. Twenty-nine such rephrasing rules were devised after observing sentences on real-world social websites. The overall accuracy of the proposed system is 85.56%.
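The rephrase-rather-than-mask idea can be illustrated with a minimal sketch. The abstract does not list the actual rules, so the two rules below, the English example words, and the names `REPHRASE_RULES` and `rephrase` are all hypothetical placeholders, not the paper's rule set.

```python
import re

# Hypothetical rules for illustration only: (pattern, milder replacement).
# The real system uses 29 rules for Chinese text, which are not given here.
REPHRASE_RULES = [
    (re.compile(r"\bdamn\b", re.IGNORECASE), "darn"),
    (re.compile(r"\bcrap\b", re.IGNORECASE), "nonsense"),
]


def rephrase(sentence):
    """Apply every rule in order; return the revised sentence and the hit count."""
    hits = 0
    for pattern, replacement in REPHRASE_RULES:
        # subn returns the new string and the number of substitutions made.
        sentence, count = pattern.subn(replacement, sentence)
        hits += count
    return sentence, hits
```

A hit count of zero means no rule fired, which a caller could treat as "no profanity detected" under these toy rules.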

2016

This paper proposes a new idea that uses Wikipedia categories as answer types and defines candidate sets inside Wikipedia. The focus of a given question is searched in the hierarchy of Wikipedia main pages. Our searching strategy combines head-noun matching and synonym matching provided in semantic resources. The set of answer candidates is determined by the entry hierarchy in Wikipedia and the hyponymy hierarchy in WordNet. The experimental results show that the approach finds smaller candidate sets while achieving better performance, especially for the ARTIFACT and ORGANIZATION types, where the performance is better than that of state-of-the-art Chinese factoid QA systems.
Grammatical error diagnosis is an essential part of a language-learning tutoring system. Based on the data sets of Chinese grammar error detection tasks, we proposed a system which measures the likelihood of correction candidates generated by deleting or inserting characters or words, moving substrings to different positions, substituting prepositions with other prepositions, or substituting words with their synonyms or similar strings. Sentence likelihood is measured based on the frequencies of substrings from the space-removed version of Google n-grams. The evaluation on the training set shows that the Missing-related and Selection-related candidate generation methods have promising performance. Our final system achieved a precision of 30.28% and a recall of 62.85% at the identification level, evaluated on the test set.

2015

2014

2013

2012

2010

2006

2003

2001

2000

1999