Baoxin Wang


2022

pdf bib
CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers
Baoxin Wang | Xingyi Duan | Dayong Wu | Wanxiang Che | Zhigang Chen | Guoping Hu
Proceedings of the 29th International Conference on Computational Linguistics

The Chinese text correction (CTC) focuses on detecting and correcting Chinese spelling errors and grammatical errors. Most existing datasets of Chinese spelling check (CSC) and Chinese grammatical error correction (GEC) are focused on a single sentence written by Chinese-as-a-second-language (CSL) learners. We find that errors caused by native speakers differ significantly from those produced by non-native speakers. These differences make it inappropriate to use the existing test sets directly to evaluate text correction systems for native speakers. Some errors also require the cross-sentence information to be identified and corrected. In this paper, we propose a cross-sentence Chinese text correction dataset for native speakers. Concretely, we manually annotated 1,500 texts written by native speakers. The dataset consists of 30,811 sentences and more than 1,000,000 Chinese characters. It contains four types of errors: spelling errors, redundant words, missing words, and word ordering errors. We also test some state-of-the-art models on the dataset. The experimental results show that even the model with the best performance is 20 points lower than humans, which indicates that there is still much room for improvement. We hope that the new dataset can fill the gap in cross-sentence text correction for native Chinese speakers.

pdf bib
CINO: A Chinese Minority Pre-trained Language Model
Ziqing Yang | Zihang Xu | Yiming Cui | Baoxin Wang | Min Lin | Dayong Wu | Zhigang Chen
Proceedings of the 29th International Conference on Computational Linguistics

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. It greatly facilitates the applications of natural language processing on low-resource languages. However, there are still some languages that the current multilingual models do not perform well on. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of the multilingual model on ethnic minority languages, we collect documents from Wikipedia and news websites, and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.

2021

pdf bib
Dynamic Connected Networks for Chinese Spelling Check
Baoxin Wang | Wanxiang Che | Dayong Wu | Shijin Wang | Guoping Hu | Ting Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Combining ResNet and Transformer for Chinese Grammatical Error Diagnosis
Shaolei Wang | Baoxin Wang | Jiefu Gong | Zhongyuan Wang | Xiao Hu | Xingyi Duan | Zizhuo Shen | Gang Yue | Ruiji Fu | Dayong Wu | Wanxiang Che | Shijin Wang | Guoping Hu | Ting Liu
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

Grammatical error diagnosis is an important task in natural language processing. This paper introduces our system at NLPTEA-2020 Task: Chinese Grammatical Error Diagnosis (CGED). CGED aims to diagnose four types of grammatical errors which are missing words (M), redundant words (R), bad word selection (S) and disordered words (W). Our system is built on the model of multi-layer bidirectional transformer encoder and ResNet is integrated into the encoder to improve the performance. We also explore two ensemble strategies including weighted averaging and stepwise ensemble selection from libraries of models to improve the performance of single model. In official evaluation, our system obtains the highest F1 scores at identification level and position level. We also recommend error corrections for specific error types and achieve the second highest F1 score at correction level.

2019

pdf bib
IFlyLegal: A Chinese Legal System for Consultation, Law Searching, and Document Analysis
Ziyue Wang | Baoxin Wang | Xingyi Duan | Dayong Wu | Shijin Wang | Guoping Hu | Ting Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Legal Tech is developed to help people with legal services and solve legal problems via machines. To achieve this, one of the key requirements for machines is to utilize legal knowledge and comprehend legal context. This can be fulfilled by natural language processing (NLP) techniques, for instance, text representation, text categorization, question answering (QA) and natural language inference, etc. To this end, we introduce a freely available Chinese Legal Tech system (IFlyLegal) that benefits from multiple NLP tasks. It is an integrated system that performs legal consulting, multi-way law searching, and legal document analysis by exploiting techniques such as deep contextual representations and various attention mechanisms. To our knowledge, IFlyLegal is the first Chinese legal system that employs up-to-date NLP techniques and caters for needs of different user groups, such as lawyers, judges, procurators, and clients. Since Jan, 2019, we have gathered 2,349 users and 28,238 page views (till June, 23, 2019).

2018

pdf bib
Disconnected Recurrent Neural Networks for Text Categorization
Baoxin Wang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recurrent neural network (RNN) has achieved remarkable performance in text categorization. RNN can model the entire sequence and capture long-term dependencies, but it does not do well in extracting key patterns. In contrast, convolutional neural network (CNN) is good at extracting local and position-invariant features. In this paper, we present a novel model named disconnected recurrent neural network (DRNN), which incorporates position-invariance into RNN. By limiting the distance of information flow in RNN, the hidden state at each time step is restricted to represent words near the current position. The proposed model makes great improvements over RNN and CNN models and achieves the best performance on several benchmark datasets for text categorization.