Liang Xu

2023

pdf bib abs
UMRSpell: Unifying the Detection and Correction Parts of Pre-trained Models towards Chinese Missing, Redundant, and Spelling Correction
Zheyu He | Yujin Zhu | Linlin Wang | Liang Xu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled charac- ters in Chinese texts. As an important step for various downstream tasks, CSC confronts two challenges: 1) Character-level errors consist not only of spelling errors but also of missing and redundant ones that cause variable length between input and output texts, for which most CSC methods could not handle well because of the consistence length of texts required by their inherent detection-correction framework. Con- sequently, the two errors are considered out- side the scope and left to future work, despite the fact that they are widely found and bound to CSC task in Chinese industrial scenario, such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR). 2) Most existing CSC methods focus on either detector or corrector and train different mod- els for each one, respectively, leading to in- sufficiency of parameters sharing. To address these issues, we propose a novel model UMR- Spell to learn detection and correction parts together at the same time from a multi-task learning perspective by using a detection trans- mission self-attention matrix, and flexibly deal with both missing, redundant, and spelling er- rors through re-tagging rules. Furthermore, we build a new dataset ECMR-2023 containing five kinds of character-level errors to enrich the CSC task closer to real-world applications. Ex- periments on both SIGHAN benchmarks and ECMR-2023 demonstrate the significant effec- tiveness of UMRSpell over previous represen- tative baselines.

2022

pdf bib abs
How NLP Can Strengthen Digital Game Based Language Learning Resources for Less Resourced Languages
Monica Ward | Liang Xu | Elaine Uí Dhonnchadha
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference

This paper provides an overview of the Cipher engine which enables the development of a Digital Educational Game (DEG) based on noticing ciphers or patterns in texts. The Cipher engine was used to develop the Cipher: Faoi Gheasa, a digital educational game for Irish, which incorporates NLP resources and is informed by Digital Game-Based Language Learning (DGBLL) and Computer-Assisted Language Learning (CALL) research. The paper outlines six phases where NLP has strengthened the Cipher: Faoi Gheasa game. It shows how the Cipher engine can be used to build a Cipher game for other languages, particularly low-resourced and endangered languages in which NLP resources are under-developed or few in number.

pdf bib abs
Faoi Gheasa an adaptive game for Irish language learning
Liang Xu | Elaine Uí Dhonnchadha | Monica Ward
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

In this paper, we present a game with a purpose (GWAP) (Von Ahn 2006). The aim of the game is to promote language learning and ‘noticing’ (Skehan, 2013). The game has been designed for Irish, but the framework could be used for other languages. Irish is a minority language which means that L2 learners have limited opportunities for exposure to the language, and additionally, there are also limited (digital) learning resources available. This research incorporates game development, language pedagogy and ICALL language materials development. This paper will focus on the language materials development as this is a bottleneck in the teaching and learning of minority and endangered languages.

pdf bib abs
Cipher – Faoi Gheasa: A Game-with-a-Purpose for Irish
Elaine Uí Dhonnchadha | Monica Ward | Liang Xu
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

This paper describes Cipher – Faoi Gheasa, a ‘game with a purpose’ designed to support the learning of Irish in a fun and enjoyable way. The aim of the game is to promote language ‘noticing’ and to combine the benefits of reading with the enjoyment of computer game playing, in a pedagogically beneficial way. In this paper we discuss pedagogical challenges for Irish, the development of measures for the selection and ranking of reading materials, as well as initial results of game evaluation. Overall user feedback is positive and further testing and development is envisaged.

pdf bib abs
A Corpus-based Study of Corporate Image Represented in Corporate Social Responsibility Report: A Case Study of China Mobile and Vodafone
Xing Chen | Liang Xu
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

By examination of the high-frequency nouns, verbs, and keywords, the present study probes into the similarities and differences of corporate images represented in Corporate Social Responsibility (CSR) reports of China Mobile and Vodafone. The results suggest that: 1) both China Mobile and Vodafone prefer using some positive words, like improve, support and service to shape a positive, approachable and easy-going corporate image, and an image of prioritizing the environmental sustainability and the well-being of people; 2) CSR reports of China Mobile contain the keywords poverty and alleviation, which means China Mobile is pragmatic, collaborative and active to assume the responsibility for social events; 3) CSR reports of Vodafone contain keywords like privacy, women and global as well as some other countries, which shows Vodafone is enterprising, globalized and attentive to the development of women; 4) these differences might be related to the ideology and social culture of Chinese and British companies. This study may contribute to understanding the function of CSR report and offer helpful implications for broadening the research of corporate image.

2021

pdf bib abs
An Alignment-Agnostic Model for Chinese Text Error Correction
Liying Zheng | Yue Deng | Weishun Song | Liang Xu | Jing Xiao
Findings of the Association for Computational Linguistics: EMNLP 2021

This paper investigates how to correct Chinese text errors with types of mistaken, missing and redundant characters, which are common for Chinese native speakers. Most existing models based on detect-correct framework can correct mistaken characters, but cannot handle missing or redundant characters due to inconsistency between model inputs and outputs. Although Seq2Seq-based or sequence tagging methods provide solutions to the three error types and achieved relatively good results in English context, they do not perform well in Chinese context according to our experiments. In our work, we propose a novel alignment-agnostic detect-correct framework that can handle both text aligned and non-aligned situations and can serve as a cold start model when no annotation data are provided. Experimental results on three datasets demonstrate that our method is effective and achieves a better performance than most recent published models.

2020

Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g.,SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world’s languages. In this paper, we present the first large-scale NLI dataset (consisting of ~56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We follow closely the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (~12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese NLU. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.

pdf bib abs
Cipher: A Prototype Game-with-a-Purpose for Detecting Errors in Text
Liang Xu | Jon Chamberlain
Workshop on Games and Natural Language Processing

Errors commonly exist in machine-generated documents and publication materials; however, some correction algorithms do not perform well for complex errors and it is costly to employ humans to do the task. To solve the problem, a prototype computer game called Cipher was developed that encourages people to identify errors in text. Gamification is achieved by introducing the idea of steganography as the entertaining game element. People play the game for entertainment while they make valuable annotations to locate text errors. The prototype was tested by 35 players in a evaluation experiment, creating 4,764 annotations. After filtering the data, the system detected manually introduced text errors and also genuine errors in the texts that were not noticed when they were introduced into the game.

The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com

2019

pdf bib abs
An ensemble CNN method for biomedical entity normalization
Pan Deng | Haipeng Chen | Mengyao Huang | Xiaowen Ruan | Liang Xu
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

Different representations of the same concept could often be seen in scientific reports and publications. Entity normalization (or entity linking) is the task to match the different representations to their standard concepts. In this paper, we present a two-step ensemble CNN method that normalizes microbiology-related entities in free text to concepts in standard dictionaries. The method is capable of linking entities when only a small microbiology-related biomedical corpus is available for training, and achieved reasonable performance in the online test of the BioNLP-OST19 shared task Bacteria Biotope.