Tao Ge


2024

pdf bib
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
Zhihan Zhang | Tao Ge | Zhenwen Liang | Wenhao Yu | Dian Yu | Mengzhao Jia | Dong Yu | Meng Jiang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on *broadening* the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a *deeper* understanding of the training problems at hand, enhancing performance not only in standard settings but also in more complex scenarios that require reflective thinking. Specifically, we propose **reflective augmentation**, a method that embeds problem reflection into each training instance. It trains the model to consider alternative perspectives and engage with abstractions and analogies, thereby fostering a thorough comprehension through reflective reasoning. Extensive experiments confirm that our method achieves this aim, underscoring its unique advantages and its complementary nature relative to existing augmentation techniques.
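
As a rough illustration of the data-level idea (not the paper's exact format), the sketch below packs a reflection, here an alternative solution plus an abstraction, into a training instance; the field names and the `build_refaug_example` helper are hypothetical placeholders.

```python
def build_refaug_example(question: str, answer: str,
                         alternative: str, abstraction: str) -> dict:
    """Pack a standard QA pair together with a reflection that the model is also
    trained to generate after the answer, so it learns to reason beyond the final answer."""
    reflection = (
        "Reflection:\n"
        f"Alternative reasoning: {alternative}\n"
        f"Abstraction / analogy: {abstraction}"
    )
    return {
        "input": question,
        "target": f"{answer}\n\n{reflection}",  # answer first, reflection appended
    }

example = build_refaug_example(
    question="A shirt costs $20 after a 20% discount. What was the original price?",
    answer="Let p be the original price. 0.8 * p = 20, so p = 25. The original price was $25.",
    alternative="A 20% discount means paying 4/5 of the price, so multiply 20 by 5/4.",
    abstraction="Any 'price after x% off' problem reduces to dividing by (1 - x/100).",
)
```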

pdf bib
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Heming Xia | Zhe Yang | Qingxiu Dong | Peiyi Wang | Yongqi Li | Tao Ge | Tianyu Liu | Wenjie Li | Zhifang Sui
Findings of the Association for Computational Linguistics: ACL 2024

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
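
A minimal sketch of the draft-then-verify loop described above, assuming greedy decoding; `draft_tokens` and `target_next_token` are placeholder callables, and a real implementation verifies all drafted positions with a single parallel forward pass of the target model rather than one call per token.

```python
from typing import Callable, List

def speculative_decode(prefix: List[int],
                       draft_tokens: Callable[[List[int], int], List[int]],
                       target_next_token: Callable[[List[int]], int],
                       gamma: int = 4,
                       max_len: int = 64,
                       eos_id: int = 0) -> List[int]:
    """Greedy draft-then-verify loop.

    draft_tokens(prefix, gamma)  -> gamma candidate tokens from a cheap drafter
    target_next_token(prefix)    -> the target LLM's greedy next token
    For clarity, verification is written one position at a time; in practice all gamma
    positions are verified with a single parallel forward pass of the target model.
    """
    tokens = list(prefix)
    while len(tokens) < max_len:
        for tok in draft_tokens(tokens, gamma):
            expected = target_next_token(tokens)
            if tok != expected:              # first disagreement: keep the target's token, discard the rest
                tokens.append(expected)
                break
            tokens.append(tok)               # agreement: the drafted token is accepted "for free"
        else:
            tokens.append(target_next_token(tokens))  # all drafts accepted; add one bonus token
        if tokens[-1] == eos_id or len(tokens) >= max_len:
            break
    return tokens
```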

pdf bib
Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction
Dingyao Yu | Yang An | Wei Ye | Xiongfeng Xiao | Shaoguang Mao | Tao Ge | Shikun Zhang
Findings of the Association for Computational Linguistics: ACL 2024

Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) *Random Replacement* with the guidance of confusion sets and (2) *OCR/ASR-based Generation* that simulates character misuse. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we establish impressive state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).
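
A heavily simplified sketch of the refining step under the stated idea: keep OCR/ASR-generated pairs only when a calibrated CSC model (trained on random-replacement data) assigns them high confidence. The `csc_confidence` callable and the 0.9 threshold are placeholders, not the paper's actual setup.

```python
from typing import Callable, List, Tuple

def refine_corpus(ocr_asr_pairs: List[Tuple[str, str]],
                  csc_confidence: Callable[[str, str], float],
                  threshold: float = 0.9) -> List[Tuple[str, str]]:
    """ocr_asr_pairs are (noisy_sentence, labeled_correction) pairs from OCR/ASR-based generation.
    csc_confidence(noisy, correction) is the calibrated model's confidence in that correction;
    low-confidence pairs (likely false spelling errors) are filtered out."""
    return [(src, tgt) for src, tgt in ocr_asr_pairs if csc_confidence(src, tgt) >= threshold]
```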

pdf bib
SCALE: Synergized Collaboration of Asymmetric Language Translation Engines
Xin Cheng | Xun Wang | Tao Ge | Si-Qing Chen | Furu Wei | Dongyan Zhao | Rui Yan
Findings of the Association for Computational Linguistics: ACL 2024

In this paper, we introduce SCALE, a collaborative framework that connects a compact Specialized Translation Model (STM) and a general-purpose Large Language Model (LLM) as one unified translation engine. By introducing translation from the STM into the triplet in-context demonstrations, SCALE unlocks the refinement and pivoting ability of the LLM, thus 1) mitigating the language bias of LLMs and the parallel data bias of STMs, 2) enhancing LLM speciality without sacrificing generality, and 3) facilitating continual learning in an LLM-tuning-free way. Our comprehensive experiments show that SCALE significantly outperforms both LLMs (GPT-4, GPT-3.5) and supervised models (NLLB, M2M) in both high-resource and challenging low-resource settings. Moreover, SCALE shows great scalability: by updating only the lightweight STM, it achieves consistent system improvement, an average gain of 4 BLEURT points across 4 languages, without tuning the LLM. Interestingly, SCALE can also effectively exploit the existing language bias of LLMs by using an English-centric STM as a pivot to translate between any language pair, outperforming GPT-4 by an average of 6 COMET points across eight translation directions. Furthermore, we provide an in-depth analysis of SCALE’s robustness, translation characteristics, latency costs, and inherent language bias, providing a solid foundation for future studies exploring the potential synergy between LLMs and more specialized models.

pdf bib
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Zhenhailong Wang | Shaoguang Mao | Wenshan Wu | Tao Ge | Furu Wei | Heng Ji
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds’ strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git.

pdf bib
Low-code LLM: Graphical User Interface over Large Language Models
Yuzhe Cai | Shaoguang Mao | Wenshan Wu | Zehua Wang | Yaobo Liang | Tao Ge | Chenfei Wu | WangYou WangYou | Ting Song | Yan Xia | Nan Duan | Furu Wei
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

Utilizing Large Language Models (LLMs) for complex tasks is challenging, often involving a time-consuming and uncontrollable prompt engineering process. This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. Through visual interaction with a graphical user interface, users can incorporate their ideas into the process without writing trivial prompts. The proposed Low-code LLM framework consists of a Planning LLM that designs a structured planning workflow for complex tasks, which can be correspondingly edited and confirmed by users through low-code visual programming operations, and an Executing LLM that generates responses following the user-confirmed workflow. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability. We demonstrate its benefits using four typical applications. By introducing this framework, we aim to bridge the gap between humans and LLMs, enabling more effective and efficient utilization of LLMs for complex tasks. The code, prompts, and experimental details are available at https://github.com/moymix/TaskMatrix/tree/main/LowCodeLLM. A system demonstration video can be found at https://www.youtube.com/watch?v=jb2C1vaeO3E.

2023

pdf bib
Smart Word Suggestions for Writing Assistance
Chenshuo Wang | Shaoguang Mao | Tao Ge | Wenshan Wu | Xun Wang | Yan Xia | Jonathan Tien | Dongyan Zhao
Findings of the Association for Computational Linguistics: ACL 2023

Enhancing word usage is a desired feature for writing assistance. To further advance research in this area, this paper introduces the “Smart Word Suggestions” (SWS) task and benchmark. Unlike other works, SWS emphasizes end-to-end evaluation and presents a more realistic writing assistance scenario. This task involves identifying words or phrases that require improvement and providing substitution suggestions. The benchmark includes human-labeled data for testing, a large distantly supervised dataset for training, and the framework for evaluation. The test data includes 1,000 sentences written by English learners, accompanied by over 16,000 substitution suggestions annotated by 10 native speakers. The training dataset comprises over 3.7 million sentences and 12.7 million suggestions generated through rules. Our experiments with seven baselines demonstrate that SWS is a challenging task. Based on experimental analysis, we suggest potential directions for future research on SWS. The dataset and related code will be available for research purposes.

pdf bib
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
Heming Xia | Tao Ge | Peiyi Wang | Si-Qing Chen | Furu Wei | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023

We propose Speculative Decoding (SpecDec), the first work to formally study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. SpecDec has two innovations: Spec-Drafter – an independent model specially optimized for efficient and accurate drafting – and Spec-Verification – a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks including machine translation and abstractive summarization show that our approach can achieve around 5x speedup for the popular Transformer architectures with comparable generation quality to beam search decoding, refreshing the impression that the draft-then-verify paradigm introduces only a 1.4x~2x speedup. In addition to the remarkable speedup, we also demonstrate 3 additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and codes are available at https://github.com/hemingkx/SpecDec.

pdf bib
Overview of CCL23-Eval Task 8: Chinese Essay Fluency Evaluation (CEFE) Task
Xinshu Shen | Hongyi Wu | Xiaopeng Bai | Yuanbin Wu | Aimin Zhou | Shaoguang Mao | Tao Ge | Yan Xia
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

This paper provides a comprehensive review of the CCL23-Eval Task 8, i.e., Chinese Essay Fluency Evaluation (CEFE). The primary aim of this task is to systematically identify the types of fine-grained grammatical errors that affect the readability and coherence of essays written by Chinese primary and secondary school students, and then to suggest suitable corrections to enhance the fluidity of their written expression. This task consists of three distinct tracks: (1) Coarse-grained and fine-grained error identification; (2) Character-level error identification and correction; (3) Error sentence rewriting. In the end, we received 44 completed registration forms, leading to a total of 130 submissions from 11 dedicated participating teams. We present the results of all participants and our analysis of these results. Both the dataset and evaluation tool used in this task are available.

2022

pdf bib
EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation
Tao Ge | Si-Qing Chen | Furu Wei
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We introduce EdgeFormer – a parameter-efficient Transformer for on-device seq2seq generation under strict computation and memory constraints. Compared with previous parameter-efficient Transformers, EdgeFormer applies two novel principles for cost-effective parameterization, allowing it to perform better given the same parameter budget; moreover, EdgeFormer is further enhanced by a layer adaptation innovation proposed to improve networks with shared layers. Extensive experiments show EdgeFormer can effectively outperform previous parameter-efficient Transformer baselines and achieve competitive results under both computation and memory constraints. Given the promising results, we release EdgeLM – the pretrained version of EdgeFormer, which is the first publicly available pretrained on-device seq2seq model that can be easily fine-tuned for seq2seq tasks with strong results, facilitating on-device seq2seq generation in practice.

pdf bib
Text Revision by On-the-Fly Representation Optimization
Jingjing Li | Zichao Li | Tao Ge | Irwin King | Michael Lyu
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)

Text revision refers to a family of natural language generation tasks, where the source and target sequences share moderate resemblance in surface form but differ in attributes, such as text formality and simplicity. Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems, which rely on large-scale parallel training corpora. In this paper, we present an iterative in-place editing approach for text revision, which requires no parallel data. In this approach, we simply fine-tune a pre-trained Transformer with masked language modeling and attribute classification. During inference, the editing at each iteration is realized by two-step span replacement. At the first step, the distributed representation of the text is optimized on the fly towards an attribute function. At the second step, a text span is masked and a new one is proposed conditioned on the optimized representation. The empirical experiments on two typical and important text revision tasks, text formalization and text simplification, show the effectiveness of our approach. It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification, and gains better performance than strong unsupervised methods on text formalization.

pdf bib
Plug and Play Knowledge Distillation for kNN-LM with External Logits
Xuyang Jin | Tao Ge | Furu Wei
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Despite the promising evaluation results by knowledge distillation (KD) in natural language understanding (NLU) and sequence-to-sequence (seq2seq) tasks, KD for causal language modeling (LM) remains a challenge. In this paper, we present a novel perspective of knowledge distillation by proposing plug and play knowledge distillation (PP-KD) to improve a (student) kNN-LM that is the state-of-the-art in causal language modeling by leveraging external logits from either a powerful or a heterogeneous (teacher) LM. Unlike conventional logit-based KD where the teacher’s knowledge is built-in during training, PP-KD is plug and play: it stores the teacher’s knowledge (i.e., logits) externally and uses the teacher’s logits of the retrieved k-nearest neighbors during kNN-LM inference at test time. In contrast to marginal perplexity improvement by logit-based KD in conventional neural (causal) LM, PP-KD achieves a significant improvement, enhancing the kNN-LMs in multiple language modeling datasets, showing a novel and promising perspective for causal LM distillation.

2021

pdf bib
Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding
Xin Sun | Tao Ge | Furu Wei | Houfeng Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous Grammatical Error Correction (GEC). SAD optimizes the online inference efficiency for GEC by two innovations: 1) it aggressively decodes as many tokens as possible in parallel instead of always decoding only one token in each step to improve computational parallelism; 2) it uses a shallow decoder instead of the conventional Transformer architecture with balanced encoder-decoder depth to reduce the computational cost during inference. Experiments in both English and Chinese GEC benchmarks show that aggressive decoding could yield identical predictions to greedy decoding but with significant speedup for online inference. Its combination with the shallow decoder could offer an even higher online inference speedup over the powerful Transformer baseline without quality loss. Not only does our approach allow a single model to achieve the state-of-the-art results in English GEC benchmarks: 66.4 F0.5 in the CoNLL-14 and 72.9 F0.5 in the BEA-19 test set with an almost 10x online inference speedup over the Transformer-big model, but also it is easily adapted to other languages. Our code is available at https://github.com/AutoTemp/Shallow-Aggressive-Decoding.
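
A toy sketch of the aggressive-decoding half of SAD, assuming greedy decoding: the input sentence serves as the draft and is verified against the model's own predictions. `predict_all` is a placeholder for the Transformer's teacher-forced parallel forward pass, the re-drafting after a mismatch is cruder than the paper's re-alignment, and the shallow-decoder part is an architecture choice not shown here.

```python
from typing import Callable, List

def aggressive_decode(src: List[int],
                      predict_all: Callable[[List[int], List[int]], List[int]],
                      max_len: int = 128) -> List[int]:
    """Greedy aggressive decoding for GEC-style tasks where the output mostly copies the input.

    predict_all(src, decoder_tokens) returns, for every position i, the model's greedy
    prediction of token i given src and decoder_tokens[:i] (one teacher-forced parallel pass).
    The input itself is used as the draft, so long agreeing spans are accepted at once.
    For simplicity, generation stops once the input draft is consumed.
    """
    out: List[int] = []
    draft = list(src)
    while draft and len(out) < max_len:
        preds = predict_all(src, out + draft)        # verify the whole draft in one pass
        k = 0
        while k < len(draft) and preds[len(out) + k] == draft[k]:
            k += 1                                   # accept the agreeing prefix
        out.extend(draft[:k])
        if k < len(draft):
            out.append(preds[len(out)])              # take the model's token at the first mismatch
            draft = draft[k + 1:]                    # crude re-draft; the paper re-aligns more carefully
        else:
            draft = []
    return out
```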

pdf bib
Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge
Canwen Xu | Wangchunshu Zhou | Tao Ge | Ke Xu | Julian McAuley | Furu Wei
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Cant is important for understanding advertising, comedies and dog-whistle politics. However, computational research on cant is hindered by a lack of available datasets. In this paper, we propose a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective. We formulate a task for cant understanding and provide both quantitative and qualitative analysis for tested word embedding similarity and pretrained language models. Experiments suggest that such a task requires deep language understanding, common sense, and world knowledge and thus can be a good testbed for pretrained language models and help models perform better on other tasks.

pdf bib
Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting
Wangchunshu Zhou | Tao Ge | Canwen Xu | Ke Xu | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose Sequence Span Rewriting (SSR), a self-supervised task for sequence-to-sequence (Seq2Seq) pre-training. SSR learns to refine the machine-generated imperfect text spans into ground truth text. SSR provides more fine-grained and informative supervision in addition to the original text-infilling objective. Compared to the prevalent text infilling objectives for Seq2Seq pre-training, SSR is naturally more consistent with many downstream generation tasks that require sentence rewriting (e.g., text summarization, question generation, grammatical error correction, and paraphrase generation). We conduct extensive experiments by using SSR to improve the typical Seq2Seq pre-trained model T5 in a continual pre-training setting and show substantial improvements over T5 on various natural language generation tasks.
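
One plausible way to materialize an SSR training pair, assuming a generic infilling model behind the `imperfect_fill` callable; the masking scheme and span ratio here are illustrative, not the paper's exact recipe.

```python
import random
from typing import Callable, Tuple

def make_ssr_pair(sentence: str,
                  imperfect_fill: Callable[[str], str],
                  mask_token: str = "<mask>",
                  span_ratio: float = 0.3) -> Tuple[str, str]:
    """Build one Sequence Span Rewriting training pair: mask a span, splice in a weaker
    model's (imperfect) infill, and use the original sentence as the rewriting target."""
    words = sentence.split()
    if not words:
        return sentence, sentence
    span_len = max(1, int(len(words) * span_ratio))
    start = random.randrange(0, len(words) - span_len + 1)
    masked = " ".join(words[:start] + [mask_token] + words[start + span_len:])
    noisy_span = imperfect_fill(masked)   # the infilling model's guess for the masked span
    source = " ".join(words[:start] + [noisy_span] + words[start + span_len:])
    return source, sentence               # (machine-generated imperfect text, ground-truth rewrite)
```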

pdf bib
Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression
Canwen Xu | Wangchunshu Zhou | Tao Ge | Ke Xu | Julian McAuley | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent studies on compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression with regard to robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module replacing with loyalty and robustness. By combining multiple compression techniques, we provide a practical strategy to achieve better accuracy, loyalty and robustness.
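
A small sketch of what the two loyalty metrics could look like in code: label loyalty as teacher-student prediction agreement, and probability loyalty as one minus a Jensen-Shannon distance between their predicted distributions. The exact functional form used in the paper may differ; this only illustrates that both metrics compare the student against the teacher rather than against gold labels.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def label_loyalty(teacher_probs: np.ndarray, student_probs: np.ndarray) -> float:
    """Fraction of examples on which the student predicts the same label as the teacher."""
    return float(np.mean(teacher_probs.argmax(-1) == student_probs.argmax(-1)))

def probability_loyalty(teacher_probs: np.ndarray, student_probs: np.ndarray) -> float:
    """One plausible instantiation: 1 minus the mean Jensen-Shannon distance (base 2, so in [0, 1])
    between teacher and student predicted distributions; higher means more loyal."""
    dists = [jensenshannon(t, s, base=2.0) for t, s in zip(teacher_probs, student_probs)]
    return float(1.0 - np.mean(dists))
```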

2020

pdf bib
UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database
Canwen Xu | Tao Ge | Chenliang Li | Furu Wei
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters back to the original characters in sentences for the fine-grained pretraining to learn the representation of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.

pdf bib
Parallel Data Augmentation for Formality Style Transfer
Yi Zhang | Tao Ge | Xu Sun
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The main barrier to progress in the task of Formality Style Transfer is the inadequacy of training data. In this paper, we study how to augment parallel data and propose novel and simple data augmentation methods for this task to obtain useful sentence pairs with easily accessible models and systems. Experiments demonstrate that our augmented parallel data largely helps improve formality style transfer when it is used to pre-train the model, leading to the state-of-the-art results in the GYAFC benchmark dataset.

pdf bib
Improving Grammatical Error Correction with Machine Translation Pairs
Wangchunshu Zhou | Tao Ge | Chang Mu | Ke Xu | Furu Wei | Ming Zhou
Findings of the Association for Computational Linguistics: EMNLP 2020

We propose a novel data synthesis method to generate diverse error-corrected sentence pairs for improving grammatical error correction, which is based on a pair of machine translation models (e.g., Chinese to English) of different qualities (i.e., poor and good). The poor translation model can resemble the ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammaticality, while the good translation model generally generates fluent and grammatically correct translations. With the pair of translation models, we can generate unlimited numbers of poor to good English sentence pairs from text in the source language (e.g., Chinese) of the translators. Our approach can generate various error-corrected patterns and nicely complement the other data synthesis approaches for GEC. Experimental results demonstrate the data generated by our approach can effectively help a GEC model to improve the performance and achieve the state-of-the-art single-model performance in BEA-19 and CoNLL-14 benchmark datasets.
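
A minimal sketch of the data synthesis idea, assuming two MT systems of different quality behind the `poor_translate` and `good_translate` placeholders.

```python
from typing import Callable, List, Tuple

def synthesize_gec_pairs(source_sentences: List[str],
                         poor_translate: Callable[[str], str],
                         good_translate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Translate each non-English sentence with a weak and a strong MT system: the weak output
    plays the role of ESL-like erroneous English, the strong output is its 'correction'."""
    pairs = []
    for sent in source_sentences:
        erroneous = poor_translate(sent)   # low-quality, disfluent translation
        corrected = good_translate(sent)   # fluent, grammatical translation
        if erroneous != corrected:
            pairs.append((erroneous, corrected))
    return pairs
```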

pdf bib
Pseudo-Bidirectional Decoding for Local Sequence Transduction
Wangchunshu Zhou | Tao Ge | Ke Xu
Findings of the Association for Computational Linguistics: EMNLP 2020

Local sequence transduction (LST) tasks are sequence transduction tasks where there exists massive overlapping between the source and target sequences, such as grammatical error correction and spell or OCR correction. Motivated by this characteristic of LST tasks, we propose Pseudo-Bidirectional Decoding (PBD), a simple but versatile approach for LST tasks. PBD copies the representation of source tokens to the decoder as pseudo future context, enabling the decoder self-attention to attend to its bi-directional context. In addition, the bidirectional decoding scheme and the characteristic of LST tasks motivate us to share the encoder and the decoder of LST models. Our approach provides right-side context information for the decoder, reduces the number of parameters by half, and provides good regularization effects. Experimental results on several benchmark datasets show that our approach consistently improves the performance of standard seq2seq models on LST tasks.
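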

pdf bib
Scheduled DropHead: A Regularization Method for Transformer Models
Wangchunshu Zhou | Tao Ge | Furu Wei | Ming Zhou | Ke Xu
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, which is a key component of the Transformer. In contrast to the conventional dropout mechanism, which randomly drops units or connections, DropHead drops entire attention heads during training to prevent the multi-head attention model from being dominated by a small portion of attention heads. It can help reduce the risk of overfitting and allow the models to better benefit from multi-head attention. Given the interaction between multi-headedness and training dynamics, we further propose a novel dropout rate scheduler to adjust the dropout rate of DropHead throughout training, which results in a better regularization effect. Experimental results demonstrate that our proposed approach can improve transformer models by 0.9 BLEU on the WMT14 En-De translation task and by around 1.0 accuracy point on various text classification tasks.
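
A compact sketch of the core operation, dropping whole attention heads rather than individual units, plus a placeholder schedule; the exact curve used by the scheduled variant is an assumption here.

```python
import torch

def drop_head(attn_heads: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """attn_heads: [batch, num_heads, seq_len, head_dim].
    Zeroes out entire attention heads with probability p and rescales the survivors,
    so the multi-head output cannot be dominated by a few heads during training."""
    if not training or p <= 0.0:
        return attn_heads
    batch, num_heads = attn_heads.shape[:2]
    keep = torch.rand(batch, num_heads, 1, 1, device=attn_heads.device) >= p
    return attn_heads * keep.to(attn_heads.dtype) / (1.0 - p)

def scheduled_drophead_rate(step: int, total_steps: int, p_max: float = 0.2) -> float:
    """Placeholder V-shaped schedule (an assumption, not the paper's exact curve):
    start at p_max, anneal to 0 by mid-training, then ramp back up to p_max."""
    half = max(1, total_steps // 2)
    if step < half:
        return p_max * (1.0 - step / half)
    return p_max * min(1.0, (step - half) / max(1, total_steps - half))
```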

pdf bib
Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction
Mengyun Chen | Tao Ge | Xingxing Zhang | Furu Wei | Ming Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a novel language-independent approach to improve the efficiency for Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC). ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. Then, ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans. Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
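
A schematic sketch of the two-stage pipeline: a cheap tagger proposes erroneous spans (ESD) and a seq2seq model rewrites only those spans (ESC). Both model callables and the span format are placeholders.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # [start, end) character offsets of an erroneous span

def esd_esc_correct(sentence: str,
                    detect_spans: Callable[[str], List[Span]],
                    correct_spans: Callable[[str, List[Span]], List[str]]) -> str:
    """detect_spans is the ESD tagger; correct_spans is the ESC seq2seq model returning one
    corrected string per span. Spans are assumed sorted and non-overlapping. Only the flagged
    spans are rewritten; everything else is copied, which is where the speedup comes from."""
    spans = detect_spans(sentence)
    if not spans:
        return sentence
    corrections = correct_spans(sentence, spans)
    out, prev = [], 0
    for (start, end), fix in zip(spans, corrections):
        out.append(sentence[prev:start])   # copy the untouched text
        out.append(fix)                    # splice in the corrected span
        prev = end
    out.append(sentence[prev:])
    return "".join(out)
```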

pdf bib
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
Canwen Xu | Wangchunshu Zhou | Tao Ge | Furu Wei | Ming Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models. Compared to the previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.
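
A toy sketch of progressive module replacing, assuming each predecessor/successor module maps a hidden state to a hidden state of the same shape; the linear replacement schedule is an illustrative choice.

```python
import random
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Toy sketch of progressive module replacing. Each original (predecessor) module has a
    compact substitute (successor); during training, every module slot independently runs
    either the original or the substitute, and the replacement probability grows over time
    until only the compact model remains."""

    def __init__(self, predecessors, successors, p_replace: float = 0.3):
        super().__init__()
        assert len(predecessors) == len(successors)
        self.predecessors = nn.ModuleList(predecessors)  # e.g., groups of original BERT layers (kept frozen)
        self.successors = nn.ModuleList(successors)      # compact substitutes being trained
        self.p_replace = p_replace

    def set_replace_prob(self, step: int, total_steps: int, p_start: float = 0.3, p_end: float = 1.0):
        # Linear curriculum (an illustrative choice): replace more and more often as training proceeds.
        self.p_replace = p_start + (p_end - p_start) * min(1.0, step / max(1, total_steps))

    def forward(self, hidden):
        for pred, succ in zip(self.predecessors, self.successors):
            if self.training and random.random() < self.p_replace:
                hidden = succ(hidden)
            else:
                hidden = pred(hidden)
        return hidden
```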

2019

pdf bib
BERT-based Lexical Substitution
Wangchunshu Zhou | Tao Ge | Ke Xu | Furu Wei | Ming Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Previous studies on lexical substitution tend to obtain substitute candidates by finding the target word’s synonyms from lexical resources (e.g., WordNet) and then rank the candidates based on the target word’s contexts. These approaches have two limitations: (1) They are likely to overlook good substitute candidates that are not the synonyms of the target words in the lexical resources; (2) They fail to take into account the substitution’s influence on the global context of the sentence. To address these issues, we propose an end-to-end BERT-based lexical substitution approach which can propose and validate substitute candidates without using any annotated data or manually curated resources. Our approach first applies dropout to the target word’s embedding to partially mask the word, allowing BERT to take balanced consideration of the target word’s semantics and contexts for proposing substitute candidates, and then validates the candidates based on their substitution’s influence on the global contextualized representation of the sentence. Experiments show our approach performs well in both proposing and ranking substitute candidates, achieving the state-of-the-art results on both the LS07 and LS14 benchmarks.

pdf bib
Automatic Grammatical Error Correction for Sequence-to-sequence Text Generation: An Empirical Study
Tao Ge | Xingxing Zhang | Furu Wei | Ming Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sequence-to-sequence (seq2seq) models have achieved tremendous success in text generation tasks. However, there is no guarantee that they can always generate sentences without grammatical errors. In this paper, we present a preliminary empirical study on whether and how much automatic grammatical error correction can help improve seq2seq text generation. We conduct experiments across various seq2seq text generation tasks including machine translation, formality style transfer, sentence compression and simplification. Experiments show the state-of-the-art grammatical error correction system can improve the grammaticality of generated text and can bring task-oriented improvements in the tasks where target sentences are in a formal style.

2018

pdf bib
Fluency Boost Learning and Inference for Neural Grammatical Error Correction
Tao Ge | Furu Wei | Ming Zhou
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most of the neural sequence-to-sequence (seq2seq) models for grammatical error correction (GEC) have two limitations: (1) a seq2seq model may not be well generalized with only limited error-corrected data; (2) a seq2seq model may fail to completely correct a sentence with multiple errors through normal seq2seq inference. We attempt to address these limitations by proposing a fluency boost learning and inference mechanism. Fluency boosting learning generates fluency-boost sentence pairs during training, enabling the error correction model to learn how to improve a sentence’s fluency from more instances, while fluency boosting inference allows the model to correct a sentence incrementally with multiple inference steps until the sentence’s fluency stops increasing. Experiments show our approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both CoNLL-2014 and JFLEG benchmark datasets.
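
A minimal sketch of the fluency boosting inference loop, assuming a correction model and a fluency scorer (e.g., a language-model-based score) behind placeholder callables.

```python
from typing import Callable

def fluency_boost_inference(sentence: str,
                            correct: Callable[[str], str],
                            fluency: Callable[[str], float],
                            max_rounds: int = 5) -> str:
    """Run the correction model repeatedly, accepting an output only while it increases the
    fluency score; stop as soon as fluency no longer improves."""
    current, score = sentence, fluency(sentence)
    for _ in range(max_rounds):
        candidate = correct(current)
        cand_score = fluency(candidate)
        if cand_score <= score:      # fluency stopped increasing: keep the previous output
            break
        current, score = candidate, cand_score
    return current
```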

pdf bib
EventWiki: A Knowledge Base of Major Events
Tao Ge | Lei Cui | Baobao Chang | Zhifang Sui | Furu Wei | Ming Zhou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Fine-grained Coordinated Cross-lingual Text Stream Alignment for Endless Language Knowledge Acquisition
Tao Ge | Qing Dou | Heng Ji | Lei Cui | Baobao Chang | Zhifang Sui | Furu Wei | Ming Zhou
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper proposes to study fine-grained coordinated cross-lingual text stream alignment through a novel information network decipherment paradigm. We use Burst Information Networks as media to represent text streams and present a simple yet effective network decipherment algorithm with diverse clues to decipher the networks for accurate text stream alignment. Experiments on Chinese-English news streams show our approach not only outperforms previous approaches on bilingual lexicon extraction from coordinated text streams but also can harvest high-quality alignments from large amounts of streaming data for endless language knowledge mining, which makes it promising to be a new paradigm for automatic language knowledge acquisition.

2016

pdf bib
News Stream Summarization using Burst Information Networks
Tao Ge | Lei Cui | Baobao Chang | Sujian Li | Ming Zhou | Zhifang Sui
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Encoding Temporal Information for Time-Aware Link Prediction
Tingsong Jiang | Tianyu Liu | Tao Ge | Lei Sha | Sujian Li | Baobao Chang | Zhifang Sui
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Towards Time-Aware Knowledge Graph Completion
Tingsong Jiang | Tianyu Liu | Tao Ge | Lei Sha | Baobao Chang | Sujian Li | Zhifang Sui
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Knowledge graph (KG) completion adds new facts to a KG by making inferences from existing facts. Most existing methods ignore the time information and only learn from time-unknown fact triples. In dynamic environments that evolve over time, it is important and challenging for knowledge graph completion models to take into account the temporal aspects of facts. In this paper, we present a novel time-aware knowledge graph completion model that is able to predict links in a KG using both the existing facts and the temporal information of the facts. To incorporate the happening time of facts, we propose a time-aware KG embedding model using temporal order information among facts. To incorporate the valid time of facts, we propose a joint time-aware inference model based on Integer Linear Programming (ILP) using temporal consistency information as constraints. We further integrate the two models to make full use of global temporal information. We empirically evaluate our models on the time-aware KG completion task. Experimental results show that our time-aware models consistently achieve the state-of-the-art on temporal facts.

pdf bib
Event Detection with Burst Information Networks
Tao Ge | Lei Cui | Baobao Chang | Zhifang Sui | Ming Zhou
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Retrospective event detection is an important task for discovering previously unidentified events in a text stream. In this paper, we propose two fast centroid-aware event detection models based on a novel text stream representation – Burst Information Networks (BINets) for addressing the challenge. The BINets are time-aware, efficient and can be easily analyzed for identifying key information (centroids). These advantages allow the BINet-based approaches to achieve the state-of-the-art performance on multiple datasets, demonstrating the efficacy of BINets for the task of event detection.

2015

pdf bib
Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion
Yue Liu | Tao Ge | Kusum Mathews | Heng Ji | Deborah McGuinness
Proceedings of BioNLP 15

pdf bib
An Effective Neural Network Model for Graph-based Dependency Parsing
Wenzhe Pei | Tao Ge | Baobao Chang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Bring you to the past: Automatic Generation of Topically Relevant Event Chronicles
Tao Ge | Wenzhe Pei | Heng Ji | Sujian Li | Baobao Chang | Zhifang Sui
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
One Tense per Scene: Predicting Tense in Chinese Conversations
Tao Ge | Heng Ji | Baobao Chang | Zhifang Sui
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

pdf bib
The CIPS-SIGHAN CLP 2014 Chinese Word Segmentation Bake-off
Huiming Duan | Zhifang Sui | Tao Ge
Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf bib
Max-Margin Tensor Neural Network for Chinese Word Segmentation
Wenzhe Pei | Tao Ge | Baobao Chang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Event-Based Time Label Propagation for Automatic Dating of News Articles
Tao Ge | Baobao Chang | Sujian Li | Zhifang Sui
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing