Shaohan Huang


XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
Zewen Chi | Shaohan Huang | Li Dong | Shuming Ma | Bo Zheng | Saksham Singhal | Payal Bajaj | Xia Song | Xian-Ling Mao | Heyan Huang | Furu Wei
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection and translation replaced token detection. We pretrain the model, named XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability.

THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption
Tianyu Chen | Hangbo Bao | Shaohan Huang | Li Dong | Binxing Jiao | Daxin Jiang | Haoyi Zhou | Jianxin Li | Furu Wei
Findings of the Association for Computational Linguistics: ACL 2022

As more and more pre-trained language models adopt on-cloud deployment, privacy concerns grow quickly, mainly because of the exposure of plain-text user data (e.g., search history, medical record, bank account). Privacy-preserving inference of transformer models is in demand among cloud service users. To protect privacy, it is an attractive choice to compute only with ciphertext in homomorphic encryption (HE). However, enabling inference of pre-trained models on ciphertext data is difficult due to the complex computations in transformer blocks, which are not yet supported by current HE tools. In this work, we introduce THE-X, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models developed by popular frameworks. THE-X proposes a workflow to deal with complex computation in transformer networks, including all the non-polynomial functions like GELU, softmax, and LayerNorm. Experiments reveal our proposed THE-X can enable transformer inference on encrypted data for different downstream tasks, all with negligible performance drop but enjoying the theory-guaranteed privacy-preserving advantage.


Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task
Jian Yang | Shuming Ma | Haoyang Huang | Dongdong Zhang | Li Dong | Shaohan Huang | Alexandre Muzio | Saksham Singhal | Hany Hassan | Xia Song | Furu Wei
Proceedings of the Sixth Conference on Machine Translation

This report describes Microsoft’s machine translation systems for the WMT21 shared task on large-scale multilingual machine translation. We participated in all three evaluation tracks including Large Track and two Small Tracks where the former one is unconstrained and the latter two are fully constrained. Our model submissions to the shared task were initialized with DeltaLM, a generic pre-trained multilingual encoder-decoder model, and fine-tuned correspondingly with the vast collected parallel data and allowed data sources according to track settings, together with applying progressive learning and iterative back-translation approaches to further improve the performance. Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.

mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Zewen Chi | Li Dong | Shuming Ma | Shaohan Huang | Saksham Singhal | Xian-Ling Mao | Heyan Huang | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multilingual T5 pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on seven multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.

Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training
Bo Zheng | Li Dong | Shaohan Huang | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains
Yunzhi Yao | Shaohan Huang | Wenhui Wang | Li Dong | Furu Wei
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
Wenhui Wang | Hangbo Bao | Shaohan Huang | Li Dong | Furu Wei
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Pseudo-Label Guided Unsupervised Domain Adaptation of Contextual Embeddings
Tianyu Chen | Shaohan Huang | Furu Wei | Jianxin Li
Proceedings of the Second Workshop on Domain Adaptation for NLP

Contextual embedding models such as BERT can be easily fine-tuned on labeled samples to create a state-of-the-art model for many downstream tasks. However, the fine-tuned BERT model suffers considerably from unlabeled data when applied to a different domain. In unsupervised domain adaptation, we aim to train a model that works well on a target domain when provided with labeled source samples and unlabeled target samples. In this paper, we propose a pseudo-label guided method for unsupervised domain adaptation. Two models are fine-tuned on labeled source samples as pseudo labeling models. To learn representations for the target domain, one of those models is adapted by masked language modeling from the target domain. Then those models are used to assign pseudo-labels to target samples. We train the final model with those samples. We evaluate our method on named entity segmentation and sentiment analysis tasks. These experiments show that our approach outperforms baseline methods.

Consistency Regularization for Cross-Lingual Fine-Tuning
Bo Zheng | Li Dong | Shaohan Huang | Wenhui Wang | Zewen Chi | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.

Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment
Zewen Chi | Li Dong | Bo Zheng | Shaohan Huang | Xian-Ling Mao | Heyan Huang | Furu Wei
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on token-level tasks such as question answering and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves a reasonably low error rate on the alignment benchmarks. The code and pretrained parameters are available at


DocBank: A Benchmark Dataset for Document Layout Analysis
Minghao Li | Yiheng Xu | Lei Cui | Shaohan Huang | Furu Wei | Zhoujun Li | Ming Zhou
Proceedings of the 28th International Conference on Computational Linguistics

Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed using a simple yet effective way with weak supervision from the LaTeX documents available on the With DocBank, models from different modalities can be compared fairly and multi-modal approaches will be further investigated and boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset is publicly available at

Unsupervised Fine-tuning for Text Clustering
Shaohan Huang | Furu Wei | Lei Cui | Xingxing Zhang | Ming Zhou
Proceedings of the 28th International Conference on Computational Linguistics

Fine-tuning with pre-trained language models (e.g. BERT) has achieved great success in many language understanding tasks in supervised settings (e.g. text classification). However, relatively little work has been focused on applying pre-trained models in unsupervised settings, such as text clustering. In this paper, we propose a novel method to fine-tune pre-trained models unsupervisedly for text clustering, which simultaneously learns text representations and cluster assignments using a clustering oriented loss. Experiments on three text clustering datasets (namely TREC-6, Yelp, and DBpedia) show that our model outperforms the baseline methods and achieves state-of-the-art results.

Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph
Haozhe Ji | Pei Ke | Shaohan Huang | Furu Wei | Xiaoyan Zhu | Minlie Huang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite the success of generative pre-trained language models on a series of text generation tasks, they still suffer in cases where reasoning over underlying commonsense knowledge is required during generation. Existing approaches that integrate commonsense knowledge into generative pre-trained language models simply transfer relational knowledge by post-training on individual knowledge triples while ignoring rich connections within the knowledge graph. We argue that exploiting both the structural and semantic information of the knowledge graph facilitates commonsense-aware text generation. In this paper, we propose Generation with Multi-Hop Reasoning Flow (GRF) that enables pre-trained models with dynamic multi-hop reasoning on multi-relational paths extracted from the external commonsense knowledge graph. We empirically show that our model outperforms existing baselines on three text generation tasks that require reasoning over commonsense knowledge. We also demonstrate the effectiveness of the dynamic multi-hop reasoning module with reasoning paths inferred by the model that provide rationale to the generation.

Generating Commonsense Explanation by Extracting Bridge Concepts from Reasoning Paths
Haozhe Ji | Pei Ke | Shaohan Huang | Furu Wei | Minlie Huang
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Commonsense explanation generation aims to empower the machine’s sense-making capability by generating plausible explanations for statements that go against commonsense. While this task is easy for humans, the machine still struggles to generate reasonable and informative explanations. In this work, we propose a method that first extracts the underlying concepts which serve as bridges in the reasoning chain and then integrates these concepts to generate the final explanation. To facilitate the reasoning process, we utilize external commonsense knowledge to build the connection between a statement and the bridge concepts by extracting and pruning multi-hop paths to build a subgraph. We design a bridge concept extraction model that first scores the triples, routes the paths in the subgraph, and further selects bridge concepts with weak supervision at both the triple level and the concept level. We conduct experiments on the commonsense explanation generation task and our model outperforms the state-of-the-art baselines in both automatic and human evaluation.

TableBank: Table Benchmark for Image-based Table Detection and Recognition
Minghao Li | Lei Cui | Shaohan Huang | Furu Wei | Ming Zhou | Zhoujun Li
Proceedings of the 12th Language Resources and Evaluation Conference

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize to real-world applications. With TableBank, which contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models can be downloaded from


Neural Document Summarization by Jointly Learning to Score and Select Sentences
Qingyu Zhou | Nan Yang | Furu Wei | Shaohan Huang | Ming Zhou | Tiejun Zhao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Sentence scoring and sentence selection are two main steps in extractive document summarization systems. However, previous works treat them as two separated subtasks. In this paper, we present a novel end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences. It first reads the document sentences with a hierarchical encoder to obtain the representation of sentences. Then it builds the output summary by extracting sentences one by one. Different from previous methods, our approach integrates the selection strategy into the scoring model, which directly predicts the relative importance given previously selected sentences. Experiments on the CNN/Daily Mail dataset show that the proposed framework significantly outperforms the state-of-the-art extractive summarization models.


Learning to Generate Product Reviews from Attributes
Li Dong | Shaohan Huang | Furu Wei | Mirella Lapata | Ming Zhou | Ke Xu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Automatically generating product reviews is a meaningful, yet not well-studied task in sentiment analysis. Traditional natural language generation methods rely extensively on hand-crafted rules and predefined templates. This paper presents an attention-enhanced attribute-to-sequence model to generate product reviews for given attribute information, such as user, product, and rating. The attribute encoder learns to represent input attributes as vectors. Then, the sequence decoder generates reviews by conditioning its output on these vectors. We also introduce an attention mechanism to jointly generate reviews and align words with input attributes. The proposed model is trained end-to-end to maximize the likelihood of target product reviews given the attributes. We build a publicly available dataset for the review generation task by leveraging the Amazon book reviews and their metadata. Experiments on the dataset show that our approach outperforms baseline methods and the attention mechanism significantly improves the performance of our model.

SuperAgent: A Customer Service Chatbot for E-commerce Websites
Lei Cui | Shaohan Huang | Furu Wei | Chuanqi Tan | Chaoqun Duan | Ming Zhou
Proceedings of ACL 2017, System Demonstrations