Shuming Ma


2021

pdf bib
Zero-Shot Cross-Lingual Transfer of Neural Machine Translation with Multilingual Pretrained Encoders
Guanhua Chen | Shuming Ma | Yun Chen | Li Dong | Dongdong Zhang | Jia Pan | Wenping Wang | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous work mainly focuses on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or improving the performance on supervised machine translation with BERT. However, it is under-explored that whether the MPE can help to facilitate the cross-lingual transferability of NMT model. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT. In this task, the NMT model is trained with parallel dataset of only one language pair and an off-the-shelf MPE, then it is directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gets further improvement with a position disentangled encoder and a capacity-enhanced decoder. Using this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model explicitly designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.

pdf bib
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Zewen Chi | Li Dong | Shuming Ma | Shaohan Huang | Saksham Singhal | Xian-Ling Mao | Heyan Huang | Xia Song | Furu Wei
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multilingual T5 pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on seven multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.

pdf bib
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Weijia Xu | Shuming Ma | Dongdong Zhang | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Improving Multilingual Neural Machine Translation with Auxiliary Source Languages
Weijia Xu | Yuwei Yin | Shuming Ma | Dongdong Zhang | Haoyang Huang
Findings of the Association for Computational Linguistics: EMNLP 2021

Multilingual neural machine translation models typically handle one source language at a time. However, prior work has shown that translating from multiple source languages improves translation quality. Different from existing approaches on multi-source translation that are limited to the test scenario where parallel source sentences from multiple languages are available at inference time, we propose to improve multilingual translation in a more common scenario by exploiting synthetic source sentences from auxiliary languages. We train our model on synthetic multi-source corpora and apply random masking to enable flexible inference with single-source or bi-source inputs. Extensive experiments on Chinese/English-Japanese and a large-scale multilingual translation benchmark show that our model outperforms the multilingual baseline significantly by up to +4.0 BLEU with the largest improvements on low-resource or distant language pairs.

pdf bib
Smart-Start Decoding for Neural Machine Translation
Jian Yang | Shuming Ma | Dongdong Zhang | Juncheng Wan | Zhoujun Li | Ming Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Most current neural machine translation models adopt a monotonic decoding order of either left-to-right or right-to-left. In this work, we propose a novel method that breaks up the limitation of these decoding orders, called Smart-Start decoding. More specifically, our method first predicts a median word. It starts to decode the words on the right side of the median word and then generates words on the left. We evaluate the proposed Smart-Start decoding method on three datasets. Experimental results show that the proposed method can significantly outperform strong baseline models.

pdf bib
Multilingual Agreement for Multilingual Neural Machine Translation
Jian Yang | Yuwei Yin | Shuming Ma | Haoyang Huang | Dongdong Zhang | Zhoujun Li | Furu Wei
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Although multilingual neural machine translation (MNMT) enables multiple language translations, the training process is based on independent multilingual objectives. Most multilingual models can not explicitly exploit different language pairs to assist each other, ignoring the relationships among them. In this work, we propose a novel agreement-based method to encourage multilingual agreement among different translation directions, which minimizes the differences among them. We combine the multilingual training objectives with the agreement term by randomly substituting some fragments of the source language with their counterpart translations of auxiliary languages. To examine the effectiveness of our method, we conduct experiments on the multilingual translation task of 10 language pairs. Experimental results show that our method achieves significant improvements over the previous multilingual baselines.

2020

pdf bib
A Simple and Effective Unified Encoder for Document-Level Machine Translation
Shuming Ma | Dongdong Zhang | Ming Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Most of the existing models for document-level machine translation adopt dual-encoder structures. The representation of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the document-level contexts, they do not fully model the interaction between the contexts and the source sentences, and can not directly adapt to the recent pre-training models (e.g., BERT) which encodes multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that can outperform the baseline models of dual-encoder models in terms of BLEU and METEOR scores. Moreover, the pre-training models can further boost the performance of our proposed model.

pdf bib
Improving Neural Machine Translation with Soft Template Prediction
Jian Yang | Shuming Ma | Dongdong Zhang | Zhoujun Li | Ming Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Although neural machine translation (NMT) has achieved significant progress in recent years, most previous NMT models only depend on the source text to generate translation. Inspired by the success of template-based and syntax-based approaches in other fields, we propose to use extracted templates from tree structures as soft target templates to guide the translation procedure. In order to learn the syntactic structure of the target sentences, we adopt constituency-based parse tree to generate candidate templates. We incorporate the template information into the encoder-decoder framework to jointly utilize the templates and source text. Experiments show that our model significantly outperforms the baseline models on four benchmarks and demonstrates the effectiveness of soft target templates.

2019

pdf bib
Group, Extract and Aggregate: Summarizing a Large Amount of Finance News for Forex Movement Prediction
Deli Chen | Shuming Ma | Keiko Harimoto | Ruihan Bao | Qi Su | Xu Sun
Proceedings of the Second Workshop on Economics and Natural Language Processing

Incorporating related text information has proven successful in stock market prediction. However, it is a huge challenge to utilize texts in the enormous forex (foreign currency exchange) market because the associated texts are too redundant. In this work, we propose a BERT-based Hierarchical Aggregation Model to summarize a large amount of finance news to predict forex movement. We firstly group news from different aspects: time, topic and category. Then we extract the most crucial news in each group by the SOTA extractive summarization method. Finally, we conduct interaction between the news and the trade data with attention to predict the forex movement. The experimental results show that the category based method performs best among three grouping methods and outperforms all the baselines. Besides, we study the influence of essential news attributes (category and region) by statistical analysis and summarize the influence patterns for different currency pairs.

pdf bib
Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation
Shuming Ma | Pengcheng Yang | Tianyu Liu | Peng Li | Jie Zhou | Xu Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Table-to-text generation aims to translate the structured data into the unstructured text. Most existing methods adopt the encoder-decoder framework to learn the transformation, which requires large-scale training samples. However, the lack of large parallel data is a major practical problem for many domains. In this work, we consider the scenario of low resource table-to-text generation, where only limited parallel data is available. We propose a novel model to separate the generation into two stages: key fact prediction and surface realization. It first predicts the key facts from the tables, and then generates the text with the key facts. The training of key fact prediction needs much fewer annotated data, while surface realization can be trained with pseudo parallel corpus. We evaluate our model on a biography generation dataset. Our model can achieve 27.34 BLEU score with only 1,000 parallel data, while the baseline model only obtain the performance of 9.71 BLEU score.

pdf bib
A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification
Pengcheng Yang | Fuli Luo | Shuming Ma | Junyang Lin | Xu Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Multi-label classification (MLC) aims to predict a set of labels for a given instance. Based on a pre-defined label order, the sequence-to-sequence (Seq2Seq) model trained via maximum likelihood estimation method has been successfully applied to the MLC task and shows powerful ability to capture high-order correlations between labels. However, the output labels are essentially an unordered set rather than an ordered sequence. This inconsistency tends to result in some intractable problems, e.g., sensitivity to the label order. To remedy this, we propose a simple but effective sequence-to-set model. The proposed model is trained via reinforcement learning, where reward feedback is designed to be independent of the label order. In this way, we can reduce the dependence of the model on the label order, as well as capture high-order correlations between labels. Extensive experiments show that our approach can substantially outperform competitive baselines, as well as effectively reduce the sensitivity to the label order.

2018

pdf bib
Phrase-level Self-Attention Networks for Universal Sentence Encoding
Wei Wu | Houfeng Wang | Tianyu Liu | Shuming Ma
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Universal sentence encoding is a hot topic in recent NLP research. Attention mechanism has been an integral part in many sentence encoding models, allowing the models to capture context dependencies regardless of the distance between the elements in the sequence. Fully attention-based models have recently attracted enormous interest due to their highly parallelizable computation and significantly less training time. However, the memory consumption of their models grows quadratically with the sentence length, and the syntactic information is neglected. To this end, we propose Phrase-level Self-Attention Networks (PSAN) that perform self-attention across words inside a phrase to capture context dependencies at the phrase level, and use the gated memory updating mechanism to refine each word’s representation hierarchically with longer-term context dependencies captured in a larger phrase. As a result, the memory consumption can be reduced because the self-attention is performed at the phrase level instead of the sentence level. At the same time, syntactic information can be easily integrated in the model. Experiment results show that PSAN can achieve the state-of-the-art performance across a plethora of NLP tasks including binary and multi-class classification, natural language inference and sentence similarity.

pdf bib
Semantic-Unit-Based Dilated Convolution for Multi-Label Text Classification
Junyang Lin | Qi Su | Pengcheng Yang | Shuming Ma | Xu Sun
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution as well as a corresponding hybrid attention mechanism that extracts both the information at the word-level and the level of the semantic unit. Our designed dilated convolution effectively reduces dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more summary relevant information from the source context. Results of our experiments show that the proposed model has significant advantages over the baseline models on the dataset RCV1-V2 and Ren-CECps, and our analysis demonstrates that our model is competitive to the deterministic hierarchical models and it is more robust to classifying low-frequency labels

pdf bib
Global Encoding for Abstractive Summarization
Junyang Lin | Xu Sun | Shuming Ma | Qi Su
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit to perform global encoding to improve the representations of the source-side information. Evaluations on the LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that our model is capable of generating summary of higher quality and reducing repetition.

pdf bib
Bag-of-Words as Target for Neural Machine Translation
Shuming Ma | Xu Sun | Yizhong Wang | Junyang Lin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A sentence can be translated into more than one correct sentences. However, most of the existing neural machine translation models only use one of the correct translations as the targets, and the other correct sentences are punished as the incorrect sentences in the training stage. Since most of the correct translations for one sentence share the similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate the potentially correct sentences that are not appeared in the training set. We evaluate our model on a Chinese-English translation dataset, and experiments show our model outperforms the strong baselines by the BLEU score of 4.55.

pdf bib
Automatic Academic Paper Rating Based on Modularized Hierarchical Convolutional Neural Network
Pengcheng Yang | Xu Sun | Wei Li | Shuming Ma
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

As more and more academic papers are being submitted to conferences and journals, evaluating all these papers by professionals is time-consuming and can cause inequality due to the personal factors of the reviewers. In this paper, in order to assist professionals in evaluating academic papers, we propose a novel task: automatic academic paper rating (AAPR), which automatically determine whether to accept academic papers. We build a new dataset for this task and propose a novel modularized hierarchical convolutional neural network to achieve automatic academic paper rating. Evaluation results show that the proposed model outperforms the baselines by a large margin. The dataset and code are available at https://github.com/lancopku/AAPR

pdf bib
Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization
Shuming Ma | Xu Sun | Junyang Lin | Houfeng Wang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Most of the current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written. Moreover, it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves the state-of-the-art performances on the benchmark dataset.

pdf bib
Does Higher Order LSTM Have Better Accuracy for Segmenting and Labeling Sequence Data?
Yi Zhang | Xu Sun | Shuming Ma | Yang Yang | Xuancheng Ren
Proceedings of the 27th International Conference on Computational Linguistics

Existing neural models usually predict the tag of the current token independent of the neighboring tags. The popular LSTM-CRF model considers the tag dependencies between every two consecutive tags. However, it is hard for existing neural models to take longer distance dependencies between tags into consideration. The scalability is mainly limited by the complex model structures and the cost of dynamic programming during training. In our work, we first design a new model called “high order LSTM” to predict multiple tags for the current token which contains not only the current tag but also the previous several tags. We call the number of tags in one prediction as “order”. Then we propose a new method called Multi-Order BiLSTM (MO-BiLSTM) which combines low order and high order LSTMs together. MO-BiLSTM keeps the scalability to high order models with a pruning technique. We evaluate MO-BiLSTM on all-phrase chunking and NER datasets. Experiment results show that MO-BiLSTM achieves the state-of-the-art result in chunking and highly competitive results in two NER datasets.

pdf bib
A Neural Question Answering Model Based on Semi-Structured Tables
Hao Wang | Xiaodong Zhang | Shuming Ma | Xu Sun | Houfeng Wang | Mengxiang Wang
Proceedings of the 27th International Conference on Computational Linguistics

Most question answering (QA) systems are based on raw text and structured knowledge graph. However, raw text corpora are hard for QA system to understand, and structured knowledge graph needs intensive manual work, while it is relatively easy to obtain semi-structured tables from many sources directly, or build them automatically. In this paper, we build an end-to-end system to answer multiple choice questions with semi-structured tables as its knowledge. Our system answers queries by two steps. First, it finds the most similar tables. Then the system measures the relevance between each question and candidate table cells, and choose the most related cell as the source of answer. The system is evaluated with TabMCQ dataset, and gets a huge improvement compared to the state of the art.

pdf bib
Deconvolution-Based Global Decoding for Neural Machine Translation
Junyang Lin | Xu Sun | Xuancheng Ren | Shuming Ma | Jinsong Su | Qi Su
Proceedings of the 27th International Conference on Computational Linguistics

A great proportion of sequence-to-sequence (Seq2Seq) models for Neural Machine Translation (NMT) adopt Recurrent Neural Network (RNN) to generate translation word by word following a sequential order. As the studies of linguistics have proved that language is not linear word sequence but sequence of complex structure, translation at each step should be conditioned on the whole target-side context. To tackle the problem, we propose a new NMT model that decodes the sequence with the guidance of its structural prediction of the context of the target sequence. Our model generates translation based on the structural prediction of the target-side context so that the translation can be freed from the bind of sequential order. Experimental results demonstrate that our model is more competitive compared with the state-of-the-art methods, and the analysis reflects that our model is also robust to translating sentences of different lengths and it also reduces repetition with the instruction from the target-side context for decoding.

pdf bib
SGM: Sequence Generation Model for Multi-label Classification
Pengcheng Yang | Xu Sun | Wei Li | Shuming Ma | Wei Wu | Houfeng Wang
Proceedings of the 27th International Conference on Computational Linguistics

Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently for predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive experimental results show that our proposed methods outperform previous work by a substantial margin. Further analysis of experimental results demonstrates that the proposed methods not only capture the correlations between labels, but also select the most informative words automatically when predicting different labels.

pdf bib
Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation
Shuming Ma | Xu Sun | Wei Li | Sujian Li | Wenjie Li | Xuancheng Ren
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates the words by querying distributed word representations (i.e. neural word embeddings), hoping to capturing the meaning of the according words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by the BLEU score of 6.3 and 5.5 on two English text simplification datasets, and the ROUGE-2 F1 score of 5.7 on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performances on these three benchmark datasets.

2017

pdf bib
Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization
Shuming Ma | Xu Sun | Jingjing Xu | Houfeng Wang | Wenjie Li | Qi Su
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Current Chinese social media text summarization models are based on an encoder-decoder framework. Although its generated summaries are similar to source texts literally, they have low semantic relevance. In this work, our goal is to improve semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder. Besides, the similarity score between the representations is maximized during training. Our experiments show that the proposed model outperforms baseline systems on a social media corpus.