Zhaopeng Tu


2021

pdf bib
Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation
Yongchang Hao | Shilin He | Wenxiang Jiao | Zhaopeng Tu | Michael Lyu | Xing Wang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Non-Autoregressive machine Translation (NAT) models have demonstrated significant inference speedup but suffer from inferior translation accuracy. The common practice to tackle the problem is transferring the Autoregressive machine Translation (AT) knowledge to NAT models, e.g., with knowledge distillation. In this work, we hypothesize and empirically verify that AT and NAT encoders capture different linguistic properties of source sentences. Therefore, we propose to adopt multi-task learning to transfer the AT knowledge to NAT models through encoder sharing. Specifically, we take the AT model as an auxiliary task to enhance NAT model performance. Experimental results on WMT14 En-De and WMT16 En-Ro datasets show that the proposed Multi-Task NAT achieves significant improvements over the baseline NAT models. Furthermore, the performance on large-scale WMT19 and WMT20 En-De datasets confirm the consistency of our proposed method. In addition, experimental results demonstrate that our Multi-Task NAT is complementary to knowledge distillation, the standard knowledge transfer method for NAT.

pdf bib
Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation
Wenxiang Jiao | Xing Wang | Zhaopeng Tu | Shuming Shi | Michael Lyu | Irwin King
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns which may not provide additional gains. Accordingly, we design an uncertainty-based sampling strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty would be sampled with higher probability. Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach. Extensive analyses suggest that emphasizing the learning on uncertain monolingual sentences by our approach does improve the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words at the target side.

pdf bib
Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models. However, there exists a discrepancy on low-frequency words between the distilled and the original data, leading to more errors on predicting low-frequency words. To alleviate the problem, we directly expose the raw data into NAT by leveraging pretraining. By analyzing directed alignments, we found that KD makes low-frequency source words aligned with targets more deterministically but fails to align sufficient low-frequency words from target to source. Accordingly, we propose reverse KD to rejuvenate more alignments for low-frequency target words. To make the most of authentic and synthetic data, we combine these complementary approaches as a new training strategy for further boosting NAT performance. We conduct experiments on five translation benchmarks over two advanced architectures. Results demonstrate that the proposed approach can significantly and universally improve translation quality by reducing translation errors on low-frequency words. Encouragingly, our approach achieves 28.2 and 33.9 BLEU points on the WMT14 English-German and WMT16 Romanian-English datasets, respectively. Our code, data, and trained models are available at https://github.com/longyuewangdcu/RLFW-NAT.

pdf bib
Progressive Multi-Granularity Training for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
On the Copying Behaviors of Pre-Training for Neural Machine Translation
Xuebo Liu | Longyue Wang | Derek F. Wong | Liang Ding | Lidia S. Chao | Shuming Shi | Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
On the Language Coverage Bias for Neural Machine Translation
Shuo Wang | Zhaopeng Tu | Zhixing Tan | Shuming Shi | Maosong Sun | Yang Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Emotion Classification by Jointly Learning to Lexiconize and Classify
Deyu Zhou | Shuangzhi Wu | Qing Wang | Jun Xie | Zhaopeng Tu | Mu Li
Proceedings of the 28th International Conference on Computational Linguistics

Emotion lexicons have been shown effective for emotion classification (Baziotis et al., 2018). Previous studies handle emotion lexicon construction and emotion classification separately. In this paper, we propose an emotional network (EmNet) to jointly learn sentence emotions and construct emotion lexicons which are dynamically adapted to a given context. The dynamic emotion lexicons are useful for handling words with multiple emotions based on different context, which can effectively improve the classification accuracy. We validate the approach on two representative architectures – LSTM and BERT, demonstrating its superiority on identifying emotions in Tweets. Our model outperforms several approaches proposed in previous studies and achieves new state-of-the-art on the benchmark Twitter dataset.

pdf bib
Context-Aware Cross-Attention for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Di Wu | Dacheng Tao | Zhaopeng Tu
Proceedings of the 28th International Conference on Computational Linguistics

Non-autoregressive translation (NAT) significantly accelerates the inference process by predicting the entire target sequence. However, due to the lack of target dependency modelling in the decoder, the conditional generation process heavily depends on the cross-attention. In this paper, we reveal a localness perception problem in NAT cross-attention, for which it is difficult to adequately capture source context. To alleviate this problem, we propose to enhance signals of neighbour source tokens into conventional cross-attention. Experimental results on several representative datasets show that our approach can consistently improve translation quality over strong NAT baselines. Extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.

pdf bib
EmpDG: Multi-resolution Interactive Empathetic Dialogue Generation
Qintong Li | Hongshen Chen | Zhaochun Ren | Pengjie Ren | Zhaopeng Tu | Zhumin Chen
Proceedings of the 28th International Conference on Computational Linguistics

A humanized dialogue system is expected to generate empathetic replies, which should be sensitive to the users’ expressed emotion. The task of empathetic dialogue generation is proposed to address this problem. The essential challenges lie in accurately capturing the nuances of human emotion and considering the potential of user feedback, which are overlooked by the majority of existing work. In response to this problem, we propose a multi-resolution adversarial model – EmpDG, to generate more empathetic responses. EmpDG exploits both the coarse-grained dialogue-level and fine-grained token-level emotions, the latter of which helps to better capture the nuances of user emotion. In addition, we introduce an interactive adversarial learning framework which exploits the user feedback, to identify whether the generated responses evoke emotion perceptivity in dialogues. Experimental results show that the proposed approach significantly outperforms the state-of-the-art baselines in both content quality and emotion perceptivity.

pdf bib
Rethinking the Value of Transformer Components
Wenxuan Wang | Zhaopeng Tu
Proceedings of the 28th International Conference on Computational Linguistics

Transformer becomes the state-of-the-art translation model, while it is not well studied how each intermediate component contributes to the model performance, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual component (sub-layer) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than the others. We also report a number of interesting findings that might help humans better analyze, understand and improve Transformer models. Based on these observations, we further propose a new training strategy that can improves translation performance by distinguishing the unimportant components in training.

pdf bib
On the Sub-layer Functionalities of Transformer Decoder
Yilin Yang | Longyue Wang | Shuming Shi | Prasad Tadepalli | Stefan Lee | Zhaopeng Tu
Findings of the Association for Computational Linguistics: EMNLP 2020

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages – developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight on when and where decoders leverage different sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance – a significant reduction in computation and number of parameters, and consequently a significant boost to both training and inference speed.

pdf bib
How Does Selective Mechanism Improve Self-Attention Networks?
Xinwei Geng | Longyue Wang | Xing Wang | Bing Qin | Ting Liu | Zhaopeng Tu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Self-attention networks (SANs) with selective mechanism has produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge the gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language inference, semantic role labelling, and machine translation, show that SSANs consistently outperform the standard SANs. Through well-designed probing experiments, we empirically validate that the improvement of SSANs can be attributed in part to mitigating two commonly-cited weaknesses of SANs: word order encoding and structure modeling. Specifically, the selective mechanism improves SANs by paying more attention to content words that contribute to the meaning of the sentence.

pdf bib
On the Inference Calibration of Neural Machine Translation
Shuo Wang | Zhaopeng Tu | Shuming Shi | Yang Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Confidence calibration, which aims to make model predictions equal to the true correctness measures, is important for neural machine translation (NMT) because it is able to offer useful indicators of translation errors in the generated output. While prior studies have shown that NMT models trained with label smoothing are well-calibrated on the ground-truth training data, we find that miscalibration still remains a severe challenge for NMT during inference due to the discrepancy between training and inference. By carefully designing experiments on three language pairs, our work provides in-depth analyses of the correlation between calibration and translation performance as well as linguistic properties of miscalibration and reports a number of interesting findings that might help humans better analyze, understand and improve NMT models. Based on these observations, we further propose a new graduated label smoothing method that can improve both inference calibration and translation performance.

pdf bib
Tencent Neural Machine Translation Systems for the WMT20 News Translation Task
Shuangzhi Wu | Xing Wang | Longyue Wang | Fangxu Liu | Jun Xie | Zhaopeng Tu | Shuming Shi | Mu Li
Proceedings of the Fifth Conference on Machine Translation

This paper describes Tencent Neural Machine Translation systems for the WMT 2020 news translation tasks. We participate in the shared news translation task on English Chinese and English German language pairs. Our systems are built on deep Transformer and several data augmentation methods. We propose a boosted in-domain finetuning method to improve single models. Ensemble is used to combine single models and we propose an iterative transductive ensemble method which can further improve the translation performance based on the ensemble results. We achieve a BLEU score of 36.8 and the highest chrF score of 0.648 on Chinese English task.

pdf bib
Tencent AI Lab Machine Translation Systems for WMT20 Chat Translation Task
Longyue Wang | Zhaopeng Tu | Xing Wang | Li Ding | Liang Ding | Shuming Shi
Proceedings of the Fifth Conference on Machine Translation

This paper describes the Tencent AI Lab’s submission of the WMT 2020 shared task on chat translation in English-German. Our neural machine translation (NMT) systems are built on sentence-level, document-level, non-autoregressive (NAT) and pretrained models. We integrate a number of advanced techniques into our systems, including data selection, back/forward translation, larger batch learning, model ensemble, finetuning as well as system combination. Specifically, we proposed a hybrid data selection method to select high-quality and in-domain sentences from out-of-domain data. To better capture the source contexts, we exploit to augment NAT models with evolved cross-attention. Furthermore, we explore to transfer general knowledge from four different pre-training language models to the downstream translation task. In general, we present extensive experimental results for this new translation task. Among all the participants, our German-to-English primary system is ranked the second in terms of BLEU scores.

pdf bib
Tencent AI Lab Machine Translation Systems for the WMT20 Biomedical Translation Task
Xing Wang | Zhaopeng Tu | Longyue Wang | Shuming Shi
Proceedings of the Fifth Conference on Machine Translation

This paper describes the Tencent AI Lab submission of the WMT2020 shared task on biomedical translation in four language directions: German<->English, English<->German, Chinese<->English and English<->Chinese. We implement our system with model ensemble technique on different transformer architectures (Deep, Hybrid, Big, Large Transformers). To enlarge the in-domain bilingual corpus, we use back-translation of monolingual in-domain data in the target language as additional in-domain training data. Our systems in German->English and English->German are ranked 1st and 3rd respectively according to the official evaluation results in terms of BLEU scores.

pdf bib
On the Sparsity of Neural Machine Translation Models
Yong Wang | Longyue Wang | Victor Li | Zhaopeng Tu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Modern neural machine translation (NMT) models employ a large number of parameters, which leads to serious over-parameterization and typically causes the underutilization of computational resources. In response to this problem, we empirically investigate whether the redundant parameters can be reused to achieve better performance. Experiments and analyses are systematically conducted on different datasets and NMT architectures. We show that: 1) the pruned parameters can be rejuvenated to improve the baseline model by up to +0.8 BLEU points; 2) the rejuvenated parameters are reallocated to enhance the ability of modeling low-level lexical information.

pdf bib
Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation
Wenxiang Jiao | Xing Wang | Shilin He | Irwin King | Michael Lyu | Zhaopeng Tu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noises in the large-scale data make training NMT models difficult. In this work, we explore to identify the inactive training examples which contribute less to the model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data, and use it to distinguish inactive examples and active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples, which is used to re-label the inactive examples with forward- translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.

2019

pdf bib
Multi-Granularity Self-Attention for Neural Machine Translation
Jie Hao | Xing Wang | Shuming Shi | Jinfeng Zhang | Zhaopeng Tu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Current state-of-the-art neural machine translation (NMT) uses a deep multi-head self-attention network with no explicit phrase information. However, prior work on statistical machine translation has shown that extending the basic translation unit from words to phrases has produced substantial improvements, suggesting the possibility of improving NMT performance from explicit modeling of phrases. In this work, we present multi-granularity self-attention (Mg-Sa): a neural network that combines multi-head self-attention and phrase modeling. Specifically, we train several attention heads to attend to phrases in either n-gram or syntactic formalisms. Moreover, we exploit interactions among phrases to enhance the strength of structure modeling – a commonly-cited weakness of self-attention. Experimental results on WMT14 English-to-German and NIST Chinese-to-English translation tasks show the proposed approach consistently improves performance. Targeted linguistic analysis reveal that Mg-Sa indeed captures useful phrase information at various levels of granularities.

pdf bib
One Model to Learn Both: Zero Pronoun Prediction and Translation
Longyue Wang | Zhaopeng Tu | Xing Wang | Shuming Shi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Zero pronouns (ZPs) are frequently omitted in pro-drop languages, but should be recalled in non-pro-drop languages. This discourse phenomenon poses a significant challenge for machine translation (MT) when translating texts from pro-drop to non-pro-drop languages. In this paper, we propose a unified and discourse-aware ZP translation approach for neural MT models. Specifically, we jointly learn to predict and translate ZPs in an end-to-end manner, allowing both components to interact with each other. In addition, we employ hierarchical neural networks to exploit discourse-level context, which is beneficial for ZP prediction and thus translation. Experimental results on both Chinese-English and Japanese-English data show that our approach significantly and accumulatively improves both translation performance and ZP prediction accuracy over not only baseline but also previous works using external ZP prediction models. Extensive analyses confirm that the performance improvement comes from the alleviation of different kinds of errors especially caused by subjective ZPs.

pdf bib
Dynamic Past and Future for Neural Machine Translation
Zaixiang Zheng | Shujian Huang | Zhaopeng Tu | Xin-Yu Dai | Jiajun Chen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Previous studies have shown that neural machine translation (NMT) models can benefit from explicitly modeling translated () and untranslated () source contents as recurrent states (CITATION). However, this less interpretable recurrent process hinders its power to model the dynamic updating of and contents during decoding. In this paper, we propose to model the dynamic principles by explicitly separating source words into groups of translated and untranslated contents through parts-to-wholes assignment. The assignment is learned through a novel variant of routing-by-agreement mechanism (CITATION), namely Guided Dynamic Routing, where the translating status at each decoding step guides the routing process to assign each source word to its associated group (i.e., translated or untranslated content) represented by a capsule, enabling translation to be made from holistic context. Experiments show that our approach achieves substantial improvements over both Rnmt and Transformer by producing more adequate translations. Extensive analysis demonstrates that our method is highly interpretable, which is able to recognize the translated and untranslated contents as expected.

pdf bib
Towards Understanding Neural Machine Translation with Word Importance
Shilin He | Zhaopeng Tu | Xing Wang | Longyue Wang | Michael Lyu | Shuming Shi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Although neural machine translation (NMT) has advanced the state-of-the-art on various language pairs, the interpretability of NMT remains unsatisfactory. In this work, we propose to address this gap by focusing on understanding the input-output behavior of NMT models. Specifically, we measure the word importance by attributing the NMT output to every input word through a gradient-based method. We validate the approach on a couple of perturbation operations, language pairs, and model architectures, demonstrating its superiority on identifying input words with higher influence on translation performance. Encouragingly, the calculated importance can serve as indicators of input words that are under-translated by NMT models. Furthermore, our analysis reveals that words of certain syntactic categories have higher importance while the categories vary across language pairs, which can inspire better design principles of NMT architectures for multi-lingual translation.

pdf bib
Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons
Jie Hao | Xing Wang | Shuming Shi | Jinfeng Zhang | Zhaopeng Tu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent studies have shown that a hybrid of self-attention networks (SANs) and recurrent neural networks RNNs outperforms both individual architectures, while not much is known about why the hybrid models work. With the belief that modeling hierarchical structure is an essential complementary between SANs and RNNs, we propose to further enhance the strength of hybrid models with an advanced variant of RNNs – Ordered Neurons LSTM (ON-LSTM), which introduces a syntax-oriented inductive bias to perform tree-like composition. Experimental results on the benchmark machine translation task show that the proposed approach outperforms both individual architectures and a standard hybrid model. Further analyses on targeted linguistic evaluation and logical inference tasks demonstrate that the proposed approach indeed benefits from a better modeling of hierarchical structure.

pdf bib
Self-Attention with Structural Position Representations
Xing Wang | Zhaopeng Tu | Longyue Wang | Shuming Shi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Although self-attention networks (SANs) have advanced the state-of-the-art on various NLP tasks, one criticism of SANs is their ability of encoding positions of input words (Shaw et al., 2018). In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence, which is complementary to the standard sequential positional representations. Specifically, we use dependency tree to represent the grammatical structure of a sentence, and propose two strategies to encode the positional relationships among words in the dependency tree. Experimental results on NIST Chinese-to-English and WMT14 English-to-German translation tasks show that the proposed approach consistently boosts performance over both the absolute and relative sequential position representations.

pdf bib
Retrieval-guided Dialogue Response Generation via a Matching-to-Generation Framework
Deng Cai | Yan Wang | Wei Bi | Zhaopeng Tu | Xiaojiang Liu | Shuming Shi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

End-to-end sequence generation is a popular technique for developing open domain dialogue systems, though they suffer from the safe response problem. Researchers have attempted to tackle this problem by incorporating generative models with the returns of retrieval systems. Recently, a skeleton-then-response framework has been shown promising results for this task. Nevertheless, how to precisely extract a skeleton and how to effectively train a retrieval-guided response generator are still challenging. This paper presents a novel framework in which the skeleton extraction is made by an interpretable matching model and the following skeleton-guided response generation is accomplished by a separately trained generator. Extensive experiments demonstrate the effectiveness of our model designs.

pdf bib
Modeling Recurrence for Transformer
Jie Hao | Xing Wang | Baosong Yang | Longyue Wang | Jinfeng Zhang | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recently, the Transformer model that is based solely on attention mechanisms, has advanced the state-of-the-art on various machine translation tasks. However, recent studies reveal that the lack of recurrence modeling hinders its further improvement of translation capacity. In response to this problem, we propose to directly model recurrence for Transformer with an additional recurrence encoder. In addition to the standard recurrent neural network, we introduce a novel attentive recurrent network to leverage the strengths of both attention models and recurrent networks. Experimental results on the widely-used WMT14 English⇒German and WMT17 Chinese⇒English translation tasks demonstrate the effectiveness of the proposed approach. Our studies also reveal that the proposed model benefits from a short-cut that bridges the source and target sequences with a single recurrent layer, which outperforms its deep counterpart.

pdf bib
Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory
Deng Cai | Yan Wang | Wei Bi | Zhaopeng Tu | Xiaojiang Liu | Wai Lam | Shuming Shi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Traditional generative dialogue models generate responses solely from input queries. Such information is insufficient for generating a specific response since a certain query could be answered in multiple ways. Recently, researchers have attempted to fill the information gap by exploiting information retrieval techniques. For a given query, similar dialogues are retrieved from the entire training data and considered as an additional knowledge source. While the use of retrieval may harvest extensive information, the generative models could be overwhelmed, leading to unsatisfactory performance. In this paper, we propose a new framework which exploits retrieval results via a skeleton-to-response paradigm. At first, a skeleton is extracted from the retrieved dialogues. Then, both the generated skeleton and the original query are used for response generation via a novel response generator. Experimental results show that our approach significantly improves the informativeness of the generated responses

pdf bib
Information Aggregation for Multi-Head Attention with Routing-by-Agreement
Jian Li | Baosong Yang | Zi-Yi Dou | Xing Wang | Michael R. Lyu | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Multi-head attention is appealing for its ability to jointly extract different types of information from multiple representation subspaces. Concerning the information aggregation, a common practice is to use a concatenation followed by a linear transformation, which may not fully exploit the expressiveness of multi-head attention. In this work, we propose to improve the information aggregation for multi-head attention with a more powerful routing-by-agreement algorithm. Specifically, the routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on linguistic probing tasks and machine translation tasks prove the superiority of the advanced information aggregation over the standard linear transformation.

pdf bib
Convolutional Self-Attention Networks
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies. SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces. In this work, we propose novel convolutional self-attention networks, which offer SANs the abilities to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads. Experimental results of machine translation on different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models on enhancing the locality of SANs. Comparing with prior studies, the proposed model is parameter free in terms of introducing no more parameters.

pdf bib
Assessing the Ability of Self-Attention Networks to Learn Word Order
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self-attention networks (SAN) have attracted a lot of interests due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Due to the lack of recurrence structure such as recurrent neural networks (RNN), SAN is ascribed to be weak at learning positional information of words for sequence modeling. However, neither this speculation has been empirically confirmed, nor explanations for their strong performances on machine translation tasks when “lacking positional information” have been explored. To this end, we propose a novel word reordering detection task to quantify how well the word order information learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning the positional information even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role. Although recurrence structure make the model more universally-effective on learning word order, learning objectives matter more in the downstream tasks such as machine translation.

pdf bib
Exploiting Sentential Context for Neural Machine Translation
Xing Wang | Zhaopeng Tu | Longyue Wang | Shuming Shi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this work, we present novel approaches to exploit sentential context for neural machine translation (NMT). Specifically, we show that a shallow sentential context extracted from the top encoder layer only, can improve translation performance via contextualizing the encoding representations of individual words. Next, we introduce a deep sentential context, which aggregates the sentential context representations from all of the internal layers of the encoder to form a more comprehensive context representation. Experimental results on the WMT14 English-German and English-French benchmarks show that our model consistently improves performance over the strong Transformer model, demonstrating the necessity and effectiveness of exploiting sentential context for NMT.

2018

pdf bib
Target Foresight Based Attention for Neural Machine Translation
Xintong Li | Lemao Liu | Zhaopeng Tu | Shuming Shi | Max Meng
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

In neural machine translation, an attention model is used to identify the aligned source words for a target word (target foresight word) in order to select translation context, but it does not make use of any information of this target foresight word at all. Previous work proposed an approach to improve the attention model by explicitly accessing this target foresight word and demonstrated the substantial gains in alignment task. However, this approach is useless in machine translation task on which the target foresight word is unavailable. In this paper, we propose a new attention model enhanced by the implicit information of target foresight word oriented to both alignment and translation tasks. Empirical experiments on Chinese-to-English and Japanese-to-English datasets show that the proposed attention model delivers significant improvements in terms of both alignment error rate and BLEU.

pdf bib
Towards Robust Neural Machine Translation
Yong Cheng | Zhaopeng Tu | Fandong Meng | Junjie Zhai | Yang Liu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Small perturbations in the input can severely distort intermediate representations and thus impact translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches can not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models.

pdf bib
Modeling Past and Future for Neural Machine Translation
Zaixiang Zheng | Hao Zhou | Shujian Huang | Lili Mou | Xinyu Dai | Jiajun Chen | Zhaopeng Tu
Transactions of the Association for Computational Linguistics, Volume 6

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated Past contents and untranslated Future contents, which are modeled by two additional recurrent layers. The Past and Future contents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with the knowledge of translated and untranslated contents. Experimental results show that the proposed approach significantly improves the performance in Chinese-English, German-English, and English-German translation tasks. Specifically, the proposed model outperforms the conventional coverage model in terms of both the translation quality and the alignment error rate.

pdf bib
Learning to Remember Translation History with a Continuous Cache
Zhaopeng Tu | Yang Liu | Shuming Shi | Tong Zhang
Transactions of the Association for Computational Linguistics, Volume 6

Existing neural machine translation (NMT) models generally translate sentences in isolation, missing the opportunity to take advantage of document-level information. In this work, we propose to augment NMT models with a very light-weight cache-like memory network, which stores recent hidden representations as translation history. The probability distribution over generated words is updated online depending on the translation history retrieved from the memory, endowing NMT models with the capability to dynamically adapt over time. Experiments on multiple domains with different topics and styles show the effectiveness of the proposed approach with negligible impact on the computational cost.

pdf bib
Multi-Head Attention with Disagreement Regularization
Jian Li | Zhaopeng Tu | Baosong Yang | Michael R. Lyu | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.

pdf bib
Learning to Jointly Translate and Predict Dropped Pronouns with a Shared Reconstruction Mechanism
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models. In this work, we improve the original model from two perspectives. First, we employ a shared reconstructor to better exploit encoder and decoder representations. Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model. Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy.

pdf bib
Exploiting Deep Representations for Neural Machine Translation
Zi-Yi Dou | Zhaopeng Tu | Xing Wang | Shuming Shi | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanisms. In addition, we introduce an auxiliary regularization term to encourage different layers to capture diverse information. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation data demonstrate the effectiveness and universality of the proposed approach.

pdf bib
Modeling Localness for Self-Attention Networks
Baosong Yang | Zhaopeng Tu | Derek F. Wong | Fandong Meng | Lidia S. Chao | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Self-attention networks have proven to be of profound value for its strength of capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances the ability of capturing useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the central and scope of the local region to be paid more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength of capturing long distance dependencies while enhance the ability of capturing short-range dependencies, we only apply localness modeling to lower layers of self-attention networks. Quantitative and qualitative analyses on Chinese-English and English-German translation tasks demonstrate the effectiveness and universality of the proposed approach.

2017

pdf bib
Modeling Source Syntax for Neural Machine Translation
Junhui Li | Deyi Xiong | Zhaopeng Tu | Muhua Zhu | Min Zhang | Guodong Zhou
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Even though a linguistics-free sequence to sequence model in neural machine translation (NMT) has certain capability of implicitly learning syntactic information of source sentences, this paper shows that source syntax can be explicitly incorporated into NMT effectively to provide further improvements. Specifically, we linearize parse trees of source sentences to obtain structural label sequences. On the basis, we propose three different sorts of encoders to incorporate source syntax into NMT: 1) Parallel RNN encoder that learns word and label annotation vectors parallelly; 2) Hierarchical RNN encoder that learns word and label annotation vectors in a two-level hierarchy; and 3) Mixed RNN encoder that stitchingly learns word and label annotation vectors over sequences where words and labels are mixed. Experimentation on Chinese-to-English translation demonstrates that all the three proposed syntactic encoders are able to improve translation accuracy. It is interesting to note that the simplest RNN encoder, i.e., Mixed RNN encoder yields the best performance with an significant improvement of 1.4 BLEU points. Moreover, an in-depth analysis from several perspectives is provided to reveal how source syntax benefits NMT.

pdf bib
Chunk-Based Bi-Scale Decoder for Neural Machine Translation
Hao Zhou | Zhaopeng Tu | Shujian Huang | Xiaohua Liu | Hang Li | Jiajun Chen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In typical neural machine translation (NMT), the decoder generates a sentence word by word, packing all linguistic granularities in the same time-scale of RNN. In this paper, we propose a new type of decoder for NMT, which splits the decode state into two parts and updates them in two different time-scales. Specifically, we first predict a chunk time-scale state for phrasal modeling, on top of which multiple word time-scale states are generated. In this way, the target sentence is translated hierarchically from chunks to words, with information in different granularities being leveraged. Experiments show that our proposed model significantly improves the translation performance over the state-of-the-art NMT model.

pdf bib
Translating Phrases in Neural Machine Translation
Xing Wang | Zhaopeng Tu | Deyi Xiong | Min Zhang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Phrases play an important role in natural language understanding and machine translation (Sag et al., 2002; Villavicencio et al., 2005). However, it is difficult to integrate them into current neural machine translation (NMT) which reads and generates sentences word by word. In this work, we propose a method to translate phrases in NMT by integrating a phrase memory storing target phrases from a phrase-based statistical machine translation (SMT) system into the encoder-decoder architecture of NMT. At each decoding step, the phrase memory is first re-written by the SMT model, which dynamically generates relevant target phrases with contextual information provided by the NMT model. Then the proposed model reads the phrase memory to make probability estimations for all phrases in the phrase memory. If phrase generation is carried on, the NMT decoder selects an appropriate phrase from the memory to perform phrase translation and updates its decoding state by consuming the words in the selected phrase. Otherwise, the NMT decoder generates a word from the vocabulary as the general NMT decoder does. Experiment results on the Chinese to English translation show that the proposed model achieves significant improvements over the baseline on various test sets.

pdf bib
Exploiting Cross-Sentence Context for Neural Machine Translation
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In translation, considering the document as a whole can help to resolve ambiguities and inconsistencies. In this paper, we propose a cross-sentence context-aware approach and investigate the influence of historical contextual information on the performance of neural machine translation (NMT). First, this history is summarized in a hierarchical way. We then integrate the historical representation into NMT in two strategies: 1) a warm-start of encoder and decoder states, and 2) an auxiliary context source for updating decoder states. Experimental results on a large Chinese-English translation task show that our approach significantly improves upon a strong attention-based NMT system by up to +2.1 BLEU points.

pdf bib
Semantics-Enhanced Task-Oriented Dialogue Translation: A Case Study on Hotel Booking
Longyue Wang | Jinhua Du | Liangyou Li | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the IJCNLP 2017, System Demonstrations

We showcase TODAY, a semantics-enhanced task-oriented dialogue translation system, whose novelties are: (i) task-oriented named entity (NE) definition and a hybrid strategy for NE recognition and translation; and (ii) a novel grounded semantic method for dialogue understanding and task-order management. TODAY is a case-study demo which can efficiently and accurately assist customers and agents in different languages to reach an agreement in a dialogue for the hotel booking.

pdf bib
Context Gates for Neural Machine Translation
Zhaopeng Tu | Yang Liu | Zhengdong Lu | Xiaohua Liu | Hang Li
Transactions of the Association for Computational Linguistics, Volume 5

In neural machine translation (NMT), generation of a target word depends on both source and target contexts. We find that source contexts have a direct impact on the adequacy of a translation while target contexts affect the fluency. Intuitively, generation of a content word should rely more on the source context and generation of a functional word should rely more on the target context. Due to the lack of effective control over the influence from source and target contexts, conventional NMT tends to yield fluent but inadequate translations. To address this problem, we propose context gates which dynamically control the ratios at which source and target contexts contribute to the generation of target words. In this way, we can enhance both the adequacy and fluency of NMT with more careful control of the information flow from contexts. Experiments show that our approach significantly improves upon a standard attention-based NMT system by +2.3 BLEU points.

2016

pdf bib
Modeling Coverage for Neural Machine Translation
Zhaopeng Tu | Zhengdong Lu | Yang Liu | Xiaohua Liu | Hang Li
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Automatic Construction of Discourse Corpora for Dialogue Translation
Longyue Wang | Xiaojun Zhang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, a novel approach is proposed to automatically construct parallel discourse corpus for dialogue machine translation. Firstly, the parallel subtitle data and its corresponding monolingual movie script data are crawled and collected from Internet. Then tags such as speaker and discourse boundary from the script data are projected to its subtitle data via an information retrieval approach in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results, but also integrate speaker information into the translation. Experiments show our proposed method can achieve 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, and speaker-based language model adaptation can obtain around 0.5 BLEU points improvement in translation qualities. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.

pdf bib
A Novel Approach to Dropped Pronoun Translation
Longyue Wang | Zhaopeng Tu | Xiaojun Zhang | Hang Li | Andy Way | Qun Liu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

pdf bib
Context-Dependent Translation Selection Using Convolutional Neural Network
Baotian Hu | Zhaopeng Tu | Zhengdong Lu | Hang Li | Qingcai Chen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2013

pdf bib
A Novel Graph-based Compact Representation of Word Alignment
Qun Liu | Zhaopeng Tu | Shouxun Lin
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Using Syntactic Head Information in Hierarchical Phrase-Based Translation
Junhui Li | Zhaopeng Tu | Guodong Zhou | Josef van Genabith
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Combining Multiple Alignments to Improve Machine Translation
Zhaopeng Tu | Yang Liu | Yifan He | Josef van Genabith | Qun Liu | Shouxun Lin
Proceedings of COLING 2012: Posters

pdf bib
Head-Driven Hierarchical Phrase-based Translation
Junhui Li | Zhaopeng Tu | Guodong Zhou | Josef van Genabith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification
Zhaopeng Tu | Yifan He | Jennifer Foster | Josef van Genabith | Qun Liu | Shouxun Lin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

pdf bib
Extracting Hierarchical Rules from a Weighted Alignment Matrix
Zhaopeng Tu | Yang Liu | Qun Liu | Shouxun Lin
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
Dependency Forest for Statistical Machine Translation
Zhaopeng Tu | Yang Liu | Young-Sook Hwang | Qun Liu | Shouxun Lin
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
The ICT statistical machine translation system for the IWSLT 2009
Haitao Mi | Yang Li | Tian Xia | Xinyan Xiao | Yang Feng | Jun Xie | Hao Xiong | Zhaopeng Tu | Daqi Zheng | Yanjuan Lu | Qun Liu
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the ICT Statistical Machine Translation systems that used in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2009. For this year’s evaluation, we participated in the Challenge Task (Chinese-English and English-Chinese) and BTEC Task (Chinese-English). And we mainly focus on one new method to improve single system’s translation quality. Specifically, we developed a sentence-similarity based development set selection technique. For each task, we finally submitted the single system who got the maximum BLEU scores on the selected development set. The four single translation systems are based on different techniques: a linguistically syntax-based system, two formally syntax-based systems and a phrase-based system. Typically, we didn’t use any rescoring or system combination techniques in this year’s evaluation.