Shuangzhi Wu


Learning Confidence for Transformer-based Neural Machine Translation
Yu Lu | Jiali Zeng | Jiajun Zhang | Shuangzhi Wu | Mu Li
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Confidence estimation aims to quantify the confidence of the model prediction, providing an expectation of success. A well-calibrated confidence estimate enables accurate failure prediction and proper risk measurement when given noisy samples and out-of-distribution data in real-world settings. However, this task remains a severe challenge for neural machine translation (NMT), where probabilities from the softmax distribution fail to indicate when the model is probably mistaken. To address this problem, we propose to learn an unsupervised confidence estimate jointly with the training of the NMT model. We interpret confidence as the number of hints the NMT model needs to make a correct prediction: more hints indicate lower confidence. Specifically, the NMT model is given the option to ask for hints to improve translation accuracy at the cost of a slight penalty. Then, we approximate its level of confidence by counting the number of hints the model uses. We demonstrate that our learned confidence estimate achieves high accuracy on extensive sentence/word-level quality estimation tasks. Analytical results verify that our confidence estimate can correctly assess underlying risk in two real-world scenarios: (1) discovering noisy samples and (2) detecting out-of-domain data. We further propose a novel confidence-based instance-specific label smoothing approach based on our learned confidence estimate, which outperforms standard label smoothing.
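
The hint-and-penalty idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the interpolation with the one-hot target, the `-log(confidence)` penalty, and the names `hinted_loss` and `penalty_weight` are assumptions chosen for illustration, and `confidence` is assumed to lie in (0, 1].

```python
import math

def hinted_loss(probs, target, confidence, penalty_weight=0.1):
    """Loss on hint-adjusted probabilities plus a penalty for asking for hints.

    Hypothetical illustration: a confidence of 1.0 means no hints are used,
    lower confidence mixes more of the ground truth into the prediction.
    """
    # Interpolate the prediction with the one-hot target (the "hint").
    hinted = [
        confidence * p + (1.0 - confidence) * (1.0 if i == target else 0.0)
        for i, p in enumerate(probs)
    ]
    # Negative log-likelihood of the correct token after hinting.
    nll = -math.log(hinted[target])
    # Asking for hints (low confidence) costs a penalty.
    hint_penalty = -math.log(confidence)
    return nll + penalty_weight * hint_penalty
```

With a small penalty weight, an uncertain prediction is cheaper to fix with hints than to leave unaided, which is what pushes the learned confidence down on hard tokens.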

Modeling Multi-Granularity Hierarchical Features for Relation Extraction
Xinnian Liang | Shuangzhi Wu | Mu Li | Zhoujun Li
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Relation extraction (RE) is a key task in Natural Language Processing (NLP), which aims to extract relations between entity pairs from given texts. Recently, RE has achieved remarkable progress with the development of deep neural networks. Most existing research focuses on constructing explicit structured features using external knowledge such as knowledge graphs and dependency trees. In this paper, we propose a novel method to extract multi-granularity features based solely on the original input sentences. We show that effective structured features can be attained even without external knowledge. Three kinds of features based on the input sentences are fully exploited: the entity mention level, the segment level, and the sentence level. All three are jointly and hierarchically modeled. We evaluate our method on three public benchmarks: SemEval 2010 Task 8, Tacred, and Tacred Revisited. To verify its effectiveness, we apply our method to different encoders such as LSTM and BERT. Experimental results show that our method significantly outperforms existing state-of-the-art models, even those that use external knowledge. Extensive analyses demonstrate that the performance of our model comes from capturing multi-granularity features and modeling their hierarchical structure.

Task-guided Disentangled Tuning for Pretrained Language Models
Jiali Zeng | Yufan Jiang | Shuangzhi Wu | Yongjing Yin | Mu Li
Findings of the Association for Computational Linguistics: ACL 2022

Pretrained language models (PLMs) trained on large-scale unlabeled corpora are typically fine-tuned on task-specific downstream datasets, an approach that has produced state-of-the-art results on various NLP tasks. However, the discrepancy in data domain and scale prevents fine-tuning from efficiently capturing task-specific patterns, especially in the low-data regime. To address this issue, we propose Task-guided Disentangled Tuning (TDT) for PLMs, which enhances the generalization of representations by disentangling task-relevant signals from the entangled representations. For a given task, we introduce a learnable confidence model to detect indicative guidance from context, and further propose a disentangled regularization to mitigate the over-reliance problem. Experimental results on the GLUE and CLUE benchmarks show that TDT gives consistently better results than fine-tuning with different PLMs, and extensive analysis demonstrates the effectiveness and robustness of our method. Code is available at

An Efficient Coarse-to-Fine Facet-Aware Unsupervised Summarization Framework Based on Semantic Blocks
Xinnian Liang | Jing Li | Shuangzhi Wu | Jiali Zeng | Yufan Jiang | Mu Li | Zhoujun Li
Proceedings of the 29th International Conference on Computational Linguistics

Unsupervised summarization methods have achieved remarkable results by incorporating representations from pre-trained language models. However, existing methods fail to achieve efficiency and effectiveness at the same time when the input document is extremely long. To tackle this problem, we propose an efficient Coarse-to-Fine Facet-Aware Ranking (C2F-FAR) framework for unsupervised long document summarization, which is based on the semantic block. A semantic block is a run of consecutive sentences in the document that describe the same facet. Specifically, we convert the one-step ranking method into a hierarchical, multi-granularity, two-stage ranking. In the coarse-level stage, we propose a new segmentation algorithm to split the document into facet-aware semantic blocks and then filter out insignificant blocks. In the fine-level stage, we select salient sentences in each block and then extract the final summary from the selected sentences. We evaluate our framework on four long document summarization datasets: Gov-Report, BillSum, arXiv, and PubMed. Our C2F-FAR achieves new state-of-the-art unsupervised summarization results on Gov-Report and BillSum. In addition, our method is 4-28 times faster than previous methods.

Towards Modeling Role-Aware Centrality for Dialogue Summarization
Xinnian Liang | Chao Bian | Shuangzhi Wu | Zhoujun Li
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing

Role-oriented dialogue summarization generates summaries for different roles in a dialogue (e.g., doctor and patient). Existing methods consider roles separately, so interactions among different roles are not fully explored. In this paper, we propose a novel Role-Aware Centrality (RAC) model to capture role interactions, which can be easily applied to any seq2seq model. The RAC assigns each role a specific sentence-level centrality score by using role prompts to control what kind of summary to generate. The RAC measures both the importance of utterances and the relevance between roles and utterances. We then use RAC to re-weight context representations, which are used by the decoder to generate role summaries. We verify RAC on two public benchmark datasets, CSDS and MC. Experimental results show that the proposed method achieves new state-of-the-art results on the two datasets. Extensive analyses demonstrate that role-aware centrality helps generate more precise summaries.


Tencent Translation System for the WMT21 News Translation Task
Longyue Wang | Mu Li | Fangxu Liu | Shuming Shi | Zhaopeng Tu | Xing Wang | Shuangzhi Wu | Jiali Zeng | Wen Zhang
Proceedings of the Sixth Conference on Machine Translation

This paper describes Tencent Translation systems for the WMT21 shared task. We participate in the news translation task on three language pairs: Chinese-English, English-Chinese and German-English. Our systems are built on various Transformer models with novel techniques adapted from our recent research work. First, we combine different data augmentation methods including back-translation, forward-translation and right-to-left training to enlarge the training data. We also apply language coverage bias, data rejuvenation and uncertainty-based sampling approaches to select content-relevant and high-quality data from large parallel and monolingual corpora. In addition to in-domain fine-tuning, we propose a fine-grained “one model one domain” approach to model characteristics of different news genres at the fine-tuning and decoding stages. Besides, we use a greedy-based ensemble algorithm and a transductive ensemble method to further boost our systems. Building on our success in the last WMT, we continue to employ advanced techniques such as large batch training, data selection and data filtering. Finally, our constrained Chinese-English system achieves a case-sensitive BLEU score of 33.4, which is the highest among all submissions. The German-English system is ranked second.

Attention Calibration for Transformer in Neural Machine Translation
Yu Lu | Jiali Zeng | Jiajun Zhang | Shuangzhi Wu | Mu Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Attention mechanisms have achieved substantial improvements in neural machine translation by dynamically selecting relevant inputs for different predictions. However, recent studies have questioned the attention mechanisms’ capability for discovering decisive inputs. In this paper, we propose to calibrate the attention weights by introducing a mask perturbation model that automatically evaluates each input’s contribution to the model outputs. We increase the attention weights assigned to the indispensable tokens, whose removal leads to a dramatic performance decrease. Extensive experiments on Transformer-based translation demonstrate the effectiveness of our model. We further find that the calibrated attention weights are more uniform at lower layers, where they collect diverse information, and more concentrated on specific inputs at higher layers. Detailed analyses also show a great need for calibration of attention weights with high entropy, where the model is unconfident about its decision.

Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context
Xinnian Liang | Shuangzhi Wu | Mu Li | Zhoujun Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Embedding-based methods are widely used for unsupervised keyphrase extraction (UKE) tasks. Generally, these methods simply calculate similarities between phrase embeddings and the document embedding, which is insufficient to capture the different contexts needed for a more effective UKE model. In this paper, we propose a novel method for UKE in which local and global contexts are jointly modeled. From a global view, we calculate the similarity between a given phrase and the whole document in the vector space, as traditional embedding-based models do. From a local view, we first build a graph structure based on the document, where phrases are regarded as vertices and the edges are similarities between vertices. Then, we propose a new centrality computation method to capture local salient information based on the graph structure. Finally, we combine the modeling of global and local context for ranking. We evaluate our models on three public benchmarks (Inspec, DUC 2001, SemEval 2010) and compare with existing state-of-the-art models. The results show that our model outperforms most models while generalizing better to input documents of different domains and lengths. An additional ablation study shows that both local and global information are crucial for unsupervised keyphrase extraction tasks.
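
The combination of a global view and a local view described in this abstract can be sketched as follows. This is an illustration only: it substitutes plain degree centrality for the paper's centrality method, multiplies the two scores as an assumed way of combining them, and uses hypothetical names (`rank_phrases`, `phrase_vecs`, `doc_vec`) with embeddings passed in as plain lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_phrases(phrase_vecs, doc_vec):
    """Rank candidate phrases by global similarity times local centrality."""
    names = list(phrase_vecs)
    scores = {}
    for name in names:
        # Global view: similarity between the phrase and the whole document.
        global_sim = cosine(phrase_vecs[name], doc_vec)
        # Local view: degree centrality in a graph whose vertices are phrases
        # and whose edge weights are phrase-to-phrase similarities.
        local_centrality = sum(
            cosine(phrase_vecs[name], phrase_vecs[other])
            for other in names
            if other != name
        )
        # Assumed combination: product of the two scores.
        scores[name] = global_sim * local_centrality
    return sorted(names, key=lambda n: scores[n], reverse=True)
```

A phrase that is close to the document vector but isolated from the other candidates is ranked below one that is both close to the document and well connected in the phrase graph.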

Recurrent Attention for Neural Machine Translation
Jiali Zeng | Shuangzhi Wu | Yongjing Yin | Yufan Jiang | Mu Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent research questions the importance of dot-product self-attention in Transformer models and shows that most attention heads learn simple positional patterns. In this paper, we push this line of research further and propose a novel substitute mechanism for self-attention: Recurrent AtteNtion (RAN). RAN directly learns attention weights without any token-to-token interaction and further improves their capacity through layer-to-layer interaction. Across an extensive set of experiments on 10 machine translation tasks, we find that RAN models are competitive and outperform their Transformer counterparts in certain scenarios, with fewer parameters and less inference time. In particular, applying RAN to the Transformer decoder brings consistent improvements of about +0.5 BLEU on 6 translation tasks and +1.0 BLEU on the Turkish-English translation task. In addition, we conduct extensive analysis of the attention weights of RAN to confirm that they are reasonable. Our RAN is a promising alternative for building more effective and efficient NMT models.

Improving Unsupervised Extractive Summarization with Facet-Aware Modeling
Xinnian Liang | Shuangzhi Wu | Mu Li | Zhoujun Li
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


Robust Machine Reading Comprehension by Learning Soft labels
Zhenyu Zhao | Shuangzhi Wu | Muyun Yang | Kehai Chen | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics

Neural models, which are typically trained on hard labels, have achieved great success on the task of machine reading comprehension (MRC). We argue that hard labels limit the model's ability to generalize due to the label sparseness problem. In this paper, we propose a robust training method for MRC models to address this problem. Our method consists of three strategies: 1) label smoothing, 2) word overlapping, and 3) distribution prediction. All of them help to train models on soft labels. We validate our approach on a representative architecture, ALBERT. Experimental results show that our method boosts the baseline by 1% on average and achieves state-of-the-art performance on NewsQA and QUOREF.
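
Of the three strategies listed in this abstract, label smoothing has a standard form that is easy to show. The sketch below is the textbook version, not the paper's specific variant; the function name `smooth_labels` and the default `epsilon` are chosen here for illustration.

```python
def smooth_labels(target, num_classes, epsilon=0.1):
    """Turn a hard label into a soft distribution via standard label smoothing.

    The correct class keeps 1 - epsilon of the mass, and epsilon is spread
    uniformly over all classes (including the correct one).
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == target else uniform
        for i in range(num_classes)
    ]
```

For an MRC model, `target` could be the gold answer start position and `num_classes` the passage length, so that nearby positions are no longer assigned exactly zero probability.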

Emotion Classification by Jointly Learning to Lexiconize and Classify
Deyu Zhou | Shuangzhi Wu | Qing Wang | Jun Xie | Zhaopeng Tu | Mu Li
Proceedings of the 28th International Conference on Computational Linguistics

Emotion lexicons have been shown to be effective for emotion classification (Baziotis et al., 2018). Previous studies handle emotion lexicon construction and emotion classification separately. In this paper, we propose an emotional network (EmNet) to jointly learn sentence emotions and construct emotion lexicons that are dynamically adapted to a given context. The dynamic emotion lexicons are useful for handling words that express different emotions in different contexts, which can effectively improve classification accuracy. We validate the approach on two representative architectures, LSTM and BERT, demonstrating its superiority in identifying emotions in Tweets. Our model outperforms several approaches proposed in previous studies and achieves a new state of the art on the benchmark Twitter dataset.

Tencent Neural Machine Translation Systems for the WMT20 News Translation Task
Shuangzhi Wu | Xing Wang | Longyue Wang | Fangxu Liu | Jun Xie | Zhaopeng Tu | Shuming Shi | Mu Li
Proceedings of the Fifth Conference on Machine Translation

This paper describes Tencent Neural Machine Translation systems for the WMT 2020 news translation tasks. We participate in the shared news translation task on the English-Chinese and English-German language pairs. Our systems are built on deep Transformer models and several data augmentation methods. We propose a boosted in-domain fine-tuning method to improve single models. Ensembling is used to combine single models, and we propose an iterative transductive ensemble method that further improves translation performance based on the ensemble results. We achieve a BLEU score of 36.8 and the highest chrF score of 0.648 on the Chinese-English task.


Sequence-to-Dependency Neural Machine Translation
Shuangzhi Wu | Dongdong Zhang | Nan Yang | Mu Li | Ming Zhou
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Nowadays a typical Neural Machine Translation (NMT) model generates translations from left to right as a linear sequence, during which latent syntactic structures of the target sentences are not explicitly considered. Inspired by the success of using syntactic knowledge of the target language to improve statistical machine translation, in this paper we propose a novel Sequence-to-Dependency Neural Machine Translation (SD-NMT) method, in which the target word sequence and its corresponding dependency structure are jointly constructed and modeled, and this structure is used as context to facilitate word generation. Experimental results show that the proposed method significantly outperforms state-of-the-art baselines on Chinese-English and Japanese-English translation tasks.


Efficient Disfluency Detection with Transition-based Parsing
Shuangzhi Wu | Dongdong Zhang | Ming Zhou | Tiejun Zhao
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)


Punctuation Prediction with Transition-based Parsing
Dongdong Zhang | Shuangzhi Wu | Nan Yang | Mu Li
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)