Dacheng Tao


2022

pdf bib
On the Complementarity between Pre-Training and Random-Initialization for Resource-Rich Machine Translation
Changtong Zan | Liang Ding | Li Shen | Yu Cao | Weifeng Liu | Dacheng Tao
Proceedings of the 29th International Conference on Computational Linguistics

Pre-Training (PT) of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains (some- times, even worse) on resource-rich NMT on par with its Random-Initialization (RI) counterpart. We take the first step to investigate the complementarity between PT and RI in resource-rich scenarios via two probing analyses, and find that: 1) PT improves NOT the accuracy, but the generalization by achieving flatter loss landscapes than that of RI; 2) PT improves NOT the confidence of lexical choice, but the negative diversity by assigning smoother lexical probability distributions than that of RI. Based on these insights, we propose to combine their complementarities with a model fusion algorithm that utilizes optimal transport to align neurons between PT and RI. Experiments on two resource-rich translation benchmarks, WMT’17 English-Chinese (20M) and WMT’19 English-German (36M), show that PT and RI could be nicely complementary to each other, achieving substantial improvements considering both translation accuracy, generalization, and negative diversity. Probing tools and code are released at: https://github.com/zanchangtong/PTvsRI.

pdf bib
A Contrastive Cross-Channel Data Augmentation Framework for Aspect-Based Sentiment Analysis
Bing Wang | Liang Ding | Qihuang Zhong | Ximing Li | Dacheng Tao
Proceedings of the 29th International Conference on Computational Linguistics

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task, which focuses on detecting the sentiment polarity towards the aspect in a sentence. However, it is always sensitive to the multi-aspect challenge, where features of multiple aspects in a sentence will affect each other. To mitigate this issue, we design a novel training framework, called Contrastive Cross-Channel Data Augmentation (C3 DA), which leverages an in-domain generator to construct more multi-aspect samples and then boosts the robustness of ABSA models via contrastive learning on these generated data. In practice, given a generative pretrained language model and some limited ABSA labeled data, we first employ some parameter-efficient approaches to perform the in-domain fine-tuning. Then, the obtained in-domain generator is used to generate the synthetic sentences from two channels, i.e., Aspect Augmentation Channel and Polarity Augmentation Channel, which generate the sentence condition on a given aspect and polarity respectively. Specifically, our C3 DA performs the sentence generation in a cross-channel manner to obtain more sentences, and proposes an Entropy-Minimization Filter to filter low-quality generated samples. Extensive experiments show that our C3 DA can outperform those baselines without any augmentations by about 1% on accuracy and Macro- F1. Code and data are released in https://github.com/wangbing1416/C3DA.

pdf bib
Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation
Liang Ding | Longyue Wang | Shuming Shi | Dacheng Tao | Zhaopeng Tu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge distillation (KD) is the preliminary step for training non-autoregressive translation (NAT) models, which eases the training of NAT models at the cost of losing important information for translating low-frequency words. In this work, we provide an appealing alternative for NAT – monolingual KD, which trains NAT student on external monolingual data with AT teacher trained on the original bilingual data. Monolingual KD is able to transfer both the knowledge of the original bilingual data (implicitly encoded in the trained AT teacher model) and that of the new monolingual data to the NAT student model. Extensive experiments on eight WMT benchmarks over two advanced NAT models show that monolingual KD consistently outperforms the standard KD by improving low-frequency word translation, without introducing any computational cost. Monolingual KD enjoys desirable expandability, which can be further enhanced (when given more computational budget) by combining with the standard KD, a reverse monolingual KD, or enlarging the scale of monolingual data. Extensive analyses demonstrate that these techniques can be used together profitably to further recall the useful information lost in the standard KD. Encouragingly, combining with standard KD, our approach achieves 30.4 and 34.1 BLEU points on the WMT14 English-German and German-English datasets, respectively. Our code and trained models are freely available at https://github.com/alphadl/RLFW-NAT.mono.

pdf bib
A Model-agnostic Data Manipulation Method for Persona-based Dialogue Generation
Yu Cao | Wei Bi | Meng Fang | Shuming Shi | Dacheng Tao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Towards building intelligent dialogue agents, there has been a growing interest in introducing explicit personas in generation models. However, with limited persona-based dialogue data at hand, it may be difficult to train a dialogue generation model well. We point out that the data challenges of this generation task lie in two aspects: first, it is expensive to scale up current persona-based dialogue datasets; second, each data sample in this task is more complex to learn with than conventional dialogue data. To alleviate the above data issues, we propose a data manipulation method, which is model-agnostic to be packed with any persona-based dialogue generation model to improve their performance. The original training samples will first be distilled and thus expected to be fitted more easily. Next, we show various effective ways that can diversify such easier distilled data. A given base model will then be trained via the constructed data curricula, i.e. first on augmented distilled samples and then on original ones. Experiments illustrate the superiority of our method with two strong base dialogue models (Transformer encoder-decoder and GPT2).

2021

pdf bib
Improving Neural Machine Translation by Bidirectional Training
Liang Ding | Di Wu | Dacheng Tao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present a simple and effective pretraining strategy – bidirectional training (BiT) for neural machine translation. Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally. To achieve bidirectional updating, we simply reconstruct the training samples from “srctgt” to “src+tgttgt+src” without any complicated model modifications. Notably, our approach does not increase any parameters or training steps, requiring the parallel data merely. Experimental results show that BiT pushes the SOTA neural machine translation performance across 15 translation tasks on 8 language pairs (data sizes range from 160K to 38M) significantly higher. Encouragingly, our proposed model can complement existing data manipulation strategies, i.e. back translation, data distillation, and data diversification. Extensive analyses show that our approach functions as a novel bilingual code-switcher, obtaining better bilingual alignment.

pdf bib
Progressive Multi-Granularity Training for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
The USYD-JD Speech Translation System for IWSLT2021
Liang Ding | Dacheng Tao
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

This paper describes the University of Sydney & JD’s joint submission of the IWSLT 2021 low resource speech translation task. We participated in the Swahili->English direction and got the best scareBLEU (25.3) score among all the participants. Our constrained system is based on a pipeline framework, i.e. ASR and NMT. We trained our models with the officially provided ASR and MT datasets. The ASR system is based on the open-sourced tool Kaldi and this work mainly explores how to make the most of the NMT models. To reduce the punctuation errors generated by the ASR model, we employ our previous work SlotRefine to train a punctuation correction model. To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking, and transductive finetuning. For model structure, we tried auto-regressive and non-autoregressive models, respectively. In addition, we proposed two novel pre-train approaches, i.e. de-noising training and bidirectional training to fully exploit the data. Extensive experiments show that adding the above techniques consistently improves the BLEU scores, and the final submission system outperforms the baseline (Transformer ensemble model trained with the original parallel data) by approximately 10.8 BLEU score, achieving the SOTA performance.

pdf bib
Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models. However, there exists a discrepancy on low-frequency words between the distilled and the original data, leading to more errors on predicting low-frequency words. To alleviate the problem, we directly expose the raw data into NAT by leveraging pretraining. By analyzing directed alignments, we found that KD makes low-frequency source words aligned with targets more deterministically but fails to align sufficient low-frequency words from target to source. Accordingly, we propose reverse KD to rejuvenate more alignments for low-frequency target words. To make the most of authentic and synthetic data, we combine these complementary approaches as a new training strategy for further boosting NAT performance. We conduct experiments on five translation benchmarks over two advanced architectures. Results demonstrate that the proposed approach can significantly and universally improve translation quality by reducing translation errors on low-frequency words. Encouragingly, our approach achieves 28.2 and 33.9 BLEU points on the WMT14 English-German and WMT16 Romanian-English datasets, respectively. Our code, data, and trained models are available at https://github.com/longyuewangdcu/RLFW-NAT.

2020

pdf bib
Self-Attention with Cross-Lingual Position Representation
Liang Ding | Longyue Wang | Dacheng Tao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Position encoding (PE), an essential part of self-attention networks (SANs), is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences. However, in cross-lingual scenarios, machine translation, the PEs of source and target sentences are modeled independently. Due to word order divergences in different languages, modeling the cross-lingual positional relationships might help SANs tackle this problem. In this paper, we augment SANs with cross-lingual position representations to model the bilingually aware latent structure for the input sentence. Specifically, we utilize bracketing transduction grammar (BTG)-based reordering information to encourage SANs to learn bilingual diagonal alignments. Experimental results on WMT’14 EnglishGerman, WAT’17 JapaneseEnglish, and WMT’17 ChineseEnglish translation tasks demonstrate that our approach significantly and consistently improves translation quality over strong baselines. Extensive analyses confirm that the performance gains come from the cross-lingual information.

pdf bib
Context-Aware Cross-Attention for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Di Wu | Dacheng Tao | Zhaopeng Tu
Proceedings of the 28th International Conference on Computational Linguistics

Non-autoregressive translation (NAT) significantly accelerates the inference process by predicting the entire target sequence. However, due to the lack of target dependency modelling in the decoder, the conditional generation process heavily depends on the cross-attention. In this paper, we reveal a localness perception problem in NAT cross-attention, for which it is difficult to adequately capture source context. To alleviate this problem, we propose to enhance signals of neighbour source tokens into conventional cross-attention. Experimental results on several representative datasets show that our approach can consistently improve translation quality over strong NAT baselines. Extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.

pdf bib
Pretrained Language Models for Dialogue Generation with Multiple Input Sources
Yu Cao | Wei Bi | Meng Fang | Dacheng Tao
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale pretrained language models have achieved outstanding performance on natural language understanding tasks. However, it is still under investigating how to apply them to dialogue generation tasks, especially those with responses conditioned on multiple sources. Previous work simply concatenates all input sources or averages information from different input sources. In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2. We explore various methods to fuse multiple separate attention information corresponding to different sources. Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines.

2019

pdf bib
BAG: Bi-directional Attention Entity Graph Convolutional Network for Multi-hop Reasoning Question Answering
Yu Cao | Meng Fang | Dacheng Tao
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Multi-hop reasoning question answering requires deep comprehension of relationships between various documents and queries. We propose a Bi-directional Attention Entity Graph Convolutional Network (BAG), leveraging relationships between nodes in an entity graph and attention information between a query and the entity graph, to solve this task. Graph convolutional networks are used to obtain a relation-aware representation of nodes for entity graphs built from documents with multi-level features. Bidirectional attention is then applied on graphs and queries to generate a query-aware nodes representation, which will be used for the final prediction. Experimental evaluation shows BAG achieves state-of-the-art accuracy performance on the QAngaroo WIKIHOP dataset.

pdf bib
The University of Sydney’s Machine Translation System for WMT19
Liang Ding | Dacheng Tao
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the University of Sydney’s submission of the WMT 2019 shared news translation task. We participated in the Finnish->English direction and got the best BLEU(33.0) score among all the participants. Our system is based on the self-attentional Transformer networks, into which we integrated the most recent effective strategies from academic research (e.g., BPE, back translation, multi-features data selection, data augmentation, greedy model ensemble, reranking, ConMBR system combination, and postprocessing). Furthermore, we propose a novel augmentation method Cycle Translation and a data mixture strategy Big/Small parallel construction to entirely exploit the synthetic corpus. Extensive experiments show that adding the above techniques can make continuous improvements of the BLEU scores, and the best result outperforms the baseline (Transformer ensemble model trained with the original parallel corpus) by approximately 5.3 BLEU score, achieving the state-of-the-art performance.