Weixuan Wang


2022

pdf bib
Cross-lingual Feature Extraction from Monolingual Corpora for Low-resource Unsupervised Bilingual Lexicon Induction
Zihao Feng | Hailong Cao | Tiejun Zhao | Weixuan Wang | Wei Peng
Proceedings of the 29th International Conference on Computational Linguistics

Despite their progress in high-resource language settings, unsupervised bilingual lexicon induction (UBLI) models often fail on corpora with low-resource distant language pairs due to insufficient initialization. In this work, we propose a cross-lingual feature extraction (CFE) method to learn the cross-lingual features from monolingual corpora for low-resource UBLI, enabling representations of words with the same meaning leveraged by the initialization step. By integrating cross-lingual representations with pre-trained word embeddings in a fully unsupervised initialization on UBLI, the proposed method outperforms existing state-of-the-art methods on low-resource language pairs (EN-VI, EN-TH, EN-ZH, EN-JA). The ablation study also proves that the learned cross-lingual features can enhance the representational ability and robustness of the existing embedding model.

pdf bib
Huawei BabelTar NMT at WMT22 Biomedical Translation Task: How We Further Improve Domain-specific NMT
Weixuan Wang | Xupeng Meng | Suqing Yan | Ye Tian | Wei Peng
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes Huawei Artificial Intelligence Application Research Center’s neural machine translation system (“BabelTar”). Our submission to the WMT22 biomedical translation shared task covers language directions between English and the other seven languages (French, German, Italian, Spanish, Portuguese, Russian, and Chinese). During the past four years, our participation in this domain-specific track has witnessed a paradigm shift of methodology from a purely data-driven focus to embracing diversified techniques, including pre-trained multilingual NMT models, homograph disambiguation, ensemble learning, and preprocessing methods. We illustrate practical insights and measured performance improvements relating to how we further improve our domain-specific NMT system.

2021

pdf bib
Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings
Weixuan Wang | Wei Peng | Meng Zhang | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural Machine Translation (NMT) has shown a strong ability to utilize local context to disambiguate the meaning of words. However, it remains a challenge for NMT to leverage broader context information like topics. In this paper, we propose heterogeneous ways of embedding topic information at the sentence level into an NMT model to improve translation performance. Specifically, the topic information can be incorporated as pre-encoder topic embedding, post-encoder topic embedding, and decoder topic embedding to increase the likelihood of selecting target words from the same topic of the source sentence. Experimental results show that NMT models with the proposed topic knowledge embedding outperform the baselines on the English -> German and English -> French translation tasks.

pdf bib
Huawei AARC’s Submissions to the WMT21 Biomedical Translation Task: Domain Adaption from a Practical Perspective
Weixuan Wang | Wei Peng | Xupeng Meng | Qun Liu
Proceedings of the Sixth Conference on Machine Translation

This paper describes Huawei Artificial Intelligence Application Research Center’s neural machine translation systems and submissions to the WMT21 biomedical translation shared task. Four of the submissions achieve state-of-the-art BLEU scores based on the official-released automatic evaluation results (EN->FR, EN<->IT and ZH->EN). We perform experiments to unveil the practical insights of the involved domain adaptation techniques, including finetuning order, terminology dictionaries, and ensemble decoding. Issues associated with overfitting and under-translation are also discussed.