Ming Chen


2024

pdf bib
Revisiting the Self-Consistency Challenges in Multi-Choice Question Formats for Large Language Model Evaluation
Wenjie Zhou | Qiang Wang | Mingzhou Xu | Ming Chen | Xiangyu Duan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multi-choice questions (MCQ) are a common method for assessing the world knowledge of large language models (LLMs), demonstrated by benchmarks such as MMLU and C-Eval. However, recent findings indicate that even top-tier LLMs, such as ChatGPT and GPT4, might display inconsistencies when faced with slightly varied inputs. This raises concerns about the credibility of MCQ-based evaluations. To address this issue, we introduced three knowledge-equivalent question variants: option position shuffle, option label replacement, and conversion to a True/False format. We rigorously tested a range of LLMs, varying in model size (from 6B to 70B) and types—pretrained language model (PLM), supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). Our findings from MMLU and C-Eval revealed that accuracy for individual questions lacks robustness, particularly in smaller models (<30B) and PLMs. Consequently, we advocate that consistent accuracy may serve as a more reliable metric for evaluating and ranking LLMs.

2023

pdf bib
Understanding and Improving the Robustness of Terminology Constraints in Neural Machine Translation
Huaao Zhang | Qiang Wang | Bo Qin | Zelin Shi | Haibo Wang | Ming Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we study the robustness of two typical terminology translation methods: Placeholder (PH) and Code-Switch (CS), concerning (1) the number of constraints and (2) the target constraint length. We identify that existing terminology constraint test sets, such as IATE, Wiktionary, and TICO, are blind to this issue due to oversimplified constraint settings. To solve it, we create a new challenging test set of English-German, increasing the average constraint count per sentence from 1.1~1.7 to 6.1 and the length per target constraint from 1.1~1.2 words to 3.4 words. Then we find that PH and CS methods degrade as the number of constraints increases, but they have complementary strengths. Specifically, PH is better at retaining high constraint accuracy but lower translation quality as measured by BLEU and COMET scores. In contrast, CS has the opposite results. Based on these observations, we propose a simple but effective method combining the advantages of PH and CS. This approach involves training a model like PH to predict the term labels, and then during inference replacing those labels with target terminology text like CS, so that the subsequent generation is aware of the target term content. Extensive experimental results show that this approach can achieve high constraint accuracy and translation quality simultaneously, regardless of the number or length of constraints.

pdf bib
Hybrid-Regressive Paradigm for Accurate and Speed-Robust Neural Machine Translation
Qiang Wang | Xinhui Hu | Ming Chen
Findings of the Association for Computational Linguistics: ACL 2023

This work empirically confirms that non-autoregressive translation (NAT) is less robust in decoding batch size and hardware settings than autoregressive translation (AT). To address this issue, we demonstrate that prompting a small number of AT predictions can significantly reduce the performance gap between AT and NAT through synthetic experiments. Following this line, we propose hybrid-regressive translation (HRT), a two-stage translation prototype that combines the strengths of AT and NAT. Specifically, HRT first generates discontinuous sequences via autoregression (e.g., make a prediction for every k tokens, k>1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. Experiments on five translation tasks show that HRT achieves comparable translation quality with AT while having at least 1.5x faster inference regardless of batch size and device. Additionally, HRT successfully inherits the sound characteristics of AT in the deep-encoder-shallow-decoder architecture, allowing for further speedup without BLEU loss.

2022

pdf bib
The RoyalFlush System for the WMT 2022 Efficiency Task
Bo Qin | Aixin Jia | Qiang Wang | Jianning Lu | Shuqin Pan | Haibo Wang | Ming Chen
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task. Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation (HRT) to combine the advantages of autoregressive and non-autoregressive translation. Specifically, HRT first autoregressively generates a discontinuous sequence (e.g., make a prediction every k tokens, k1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. Thus, we can easily trade off the translation quality and speed by adjusting k. In addition, by integrating other modeling techniques (e.g., sequence-level knowledge distillation and deep-encoder-shallow-decoder layer allocation strategy) and a mass of engineering efforts, HRT improves 80% inference speed and achieves equivalent translation performance with the same-capacity AT counterpart. Our fastest system reaches 6k+ words/second on the GPU latency setting, estimated to be about 3.1x faster than the last year’s winner.

pdf bib
Learning Decoupled Retrieval Representation for Nearest Neighbour Neural Machine Translation
Qiang Wang | Rongxiang Weng | Ming Chen
Proceedings of the 29th International Conference on Computational Linguistics

K-Nearest Neighbor Neural Machine Translation (kNNMT) successfully incorporates external corpus by retrieving word-level representations at test time. Generally, kNNMT borrows the off-the-shelf context representation in the translation task, e.g., the output of the last decoder layer, as the query vector of the retrieval task. In this work, we highlight that coupling the representations of these two tasks is sub-optimal for fine-grained retrieval. To alleviate it, we leverage supervised contrastive learning to learn the distinctive retrieval representation derived from the original context representation. We also propose a fast and effective approach to constructing hard negative samples. Experimental results on five domains show that our approach improves the retrieval accuracy and BLEU score compared to vanilla kNNMT.

2020

pdf bib
WeChat Neural Machine Translation Systems for WMT20
Fandong Meng | Jianhao Yan | Yijin Liu | Yuan Gao | Xianfeng Zeng | Qinsong Zeng | Peng Li | Ming Chen | Jie Zhou | Sifan Liu | Hao Zhou
Proceedings of the Fifth Conference on Machine Translation

We participate in the WMT 2020 shared newstranslation task on Chinese→English. Our system is based on the Transformer (Vaswaniet al., 2017a) with effective variants and the DTMT (Meng and Zhang, 2019) architecture. In our experiments, we employ data selection, several synthetic data generation approaches (i.e., back-translation, knowledge distillation, and iterative in-domain knowledge transfer), advanced finetuning approaches and self-bleu based model ensemble. Our constrained Chinese→English system achieves 36.9 case-sensitive BLEU score, which is thehighest among all submissions.

2019

pdf bib
Fine-tune BERT with Sparse Self-Attention Mechanism
Baiyun Cui | Yingming Li | Ming Chen | Zhongfei Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we develop a novel Sparse Self-Attention Fine-tuning model (referred as SSAF) which integrates sparsity into self-attention mechanism to enhance the fine-tuning performance of BERT. In particular, sparsity is introduced into the self-attention by replacing softmax function with a controllable sparse transformation when fine-tuning with BERT. It enables us to learn a structurally sparse attention distribution, which leads to a more interpretable representation for the whole input. The proposed model is evaluated on sentiment analysis, question answering, and natural language inference tasks. The extensive experimental results across multiple datasets demonstrate its effectiveness and superiority to the baseline methods.

2018

pdf bib
Deep Attentive Sentence Ordering Network
Baiyun Cui | Yingming Li | Ming Chen | Zhongfei Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a novel deep attentive sentence ordering network (referred as ATTOrderNet) which integrates self-attention mechanism with LSTMs in the encoding of input sentences. It enables us to capture global dependencies among sentences regardless of their input order and obtains a reliable representation of the sentence set. With this representation, a pointer network is exploited to generate an ordered sequence. The proposed model is evaluated on Sentence Ordering and Order Discrimination tasks. The extensive experimental results demonstrate its effectiveness and superiority to the state-of-the-art methods.