Wenxuan Wang


2024

pdf bib
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
Yuxuan Wan | Wenxuan Wang | Yiliu Yang | Youliang Yuan | Jen-tse Huang | Pinjia He | Wenxiang Jiao | Michael Lyu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs’ prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has been challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of reasoning processes. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning prowess of LLMs. Our methodology reveals significant gaps in LLMs’ learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs’ formal reasoning capabilities. We make our code, data, and results publicly available(https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.

pdf bib
On the Reliability of Psychological Scales on Large Language Models
Jen-tse Huang | Wenxiang Jiao | Man Ho Lam | Eric John Li | Wenxuan Wang | Michael Lyu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent research has focused on examining Large Language Models’ (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups—a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.

pdf bib
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang | Yisi Zhang | Xingjian He | Yichen Yan | Zijia Zhao | Xinlong Wang | Jing Liu
Findings of the Association for Computational Linguistics: ACL 2024

Visual grounding (VG) aims at locating the foreground entities that match the given natural language expression. Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Since users usually prefer to provide the intention-based expressions for the desired object instead of covering all the details, it is necessary for the agents to interpret the intention-driven instructions. Thus, in this work, we take a step further to the intention-driven visual-language (V-L) understanding. To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a largest-scale IVG dataset named IntentionVG with free-form intention expressions. Considering that practical agents need to move and find specific targets among various scenarios to realize the grounding task, our IVG task and IntentionVG dataset have taken the crucial properties of both multi-scenario perception and egocentric view into consideration. Besides, various types of models are set up as the baselines to realize our IVG task. Extensive experiments on our IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available at https://github.com/Rubics-Xuan/IVG.

pdf bib
All Languages Matter: On the Multilingual Safety of LLMs
Wenxuan Wang | Zhaopeng Tu | Chang Chen | Youliang Yuan | Jen-tse Huang | Wenxiang Jiao | Michael Lyu
Findings of the Association for Computational Linguistics: ACL 2024

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose a simple and effective prompting method to improve the multilingual safety of ChatGPT by enhancing cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses by 42% for non-English queries. We will release all the data and results to facilitate future research on LLMs’ safety.

pdf bib
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models
Wenxuan Wang | Wenxiang Jiao | Jingyuan Huang | Ruyi Dai | Jen-tse Huang | Zhaopeng Tu | Michael Lyu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper identifies a cultural dominance issue within large language models (LLMs) due to the predominant use of English data in model training (e.g., ChatGPT). LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages. To systematically evaluate the cultural dominance issue, we build a benchmark of concrete (e.g., holidays and songs) and abstract (e.g., values and opinions) cultural objects. Empirical results show that the representative GPT models suffer from the culture dominance problem, where GPT-4 is the most affected while text-davinci-003 suffers the least from this problem. Our study emphasizes the need to critically examine cultural dominance and ethical considerations in their development and deployment. We show that two straightforward methods in model development (i.e., pretraining on more diverse data) and deployment (e.g., culture-aware prompting) can significantly mitigate the cultural dominance issue in LLMs.

pdf bib
Does ChatGPT Know That It Does Not Know? Evaluating the Black-Box Calibration of ChatGPT
Youliang Yuan | Wenxuan Wang | Qingshuo Guo | Yiming Xiong | Chihao Shen | Pinjia He
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recently, ChatGPT has demonstrated remarkable performance in various downstream tasks such as open-domain question answering, machine translation, and code generation. As a general-purpose task solver, an intriguing inquiry arises: Does ChatGPT itself know that it does not know, without any access to internal states? In response to this query, we present an initial evaluation of ChatGPT for black-box calibration. We designed three types of proxy confidence, from three perspectives to assess its performance. Experiments are conducted on five datasets, spanning four tasks, and the results show that ChatGPT has a degree of capability for black-box calibration. Specifically, proxy confidence displayed a significantly positive Pearson correlation (95.16%) with accuracy in the TruthfulQA dataset, while revealing a negative correlation in the ModAr dataset. We delved deeper into ChatGPT’s black-box calibration ability by examining failure cases in the ModAr dataset. Our analysis revealed that ChatGPT’s tendency to exhibit overconfidence may stem from its reliance on semantic priors. Furthermore, we investigated why ChatGPT performs relatively well in TruthfulQA. The findings suggest that ChatGPT might implicitly acquire calibration skills during the reinforcement learning process, rather than relying solely on simplistic heuristics.

2023

pdf bib
ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback
Wenxiang Jiao | Jen-tse Huang | Wenxuan Wang | Zhiwei He | Tian Liang | Xing Wang | Shuming Shi | Zhaopeng Tu
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) like ChatGPT have exhibited remarkable abilities on a wide range of natural language processing (NLP) tasks, including various machine translation abilities accomplished during chat. However, these models are only accessible through restricted APIs, which creates barriers to new research and advancements in the field. Therefore, we propose ParroT, a framework to enhance and regulate the translation abilities during chat based on open-source LLMs (e.g., LLaMA), human-written translation and feedback data. Specifically, ParroT reformulates translation data into the instruction-following style, and introduces a “Hint” field for incorporating extra requirements to regulate the translation process. Accordingly, we propose three instruction types for finetuning ParroT models, including translation instruction, contrastive instruction, and error-guided instruction. Experiments on Flores subsets and WMT22 test sets suggest that translation instruction improves the translation performance of vanilla LLMs significantly while error-guided instruction can lead to further improvement, which demonstrates the importance of learning from low-quality translations annotated by humans. We also demonstrate the potential of automatic evaluation tools in providing quality information of translations, when constructing error-guided instructions for directions that lack human annotation data. Please refer to our Github project for more implementation details: https://github.com/wxjiao/ParroT.

2022

pdf bib
Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation
Wenxuan Wang | Wenxiang Jiao | Yongchang Hao | Xing Wang | Shuming Shi | Zhaopeng Tu | Michael Lyu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we present a substantial step in better understanding the SOTA sequence-to-sequence (Seq2Seq) pretraining for neural machine translation (NMT). We focus on studying the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT. By carefully designing experiments on three language pairs, we find that Seq2Seq pretraining is a double-edged sword: On one hand, it helps NMT models to produce more diverse translations and reduce adequacy-related translation errors. On the other hand, the discrepancies between Seq2Seq pretraining and NMT finetuning limit the translation quality (i.e., domain discrepancy) and induce the over-estimation issue (i.e., objective discrepancy). Based on these observations, we further propose simple and effective strategies, named in-domain pretraining and input adaptation to remedy the domain and objective discrepancies, respectively. Experimental results on several language pairs show that our approach can consistently improve both translation performance and model robustness upon Seq2Seq pretraining.

pdf bib
Tencent’s Multilingual Machine Translation System for WMT22 Large-Scale African Languages
Wenxiang Jiao | Zhaopeng Tu | Jiarui Li | Wenxuan Wang | Jen-tse Huang | Shuming Shi
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes Tencent’s multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages. We participated in the constrained translation track in which only the data and pretrained models provided by the organizer are allowed. The task is challenging due to three problems, including the absence of training data for some to-be-evaluated language pairs, the uneven optimization of language pairs caused by data imbalance, and the curse of multilinguality. To address these problems, we adopt data augmentation, distributionally robust optimization, and language family grouping, respectively, to develop our multilingual neural machine translation (MNMT) models. Our submissions won the 1st place on the blind test sets in terms of the automatic evaluation metrics. Codes, models, and detailed competition results are available at https://github.com/wxjiao/WMT2022-Large-Scale-African.

2020

pdf bib
Rethinking the Value of Transformer Components
Wenxuan Wang | Zhaopeng Tu
Proceedings of the 28th International Conference on Computational Linguistics

Transformer becomes the state-of-the-art translation model, while it is not well studied how each intermediate component contributes to the model performance, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual component (sub-layer) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than the others. We also report a number of interesting findings that might help humans better analyze, understand and improve Transformer models. Based on these observations, we further propose a new training strategy that can improves translation performance by distinguishing the unimportant components in training.