Cheng-Han Chiang


2024

pdf bib
Over-Reasoning and Redundant Calculation of Large Language Models
Cheng-Han Chiang | Hung-yi Lee
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Large language models (LLMs) can solve problems step-by-step.While this chain-of-thought (CoT) reasoning boosts LLMs’ performance, it is unclear if LLMs know when to use CoT and whether those CoT are always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero.GSM8K-Zero is constructed such that the questions can be answered without any calculations, but LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations to answer the questions.We also conduct experiments to explain why LLMs generate redundant calculations and reasonings.

2023

pdf bib
Are Synonym Substitution Attacks Really Synonym Substitution Attacks?
Cheng-Han Chiang | Hung-yi Lee
Findings of the Association for Computational Linguistics: ACL 2023

In this paper, we explore the following question: Are synonym substitution attacks really synonym substitution attacks (SSAs)?We approach this question by examining how SSAs replace words in the original sentence and show that there are still unresolved obstacles that make current SSAs generate invalid adversarial samples. We reveal that four widely used word substitution methods generate a large fraction of invalid substitution words that are ungrammatical or do not preserve the original sentence’s semantics. Next, we show that the semantic and grammatical constraints used in SSAs for detecting invalid word replacements are highly insufficient in detecting invalid adversarial samples.

pdf bib
A Closer Look into Using Large Language Models for Automatic Evaluation
Cheng-Han Chiang | Hung-yi Lee
Findings of the Association for Computational Linguistics: EMNLP 2023

Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some existing prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze *LLM evaluation* and *G-Eval*, and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between the ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.

pdf bib
Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
Cheng-Han Chiang | Hung-yi Lee | Yung-Sung Chuang | James Glass
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

pdf bib
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang | Hung-yi Lee
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs.We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.

2022

pdf bib
Recent Advances in Pre-trained Language Models: Why Do They Work and How Do They Work
Cheng-Han Chiang | Yung-Sung Chuang | Hung-yi Lee
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts

Pre-trained language models (PLMs) are language models that are pre-trained on large-scaled corpora in a self-supervised fashion. These PLMs have fundamentally changed the natural language processing community in the past few years. In this tutorial, we aim to provide a broad and comprehensive introduction from two perspectives: why those PLMs work, and how to use them in NLP tasks. The first part of the tutorial shows some insightful analysis on PLMs that partially explain their exceptional downstream performance. The second part first focuses on emerging pre-training methods that enable PLMs to perform diverse downstream tasks and then illustrates how one can apply those PLMs to downstream tasks under different circumstances. These circumstances include fine-tuning PLMs when under data scarcity, and using PLMs with parameter efficiency. We believe that attendees of different backgrounds would find this tutorial informative and useful.

2020

pdf bib
Pretrained Language Model Embryology: The Birth of ALBERT
Cheng-Han Chiang | Sung-Feng Huang | Hung-yi Lee
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While behaviors of pretrained language models (LMs) have been thoroughly examined, what happened during pretraining is rarely studied. We thus investigate the developmental process from a set of randomly initialized parameters to a totipotent language model, which we refer to as the embryology of a pretrained language model. Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) in different learning speeds during pretraining. We also find that linguistic knowledge and world knowledge do not generally improve as pretraining proceeds, nor do downstream tasks’ performance. These findings suggest that knowledge of a pretrained model varies during pretraining, and having more pretrain steps does not necessarily provide a model with more comprehensive knowledge. We provide source codes and pretrained models to reproduce our results at https://github.com/d223302/albert-embryology.