Jimmy Huang


2024

pdf bib
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar | Sawsan Alqahtani | M Saiful Bari | Mizanur Rahman | Mohammad Abdullah Matin Khan | Haidar Khan | Israt Jahan | Amran Bhuiyan | Chee Wei Tan | Md Rizwan Parvez | Enamul Hoque | Shafiq Joty | Jimmy Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

2023

pdf bib
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers
Israt Jahan | Md Tahmid Rahman Laskar | Chun Peng | Jimmy Huang
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

ChatGPT is a large language model developed by OpenAI. Despite its impressive performance across various tasks, no prior work has investigated its capability in the biomedical domain yet. To this end, this paper aims to evaluate the performance of ChatGPT on various benchmark biomedical tasks, such as relation extraction, document classification, question answering, and summarization. To the best of our knowledge, this is the first work that conducts an extensive evaluation of ChatGPT in the biomedical domain. Interestingly, we find based on our evaluation that in biomedical datasets that have smaller training sets, zero-shot ChatGPT even outperforms the state-of-the-art fine-tuned generative transformer models, such as BioGPT and BioBART. This suggests that ChatGPT’s pre-training on large text corpora makes it quite specialized even in the biomedical domain. Our findings demonstrate that ChatGPT has the potential to be a valuable tool for various tasks in the biomedical domain that lack large annotated data.

pdf bib
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
Md Tahmid Rahman Laskar | M Saiful Bari | Mizanur Rahman | Md Amran Hossen Bhuiyan | Shafiq Joty | Jimmy Huang
Findings of the Association for Computational Linguistics: ACL 2023

The development of large language models (LLMs) such as ChatGPT has brought a lot of attention recently. However, their evaluation in the benchmark academic datasets remains under-explored due to the difficulty of evaluating the generative outputs produced by this model against the ground truth. In this paper, we aim to present a thorough evaluation of ChatGPT’s performance on diverse academic datasets, covering tasks like question-answering, text summarization, code generation, commonsense reasoning, mathematical problem-solving, machine translation, bias detection, and ethical considerations. Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets. This makes our work the largest evaluation of ChatGPT in NLP benchmarks. In short, our study aims to validate the strengths and weaknesses of ChatGPT in various tasks and provide insights for future research using LLMs. We also report a new emergent ability to follow multi-query instructions that we mostly found in ChatGPT and other instruction-tuned models. Our extensive evaluation shows that even though ChatGPT is capable of performing a wide variety of tasks, and may obtain impressive performance in several benchmark datasets, it is still far from achieving the ability to reliably solve many challenging tasks. By providing a thorough assessment of ChatGPT’s performance across diverse NLP tasks, this paper sets the stage for a targeted deployment of ChatGPT-like LLMs in real-world applications.

pdf bib
Learning Query Adaptive Anchor Representation for Inductive Relation Prediction
Zhiwen Xie | Yi Zhang | Jin Liu | Guangyou Zhou | Jimmy Huang
Findings of the Association for Computational Linguistics: ACL 2023

Relation prediction on knowledge graphs (KGs) attempts to infer the missing links between entities. Most previous studies are limited to the transductive setting where all entities must be seen during the training, making them unable to perform reasoning on emerging entities. Recently, the inductive setting is proposed to handle the entities in the test phase to be unseen during training, However, it suffers from the inefficient reasoning under the enclosing subgraph extraction issue and the lack of effective entity-independent feature modeling. To this end, we propose a novel Query Adaptive Anchor Representation (QAAR) model for inductive relation prediction. First, we extract one opening subgraph and perform reasoning by one time for all candidate triples, which is more efficient when the number of candidate triples is large. Second, we define some query adaptive anchors which are independent on any specific entity. Based on these anchors, we take advantage of the transferable entity-independent features (relation-aware, structure-aware and distance features) that can be used to produce entity embeddings for emerging unseen entities. Such entity-independent features is modeled by a query-aware graph attention network on the opening subgraph. Experimental results demonstrate that our proposed QAAR outperforms state-of-the-art baselines in inductive relation prediction task.

pdf bib
Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization
Md Tahmid Rahman Laskar | Mizanur Rahman | Israt Jahan | Enamul Hoque | Jimmy Huang
Findings of the Association for Computational Linguistics: EMNLP 2023

Debatepedia is a publicly available dataset consisting of arguments and counter-arguments on controversial topics that has been widely used for the single-document query-focused abstractive summarization task in recent years. However, it has been recently found that this dataset is limited by noise and even most queries in this dataset do not have any relevance to the respective document. In this paper, we study whether large language models (LLMs) can be utilized to clean the Debatepedia dataset to make it suitable for query-focused abstractive summarization. More specifically, we harness the language generation capabilities of two LLMs, namely, ChatGPT and PaLM to regenerate its queries. Based on our experiments, we find that solely depending on large language models for query correction may not be very useful for data cleaning. However, we observe that leveraging a rule-based approach for data sampling followed by query regeneration using LLMs (especially ChatGPT) for the sampled instances may ensure a higher quality version of this dataset suitable for the development of more generalized query-focused text summarization models.