2024
pdf
bib
abs
Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
Yuyan Chen
|
Chenwei Wu
|
Songzhou Yan
|
Panjun Liu
|
Yanghua Xiao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs’ capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored.In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles.Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl’s taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs’ outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.
pdf
bib
DGLF: A Dual Graph-based Learning Framework for Multi-modal Sarcasm Detection
Zhihong Zhu
|
Kefan Shen
|
Zhaorun Chen
|
Yunyan Zhang
|
Yuyan Chen
|
Xiaoqi Jiao
|
Zhongwei Wan
|
Shaorong Xie
|
Wei Liu
|
Xian Wu
|
Yefeng Zheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
pdf
bib
abs
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
Yuyan Chen
|
Songzhou Yan
|
Sijia Liu
|
Yueze Li
|
Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2024
Emotional intelligence in large language models (LLMs) is of great importance in Natural Language Processing. However, the previous research mainly focus on basic sentiment analysis tasks, such as emotion recognition, which is not enough to evaluate LLMs’ overall emotional intelligence. Therefore, this paper presents a novel framework named EmotionQueen for evaluating the emotional intelligence of LLMs. The framework includes four distinctive tasks: Key Event Recognition, Mixed Event Recognition, Implicit Emotional Recognition, and Intention Recognition. LLMs are requested to recognize important event or implicit emotions and generate empathetic response.We also design two metrics to evaluate LLMs’ capabilities in recognition and response for emotion-related statements. Experiments yield significant conclusions about LLMs’ capabilities and limitations in emotion intelligence.
pdf
bib
abs
HOTVCOM: Generating Buzzworthy Comments for Videos
Yuyan Chen
|
Songzhou Yan
|
Qingpei Guo
|
Jiyuan Jia
|
Zhixu Li
|
Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2024
In the era of social media video platforms, popular “hot-comments” play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purpose. However, existing research predominantly focuses on generating descriptive comments or “danmaku” in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HOTVCOM, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.
pdf
bib
abs
Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios?
Yuyan Chen
|
Yueze Li
|
Songzhou Yan
|
Sijia Liu
|
Jiaqing Liang
|
Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2024
The evaluation of the problem-solving capability under incomplete information scenarios of Large Language Models (LLMs) is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focus on LLMs’ problem-solving capability such as “Twenty Questions”.However, these kinds of games do not require recognizing misleading cues which are necessary in the incomplete information scenario.Moreover, the existing game such as “Who is undercover” are highly subjective, making it challenging for evaluation.Therefore, in this paper, we introduce a novel game named BrainKing based on the “Who is undercover” and “Twenty Questions” for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights of LLM problem-solving levels.
2023
pdf
bib
abs
MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization
Yuyan Chen
|
Zhihao Wen
|
Ge Fan
|
Zhengyu Chen
|
Wei Wu
|
Dayiheng Liu
|
Zhixu Li
|
Bang Liu
|
Yanghua Xiao
Findings of the Association for Computational Linguistics: EMNLP 2023
Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLM), has drawn a lot of attention from the research community. The existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording, but also binds to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. Then we novelly propose a model-adaptive prompt optimizer (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements over various downstream tasks.
2022
pdf
bib
abs
Efficient Two-Stage Progressive Quantization of BERT
Charles Le
|
Arash Ardakani
|
Amir Ardakani
|
Hang Zhang
|
Yuyan Chen
|
James Clark
|
Brett Meyer
|
Warren Gross
Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)
The success of large BERT models has raised the demand for model compression methods to reduce model size and computational cost. Quantization can reduce the model size and inference latency, making inference more efficient, without changing its stucture, but it comes at the cost of performance degradation. Due to the complex loss landscape of ternarized/binarized BERT, we present an efficient two-stage progressive quantization method in which we fine tune the model with quantized weights and progressively lower its bits, and then we fine tune the model with quantized weights and activations. At the same time, we strategically choose which bitwidth to fine-tune on and to initialize from, and which bitwidth to fine-tune under augmented data to outperform the existing BERT binarization methods without adding an extra module, compressing the binary model 18% more than previous binarization methods or compressing BERT by 31x w.r.t. to the full-precision model. Our method without data augmentation can outperform existing BERT ternarization methods.