2024
pdf
bib
abs
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Jiangshu Du
|
Yibo Wang
|
Wenting Zhao
|
Zhongfen Deng
|
Shuaiqi Liu
|
Renze Lou
|
Henry Peng Zou
|
Pranav Narayanan Venkit
|
Nan Zhang
|
Mukund Srinath
|
Haoran Ranran Zhang
|
Vipul Gupta
|
Yinghui Li
|
Tao Li
|
Fei Wang
|
Qin Liu
|
Tianlin Liu
|
Pengzhi Gao
|
Congying Xia
|
Chen Xing
|
Cheng Jiayang
|
Zhaowei Wang
|
Ying Su
|
Raj Sanjay Shah
|
Ruohao Guo
|
Jing Gu
|
Haoran Li
|
Kangda Wei
|
Zihao Wang
|
Lu Cheng
|
Surangika Ranathunga
|
Meng Fang
|
Jie Fu
|
Fei Liu
|
Ruihong Huang
|
Eduardo Blanco
|
Yixin Cao
|
Rui Zhang
|
Philip S. Yu
|
Wenpeng Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Claim: This work is not advocating the use of LLMs for paper (meta-)reviewing. Instead, wepresent a comparative analysis to identify and distinguish LLM activities from human activities. Two research goals: i) Enable better recognition of instances when someone implicitly uses LLMs for reviewing activities; ii) Increase community awareness that LLMs, and AI in general, are currently inadequate for performing tasks that require a high level of expertise and nuanced judgment.This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?This study focuses on the topic of LLMs as NLP Researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with “deficiency” labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) “LLMs as Reviewers”, how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) “LLMs as Metareviewers”, how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
pdf
bib
abs
Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Siyu Yuan
|
Cheng Jiayang
|
Lin Qiu
|
Deqing Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available athttps://github.com/siyuyuan/SCUA.
pdf
bib
abs
ECON: On the Detection and Resolution of Evidence Conflicts
Cheng Jiayang
|
Chunkit Chan
|
Qianqian Zhuang
|
Lin Qiu
|
Tianhang Zhang
|
Tengxiao Liu
|
Yangqiu Song
|
Yue Zhang
|
Pengfei Liu
|
Zheng Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or “inter-evidence conflicts.” This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs’ conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.
pdf
bib
abs
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Ying Su
|
Zhan Ling
|
Haochen Shi
|
Cheng Jiayang
|
Yauwai Yim
|
Yangqiu Song
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models(LLMs) have been adopted to process textual task description and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still lack of study on how vision language models(VLMs) behave when multi-modal task inputs are considered. Counterfactual planning that evaluates the model’s reasoning ability over alternative task situations are also under exploited. In order to evaluate the planning ability of both multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describing one activity has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is action sequences over the objects in provided scenes. Both the correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs are still struggling at generating human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by finetuning over BLEURT model to facilitate future research on our benchmark.
pdf
bib
abs
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Yuxiang Zheng
|
Shichao Sun
|
Lin Qiu
|
Dongyu Ru
|
Cheng Jiayang
|
Xuefeng Li
|
Jifan Lin
|
Binjie Wang
|
Yun Luo
|
Renjie Pan
|
Yang Xu
|
Qingkai Min
|
Zizhao Zhang
|
Yiwen Wang
|
Wenjie Li
|
Pengfei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built based on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers’ queries, search from the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: https://github.com/GAIR-NLP/OpenResearcher.
pdf
bib
abs
Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
Chunkit Chan
|
Cheng Jiayang
|
Weiqi Wang
|
Yuxin Jiang
|
Tianqing Fang
|
Xin Liu
|
Yangqiu Song
Findings of the Association for Computational Linguistics: EACL 2024
This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT’s promising performance across various tasks, we proceed to carry out thorough evaluations on the whole test sets of 11 datasets, including temporal and causal relations, PDTB2.0-based, and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task, including the zero-shot prompt template, zero-shot prompt engineering (PE) template, and in-context learning (ICL) prompt template, to establish the initial baseline scores for all popular sentence-pair relation classification tasks for the first time. Through our study, we discover that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, albeit it may not possess the same level of expertise in identifying the temporal order between two events. While it is capable of identifying the majority of discourse relations with existing explicit discourse connectives, the implicit discourse relation remains a formidable challenge. Concurrently, ChatGPT demonstrates subpar performance in the dialogue discourse parsing task that requires structural understanding in a dialogue before being aware of the discourse relation.
pdf
bib
abs
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
Chunkit Chan
|
Cheng Jiayang
|
Yauwai Yim
|
Zheye Deng
|
Wei Fan
|
Haoran Li
|
Xin Liu
|
Hongming Zhang
|
Weiqi Wang
|
Yangqiu Song
Findings of the Association for Computational Linguistics: EMNLP 2024
Large Language Models (LLMs) have sparked substantial interest and debate concerning their potential emergence of Theory of Mind (ToM) ability. Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, which lacks evaluation of machine ToM ability in real-world human interaction scenarios. This poses a pressing demand to develop new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.
pdf
bib
abs
CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning
Weiqi Wang
|
Tianqing Fang
|
Chunyang Li
|
Haochen Shi
|
Wenxuan Ding
|
Baixuan Xu
|
Zhaowei Wang
|
Jiaxin Bai
|
Xin Liu
|
Cheng Jiayang
|
Chunkit Chan
|
Yangqiu Song
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The sequential process of conceptualization and instantiation is essential to generalizable commonsense reasoning as it allows the application of existing knowledge to unfamiliar scenarios. However, existing works tend to undervalue the step of instantiation and heavilyrely on pre-built concept taxonomies and human annotations to collect both types of knowledge, resulting in a lack of instantiated knowledge to complete reasoning, high cost, and limited scalability. To tackle these challenges, we introduce CANDLE (ConceptuAlizationand INstantiation Distillation from Large Language ModEls), a distillation framework that iteratively performs contextualized conceptualization and instantiation over commonsense knowledge bases by instructing large language models to generate both types of knowledge with critic filtering. By applying CANDLE to ATOMIC (Sap et al., 2019a), we construct a comprehensive knowledge base comprising six million conceptualizations and instantiated commonsense knowledge triples. Both types of knowledge are firmly rooted in the original ATOMIC dataset, and intrinsic evaluations demonstrate their exceptional quality and diversity. Empirical results indicate that distilling CANDLE on student models provides benefits across three downstream tasks. Our data and models are publicly available at https://github.com/HKUST-KnowComp/CANDLE.
pdf
bib
abs
EventGround: Narrative Reasoning by Grounding to Eventuality-centric Knowledge Graphs
Cheng Jiayang
|
Lin Qiu
|
Chunkit Chan
|
Xin Liu
|
Yangqiu Song
|
Zheng Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down knowledge structures and lacks interpretability. Others explicitly collect world knowledge of eventualities into structured eventuality-centric knowledge graphs (KGs). However, existing research on leveraging these knowledge sources for free-texts is limited. In this work, we propose an initial comprehensive framework called EventGround, which aims to tackle the problem of grounding free-texts to eventuality-centric KGs for contextualized narrative reasoning. We identify two critical problems in this direction: the event representation and sparsity problems. We provide simple yet effective parsing and partial information extraction methods to tackle these problems. Experimental results demonstrate that our approach consistently outperforms baseline models when combined with graph neural network (GNN) or large language model (LLM) based graph reasoning models. Our framework, incorporating grounded knowledge, achieves state-of-the-art performance while providing interpretable evidence.
2023
pdf
bib
abs
StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding
Cheng Jiayang
|
Lin Qiu
|
Tsz Chan
|
Tianqing Fang
|
Weiqi Wang
|
Chunkit Chan
|
Dongyu Ru
|
Qipeng Guo
|
Hongming Zhang
|
Yangqiu Song
|
Yue Zhang
|
Zheng Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.