Understanding the logical structure and relations of natural language is a core task of machine understanding and a key research topic in artificial intelligence. With the growth of big data and computing power, pre-trained language models have made notable progress in logical reasoning, making the logical reasoning ability of large-scale models a new research focus. This survey aims to comprehensively review research progress on logical reasoning with large models, and to discuss its importance for evaluating the intelligence level of AI systems and its role in advancing artificial intelligence. The paper first delimits the research scope of the logical reasoning capabilities of large models, systematically discusses the types and characteristics of logical reasoning, and reviews the development of related theories, providing a clear framework for the study. It then introduces the foundational work in logical reasoning research from the perspectives of task formats and data benchmarks, providing a baseline for understanding the performance of large models. The paper further analyzes the current state of large models' logical reasoning ability, demonstrating their capabilities through case studies across different reasoning types. It also examines methods for improving the logical reasoning ability of large models, including pre-training, instruction tuning, decoding strategies, and neuro-symbolic hybrid methods, and compares these approaches. Finally, it offers an outlook on future research directions, aiming to stimulate further academic discussion and exploration and to advance research on logical reasoning capabilities.
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark, named GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks across 21 widely used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.
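The ID-to-OOD degradation the abstract reports can be quantified as a simple accuracy gap. The sketch below is a hypothetical illustration of that measurement; the function name, dataset keys, and scores are placeholders, not figures from the GLUE-X benchmark.

```python
# Hypothetical sketch: measuring the gap between in-distribution (ID)
# accuracy and out-of-distribution (OOD) accuracy per test dataset.
# All names and numbers are illustrative placeholders.

def ood_degradation(id_acc: float, ood_accs: dict) -> dict:
    """Return the per-dataset accuracy drop and the average drop."""
    drops = {name: id_acc - acc for name, acc in ood_accs.items()}
    avg_drop = sum(drops.values()) / len(drops)
    return {"per_dataset_drop": drops, "average_drop": avg_drop}

result = ood_degradation(
    id_acc=0.92,
    ood_accs={"dataset_a": 0.81, "dataset_b": 0.76},  # placeholder scores
)
print(round(result["average_drop"], 3))  # → 0.135
```

A positive average drop indicates the model loses accuracy when the test distribution shifts away from the training distribution, which is the degradation pattern the benchmark is designed to expose.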
Generative Pre-trained Transformer 4 (GPT-4) demonstrates impressive chain-of-thought reasoning ability. Recent work on self-instruction tuning, such as Alpaca, has focused on enhancing the general proficiency of models. These instructions enable models to achieve performance comparable to GPT-3.5 on general tasks such as open-domain text generation and paraphrasing, but fall short of helping models handle complex reasoning tasks. To bridge this gap, this paper presents LogiCoT, a new instruction-tuning dataset for logical chain-of-thought reasoning with GPT-4. We elaborate on the process of harvesting instructions by prompting GPT-4 to generate chain-of-thought rationales. LogiCoT serves as an instruction set for teaching models logical reasoning and eliciting general reasoning skills.
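The harvesting step the abstract describes amounts to composing prompts that ask a teacher model for a step-by-step rationale before the answer. The sketch below shows one plausible prompt template; the wording and field names are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical sketch of building a chain-of-thought harvesting prompt
# in the spirit of LogiCoT. The template wording is an assumption.

def build_cot_prompt(premises: str, question: str) -> str:
    """Compose an instruction asking a teacher model (e.g. GPT-4)
    for a step-by-step logical rationale followed by a final answer."""
    return (
        "Given the premises below, answer the question.\n"
        f"Premises: {premises}\n"
        f"Question: {question}\n"
        "Reason step by step, then state the final answer."
    )

prompt = build_cot_prompt(
    premises="All birds can fly. Tweety is a bird.",
    question="Can Tweety fly?",
)
print(prompt)
```

The teacher model's response (rationale plus answer) would then be paired with the prompt to form one instruction-tuning example.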
Aspect category sentiment analysis (ACSA) has attracted increasing research attention. The dominant methods make use of pre-trained language models by learning effective aspect category-specific representations and adding task-specific output layers on top of the pre-trained representations. We consider a more direct way of making use of pre-trained language models, by casting ACSA tasks into natural language generation tasks, using natural language sentences to represent the output. Our method allows more direct use of pre-trained knowledge in seq2seq language models by directly following the task setting during pre-training. Experiments on several benchmarks show that our method gives the best reported results, with large advantages in few-shot and zero-shot settings.
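Casting ACSA into generation means verbalizing each (aspect category, polarity) label as a target sentence for a seq2seq model, and parsing the label back out of the generated text. The sketch below illustrates that idea; the template wording and helper names are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of casting ACSA into a generation task: the label
# becomes a natural-language target sentence. Template is an assumption.

LABEL_TEMPLATE = "the sentiment polarity of {category} is {polarity}"

def verbalize(category: str, polarity: str) -> str:
    """Turn an (aspect category, polarity) pair into a target sentence."""
    return LABEL_TEMPLATE.format(category=category, polarity=polarity)

def parse(generated: str):
    """Recover the (category, polarity) pair from a generated sentence."""
    prefix = "the sentiment polarity of "
    rest = generated[len(prefix):]
    category, _, polarity = rest.rpartition(" is ")
    return category, polarity

target = verbalize("food quality", "positive")
print(parse(target))  # → ('food quality', 'positive')
```

Because the target sentence resembles text seen during pre-training, the seq2seq model can exploit its pre-trained knowledge directly, which is the intuition behind the few-shot and zero-shot gains the abstract reports.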