2024
FOLIO: Natural Language Reasoning with First-Order Logic
Simeng Han | Hailey Schoelkopf | Yilun Zhao | Zhenting Qi | Martin Riddell | Wenfei Zhou | James Coady | David Peng | Yujie Qiao | Luke Benson | Lucy Sun | Alexander Wardle-Solano | Hannah Szabó | Ekaterina Zubova | Matthew Burtell | Jonathan Fan | Yixin Liu | Brian Wong | Malcolm Sailor | Ansong Ni | Linyong Nan | Jungo Kasai | Tao Yu | Rui Zhang | Alexander Fabbri | Wojciech Maciej Kryscinski | Semih Yavuz | Ye Liu | Xi Victoria Lin | Shafiq Joty | Yingbo Zhou | Caiming Xiong | Rex Ying | Arman Cohan | Dragomir Radev
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate for measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason about the validity of each conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, the NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of medium-sized language models under supervised fine-tuning. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO remains a challenge for GPT-4, one of the most capable publicly available LLMs.
Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning
Aosong Feng | Rex Ying | Leandros Tassiulas
Findings of the Association for Computational Linguistics: EMNLP 2024
As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based models is the mismatch between the limited-range modeling power of full attention and the long-range token dependency in the input sequence. In this work, we propose to scale up the attention receptive field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension. The resulting Tensorized Attention can be adopted as an efficient transformer backbone to extend input context length with improved memory and time efficiency. We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to a Kronecker decomposition of full attention. Extensive experiments show that tensorized attention can be used to adapt pretrained LLMs with improved efficiency. Notably, using customized Triton kernels, tensorization enables Llama-8B training with a context length of 32,768 and can steadily extrapolate to a length of 128k during inference with an 11x speedup (compared to full attention with FlashAttention-2).
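The Kronecker claim in this abstract can be illustrated with a small numerical sketch. It is not the paper's implementation: it assumes fixed, input-independent row-stochastic mixing matrices `a` and `b` (real attention weights depend on the input), scalar token values, and a two-dimensional reshape. All function names are hypothetical. It only checks the general identity that mixing along each tensor dimension in turn gives the same result as multiplying the flat sequence by the Kronecker product of the two matrices.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ik * b_jl for a_ik in row_a for b_jl in row_b]
        for row_a in a
        for row_b in b
    ]


def full_mix(weights, x):
    """Mix a flat sequence x with one large weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def two_hop_mix(a, b, x):
    """Reshape x to an n1 x n2 grid, then mix along each dimension in turn."""
    n1, n2 = len(a), len(b)
    grid = [x[i * n2:(i + 1) * n2] for i in range(n1)]
    # Hop 1: mix positions inside each row using b.
    hop1 = [[sum(b[j][l] * row[l] for l in range(n2)) for j in range(n2)]
            for row in grid]
    # Hop 2: mix across rows using a.
    hop2 = [[sum(a[i][k] * hop1[k][j] for k in range(n1)) for j in range(n2)]
            for i in range(n1)]
    return [v for row in hop2 for v in row]


# Illustrative weights (assumed, not learned): each row sums to 1.
a = [[0.5, 0.5], [0.25, 0.75]]
b = [[0.5, 0.25, 0.25], [0.0, 1.0, 0.0], [0.25, 0.25, 0.5]]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

flat = full_mix(kron(a, b), x)
hops = two_hop_mix(a, b, x)
```

In this toy setting the flat route multiplies by an (n1·n2) x (n1·n2) matrix, while the two-hop route only uses an n1 x n1 and an n2 x n2 matrix, which is where the efficiency argument comes from.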
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
Simeng Han | Aaron Yu | Rui Shen | Zhenting Qi | Martin Riddell | Wenfei Zhou | Yujie Qiao | Yilun Zhao | Semih Yavuz | Ye Liu | Shafiq Joty | Yingbo Zhou | Caiming Xiong | Dragomir Radev | Rex Ying | Arman Cohan
Findings of the Association for Computational Linguistics: EMNLP 2024
Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for properly assessing a model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that helps humans annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO ranges from 0 to 20. We further use P-FOLIO to evaluate and improve large language model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, using inference rules that are more diverse and more complex than those in previous work. Given that a single model-generated reasoning chain could take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics for evaluating the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves the model performance by 10% or more on three other out-of-domain logical reasoning datasets.
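For readers unfamiliar with pass@k, the sketch below shows one commonly used unbiased estimator: given n sampled chains of which c are judged correct, it estimates the probability that at least one of k chains drawn without replacement is correct. Whether P-FOLIO uses exactly this form, and how a chain is judged correct, are assumptions here; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled chains, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct chain;
    # it is 0 when fewer than k chains are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 sampled chains, 1 correct, budget of 2 chains.
example = pass_at_k(n=4, c=1, k=2)
```

Because the estimate only depends on whether some sampled chain is correct, a model is not penalized for taking a different reasoning path than the human annotator.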
2023
HiPool: Modeling Long Documents Using Graph Neural Networks
Irene Li | Aosong Feng | Dragomir Radev | Rex Ying
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Encoding long sequences in Natural Language Processing (NLP) is a challenging problem. Though recent pretrained language models achieve satisfactory performance on many NLP tasks, they are still restricted by a pre-defined maximum length, which makes them difficult to extend to longer sequences. Some recent works therefore use hierarchies to model long sequences. However, most of them apply sequential models to the upper levels of the hierarchy and suffer from long-range dependency issues. In this paper, we alleviate these issues through a graph-based method. We first chunk the sequence with a fixed length to model the sentence-level information. We then leverage graphs to model intra- and cross-sentence correlations with a new attention mechanism. Additionally, because standard benchmarks for long document classification (LDC) are limited, we propose a new challenging benchmark, totaling six datasets with up to 53k samples and an average length of 4,034 tokens. Evaluation shows our model surpasses competitive baselines by 2.6% in F1 score, and by 4.8% on the dataset with the longest sequences. Our method is shown to outperform hierarchical sequential models in both performance and scalability, especially for longer sequences.
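The fixed-length chunking step mentioned in this abstract can be sketched as follows. This is only an illustration: the abstract does not say whether chunks overlap or how a short final chunk is handled, so the sketch assumes non-overlapping chunks and keeps the last chunk shorter. The function name and chunk length are hypothetical.

```python
def chunk_tokens(tokens, chunk_len):
    """Split a token list into consecutive chunks of at most chunk_len tokens."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    # Assumption: no overlap and no padding; the final chunk may be shorter.
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]


# Example: a 7-token document split into chunks of length 3.
chunks = chunk_tokens(list(range(7)), 3)
```

Each chunk would then be encoded separately and treated as a low-level node when building the graph over the document.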
2021
Graph Ensemble Learning over Multiple Dependency Trees for Aspect-level Sentiment Classification
Xiaochen Hou | Peng Qi | Guangtao Wang | Rex Ying | Jing Huang | Xiaodong He | Bowen Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Recent work on aspect-level sentiment classification has demonstrated the efficacy of incorporating syntactic structures such as dependency trees with graph neural networks (GNNs), but these approaches are usually vulnerable to parsing errors. To better leverage syntactic information in the face of unavoidable errors, we propose a simple yet effective graph ensemble technique, GraphMerge, to make use of the predictions from different parsers. Instead of assigning one set of model parameters to each dependency tree, we first combine the dependency relations from different parses before applying GNNs over the resulting graph. This allows GNN models to be robust to parse errors at no additional computational cost, and helps avoid overparameterization and overfitting from GNN layer stacking by introducing more connectivity into the ensemble graph. Our experiments on the SemEval 2014 Task 4 and ACL 14 Twitter datasets show that our GraphMerge model not only outperforms models that use a single dependency tree, but also beats other ensemble models without adding model parameters.
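The idea of combining dependency relations from several parses before running a GNN can be sketched as an edge-set union. This is a minimal illustration, not the GraphMerge implementation: it assumes each parse is a list of (head, dependent) token-index pairs for the same tokenization, drops relation labels, and treats edges as undirected. The function name and the toy parses are hypothetical.

```python
def merge_dependency_edges(parses):
    """Union the arcs of several parses of one sentence into one edge list."""
    merged = set()
    for arcs in parses:
        for head, dep in arcs:
            # Assumption: store each arc as an undirected edge (low, high).
            merged.add((min(head, dep), max(head, dep)))
    return sorted(merged)


# Two toy parses of a 3-token sentence that disagree on one attachment.
parse_a = [(1, 0), (1, 2)]
parse_b = [(1, 0), (0, 2)]
ensemble_edges = merge_dependency_edges([parse_a, parse_b])
```

A single GNN then runs over the merged graph, so the number of model parameters does not grow with the number of parsers, and an arc that only one parser gets right still appears in the graph.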