Isabelle Lee
2026
FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale
Isabelle Lee | Sarah Liaw | Dani Yogatama
Findings of the Association for Computational Linguistics: EACL 2026
Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets are too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks—masked operation prediction and step completion—that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.
2024
Self-contradictory reasoning evaluation and detection
Ziyi Liu | Soumya Sanyal | Isabelle Lee | Yongkang Du | Rahul Gupta | Yang Liu | Jieyu Zhao
Findings of the Association for Computational Linguistics: EMNLP 2024
In a plethora of recent work, large language models (LLMs) demonstrated impressive reasoning ability, but many proposed downstream reasoning tasks only focus on performance-wise evaluation. Two fundamental questions persist: 1) how consistent is the reasoning, and 2) can models detect unreliable reasoning? In this paper, we investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support answers. To answer 1), we define and assess the Self-Contra rate across three datasets and delve into finer-grained categories of Self-Contra reasoning. We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense. The model may generate correct answers by taking shortcuts in reasoning or overlooking contextual evidence, leading to compromised reasoning. For 2), we task the state-of-the-art model GPT-4 with identifying Self-Contra reasoning and finer-grained fallacies. We find that finer-grained aided detection can improve GPT-4’s ability to detect Self-Contra. However, it is only able to detect Self-Contra with a 52.2% F1 score, much lower compared to 66.7% for humans. Our results indicate that current LLMs lack the robustness necessary for reliable reasoning and we emphasize the urgent need for establishing best practices in comprehensive reasoning evaluations beyond pure performance-based metrics.
On Retrieval Augmentation and the Limitations of Language Model Training
Ting-Rui Chiang | Xinyan Yu | Joshua Robinson | Ollie Liu | Isabelle Lee | Dani Yogatama
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Augmenting a language model (LM) with k-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility: the “softmax bottleneck.” We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, kNN retrieval augmentation consistently improves performance in this setting. Finally, to make kNN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.
2021
GupShup: Summarizing Open-Domain Code-Switched Conversations
Laiba Mehnaz | Debanjan Mahata | Uma Sushmitha Gunturi | Amardeep Kumar | Rakesh Gosangi | Riya Jain | Gauri Gupta | Isabelle Lee | Anish Acharya | Rajiv Ratn Shah
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Code-switching is the communication phenomenon where the speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multi-lingual communities worldwide. Therefore, it is essential to develop techniques for understanding and summarizing these conversations. Towards this objective, we introduce the task of abstractive summarization of Hindi-English (Hi-En) code-switched conversations. We also develop the first code-switched conversation summarization dataset - GupShup, which contains over 6,800 Hi-En conversations and their corresponding human-annotated summaries in English (En) and Hi-En. We present a detailed account of the entire data collection and annotation process. We analyze the dataset using various code-switching statistics. We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation. Our results show that multi-lingual mBART and multi-view seq2seq models obtain the best performances on this new dataset. We also conduct an extensive qualitative analysis to provide insight into the models and some of their shortcomings.