Jinyue Feng


2025

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, so it scales with future LLM improvements. Indeed, our experiments on state-of-the-art LLMs show performance degrading to below random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
We present CausalLink, an innovative evaluation framework that interactively assesses thecausal reasoning skill to identify the correct intervention in conversational language models. Each CausalLink test case creates a hypothetical environment in which the language models are instructed to apply interventions to entities whose interactions follow predefined causal relations generated from controllable causal graphs. Our evaluation framework isolates causal capabilities from the confounding effects of world knowledge and semantic cues. We evaluate a series of LLMs in a scenario featuring movements of geometric shapes and discover that models start to exhibit reliable reasoning on two or three variables at the 14-billion-parameter scale. However, the performance of state-of-the-art models such as GPT4o degrades below random chance as the number of variables increases. We identify and analyze several key failure modes.

2020

Clinical prediction models often use structured variables and provide outcomes that are not readily interpretable by clinicians. Further, free-text medical notes may contain information not immediately available in structured variables. We propose a hierarchical CNN-transformer model with explicit attention as an interpretable, multi-task clinical language model, which achieves an AUROC of 0.75 and 0.78 on sepsis and mortality prediction, respectively. We also explore the relationships between learned features from structured and unstructured variables using projection-weighted canonical correlation analysis. Finally, we outline a protocol to evaluate model usability in a clinical decision support context. From domain-expert evaluations, our model generates informative rationales that have promising real-life applications.