Ping Jian (鉴萍) - ACL Anthology

Ping Jian

Also published as: 萍鉴

2026

Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions
Zhongbin Guo | Zhen Yang | Yushan Li | Xinyue Zhang | Wenyu Gao | Jiacheng Wang | Chengzhi Li | Xiangrui Liu | Ping Jian
Findings of the Association for Computational Linguistics: ACL 2026

Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce **SiT-Bench**, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveals that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents.

Where CoT Reasoning Commits: Entropy Traces Identify Interpretable Attention Heads
Tianhe Zhang | Yonghong Deng | Ping Jian | Zhen Yang | Boyang Wang | Xinyue Zhang
Findings of the Association for Computational Linguistics: ACL 2026

While LLMs demonstrate impressive reasoning capabilities, their internal decision dynamics remain opaque. To render these process interpretable and intervenable, we propose Dynamic Entropy Tracing, a mechanism-aware framework that interprets the evolving "choice state" of attention heads during CoT generation through stepwise head-wise option-logit and entropy tracing. Our analysis reveals distinct functional behaviors at attention heads: Steadfast Heads, characterized by consistently low entropy and producing a sharp, option-selective logit pattern with a stable top choice, and Wavering Heads, characterized by consistently high entropy and producing flat or oscillatory option logits without a persistent winner. Leveraging these traces, we identify a set of intervention targets and perform Selective Head Fine-Tuning, updating solely these selected heads against a frozen backbone. Experiments across the LLaMA and Qwen families reveal a striking plasticity hierarchy: fine-tuning just 30 Wavering Heads recovers over 98% of the performance achieved by full-parameter tuning, and in some settings modestly exceeds it. In contrast, intervening on Steadfast Heads yields much less gains. Our findings translate process-level mechanistic observables into a principled criterion for selective fine-tuning, offering a fundamental insight: the most effective tuning knobs are not the components that signal the final decision, but those that retain uncertainty, and thus plasticity, during its formation.

Temporal Token Matters: Investigating and Interpreting the Consistency of Temporal Ordering in Large Language Models
Zhen Yang | Xinyue Zhang | Ping Jian | Chengzhi Li | Zhongbin Guo | Jiaping Feng | Wenpeng Lu
Findings of the Association for Computational Linguistics: ACL 2026

Despite the remarkable performance across numerous tasks, Large Language Models (LLMs) still exhibit notable deficiencies in temporal reasoning, even in simple event ordering tasks. For instance, a slight alteration in the temporal phrasing of the question (e.g., changing "Is event A before B?” to "Is event A after B?") can lead LLMs to hallucinate and produce inconsistent answers, reflecting a lack of robust temporal reasoning. Although many prior studies have focused on benchmarking and improving the temporal reasoning ability of LLMs, little is known about the intrinsic mechanisms within LLMs when performing temporal reasoning. In this work, we investigate the mechanistic interpretability of temporal ordering within event temporal reasoning through a structured "Identify-Interpret-Verify” pipeline. We first employ path patching to identify a sparse subset of attention heads that are causally responsible for reasoning outcomes. Detailed pattern analysis reveals that these key heads specialize in attending to either temporal keywords (semantic cues) or structural delimiters (syntactic cues). Furthermore, we rigorously validate the observed mechanism through comprehensive intervention-based experiments, ranging from head ablation to targeted attention modulation. We demonstrate that dynamically modulating the attention of these specific heads can robustly enhance model performance, which serves as strong empirical evidence that our identified mechanism faithfully captures the internal logic of temporal ordering in LLMs.

How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
Zhen Yang | Ping Jian | Zhongbin Guo | Zuming Zhang | Chengzhi Li | Yonghong Deng | Xinyue Zhang | Wenpeng Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs still remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual description of viewpoint rotation and observation over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while human can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with corresponding observation, resulting in a hallucination in final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities.

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Greg Durrett | Ping Jian
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

2025

Option Symbol Matters: Investigating and Mitigating Multiple-Choice Option Symbol Bias of Large Language Models
Zhen Yang | Ping Jian | Chengzhi Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Multiple-Choice Question Answering (MCQA) is a widely used task in the evaluation of Large Language Models (LLMs). In this work, we reveal that current LLMs’ performance in MCQA could be heavily influenced by the choice of option symbol sets, due to the option symbol bias. That is, when altering only the option symbols (e.g., A/B/C/D→i/ii/iii/iv), the results could vary sharply, leading to a margin of approximately 10% in accuracy. To uncover the mechanisms behind this, we investigate the internal components of LLMs from a causal perspective. By measuring the causal effects, we identify a small subset of attention heads responsible for the symbol bias. Subsequently, we interpret these key components in a human-understandable way, showing that attention heads with higher causal effects are more likely to focus on only option symbols, while those with lower causal effects tend to distribute their attention across the content of questions and options. It also motivates us to pursue debiasing based on the causal effects. Specifically, to mitigate such bias, we propose a tuning-free, causal effect driven debiasing method which intervenes the activations of identified components according to their causal effects, with stronger interventions corresponding to higher causal effects. Experimental results demonstrate that the proposed method not only alleviates aforementioned bias, but also improves the MCQA performance of LLMs.

Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning
Zhen Yang | Ping Jian | Chengzhi Li | Chenxu Wang | Xinyue Zhang | Wenpeng Lu
Findings of the Association for Computational Linguistics: EMNLP 2025

With the widespread applications of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we always expect LLMs to be honest, positive, harmless, etc. And LLMs appear to be capable of generating the desired outputs after the alignment tuning process, such as the preference tuning via reinforcement learning from human feedback (RLHF). However, it also raises a question about **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond the generation of positive outputs?** In this work, we start by investigating this question from the token distribution perspective. Our findings reveal that compared to the unaligned versions, LLMs after alignment exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction of positives and negatives after alignment. Meanwhile, it also motivates us to achieve alignment by directly constructing such value distinction, thus alleviating the excessive reliance on computational resources required by training-time alignment. Specifically, we propose a representation editing method that intervenes the last hidden representation by amplifying the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment, but also requires fewer computational resources compared to training-time alignment methods

Memory or Reasoning? Explore How LLMs Compute Mixed Arithmetic Expressions
Chengzhi Li | Heyan Huang | Ping Jian | Zhen Yang | Chenxu Wang | Yifan Wang
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) can solve complex multi-step math reasoning problems, but little is known about how these computations are implemented internally. Many recent studies have investigated the mechanisms of LLMs on simple arithmetic tasks (e.g., a+b, a× b), but how LLMs solve mixed arithmetic tasks still remains unexplored. This gap highlights the limitation of these findings in reflecting real-world scenarios. In this work, we take a step further to explore how LLMs compute mixed arithmetic expressions. We find that LLMs follow a similar workflow to mixed arithmetic calculations: first parsing the complete expression, then using attention heads to aggregate information to the last token position for result generation, without step-by-step reasoning at the token dimension. However, **for some specific expressions, the model generates the final result depends on the generation of intermediate results at the last token position, which is similar to human thinking.** Furthermore, we propose a **C**ausal **E**ffect **D**riven **F**ine-tuning method (CEDF) to adaptively enhance the identified key components used to execute mixed arithmetic calculations to improve LLMs reasoning ability.

2024

Improving Implicit Discourse Relation Recognition with Semantics Confrontation
Mingyang Cai | Zhen Yang | Ping Jian
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Implicit Discourse Relation Recognition (IDRR), which infers discourse logical relations without explicit connectives, is one of the most challenging tasks in natural language processing (NLP). Recently, pre-trained language models (PLMs) have yielded impressive results across numerous NLP tasks, but their performance still remains unsatisfactory in IDRR. We argue that prior studies have not fully harnessed the potential of PLMs, thereby resulting in a mixture of logical semantics, which determine the logical relations between discourse arguments, and general semantics, which encapsulate the non-logical contextual aspects (detailed in Sec.1). Such a mixture would inevitably compromise the logic reasoning ability of PLMs. Therefore, we propose a novel method that trains the PLMs through two semantics enhancers to implicitly differentiate logical and general semantics, ultimately achieving logical semantics enhancement. Due to the characteristic of PLM in word representation learning, these two semantics enhancers will inherently confront with each other, facilitating an augmentation of logical semantics by disentangling them from general semantics. The experimental results on PDTB 2.0 dataset show that the confrontation approach exceeds our baseline by 3.81% F1 score, and the effectiveness of the semantics confrontation method is validated by comprehensive ablation experiments.

Effective Integration of Text Diffusion and Pre-Trained Language Models with Linguistic Easy-First Schedule
Yimin Ou | Ping Jian
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Diffusion models have become a powerful generative modeling paradigm, achieving great success in continuous data patterns. However, the discrete nature of text data results in compatibility issues between continuous diffusion models (CDMs) and pre-trained language models (PLMs). That is, the performance of diffusion models even degrades when combined with PLMs. To alleviate this issue, we propose to utilize a pre-trained decoder to convert the denoised embedding vectors into natural language instead of using the widely used rounding operation. In this way, CDMs can be more effectively combined with PLMs. Additionally, considering that existing noise schedules in text diffusion models do not take into account the linguistic differences among tokens, which violates the easy-first policy for text generation, we propose a linguistic easy-first schedule that incorporates the measure of word importance, conforming to easy-first-generation linguistic features and bringing about improved generation quality. Experiment results on the E2E dataset and five controllable tasks show that our approach can combine the merits of CDMs and PLMs, significantly outperforming other diffusion-based models.

2023

Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition
Chenxu Wang | Ping Jian | Mu Huang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Implicit Discourse Relation Recognition (IDRR), which infers discourse relations without the help of explicit connectives, is still a crucial and challenging task for discourse parsing. Recent works tend to exploit the hierarchical structure information from the annotated senses, which demonstrate enhanced discourse relation representations can be obtained by integrating sense hierarchy. Nevertheless, the performance and robustness for IDRR are significantly constrained by the availability of annotated data. Fortunately, there is a wealth of unannotated utterances with explicit connectives, that can be utilized to acquire enriched discourse relation features. In light of such motivation, we propose a Prompt-based Logical Semantics Enhancement (PLSE) method for IDRR. Essentially, our method seamlessly injects knowledge relevant to discourse relation into pre-trained language models through prompt-based connective prediction. Furthermore, considering the prompt-based connective prediction exhibits local dependencies due to the deficiency of masked language model (MLM) in capturing global semantics, we design a novel self-supervised learning objective based on mutual information maximization to derive enhanced representations of logical semantics for IDRR. Experimental results on PDTB 2.0 and CoNLL16 datasets demonstrate that our method achieves outstanding and consistent performance against the current state-of-the-art models.

基于词频效应控制的神经机器翻译用词多样性增强方法(Improving Word-level Diversity in Neural Machine Translation by Controlling the Effects of Word Frequency)
Xuewen Shi (史学文) | Ping Jian (鉴萍) | Yikun Tang (唐翼琨) | Heyan HUang (黄河燕)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“通过最大似然估计优化的神经机器翻译(NMT)容易出现不可最大化的标记或低频词精度差等问题,这会导致生成的翻译缺乏词级别的多样性。词频在训练数据上的不均衡分布是造成上述现象的原因之一。本文旨在通过限制词频对 NMT 解码时估计概率的影响来缓解上述问题。具体地,我们采用了基于因果推断理论的半同胞回归去噪框架,并结合本文提出的自适应去噪系数来控制词频对模型估计概率的影响,以获得更准确的模型估计概率,并丰富 NMT 译文用词的多样性。本文的实验在四个代表不同资源规模的翻译任务上进行,分别是维吾尔语-汉语、汉语-英语、英语-德语和英语-法语。实验结果表明,本文所提出的方法在提升 NMT 译文词级别多样性的同时,不会损害译文的质量。另外,本文提出的方法还具有模型无关、可解释性强等优点。”

2021

Context Tracking Network: Graph-based Context Modeling for Implicit Discourse Relation Recognition
Yingxue Zhang | Fandong Meng | Peng Li | Ping Jian | Jie Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Implicit discourse relation recognition (IDRR) aims to identify logical relations between two adjacent sentences in the discourse. Existing models fail to fully utilize the contextual information which plays an important role in interpreting each local sentence. In this paper, we thus propose a novel graph-based Context Tracking Network (CT-Net) to model the discourse context for IDRR. The CT-Net firstly converts the discourse into the paragraph association graph (PAG), where each sentence tracks their closely related context from the intricate discourse through different types of edges. Then, the CT-Net extracts contextual representation from the PAG through a specially designed cross-grained updating mechanism, which can effectively integrate both sentence-level and token-level contextual semantics. Experiments on PDTB 2.0 show that the CT-Net gains better performance than models that roughly model the context.

2020

Intra-Correlation Encoding for Chinese Sentence Intention Matching
Xu Zhang | Yifeng Li | Wenpeng Lu | Ping Jian | Guoqiang Zhang
Proceedings of the 28th International Conference on Computational Linguistics

Sentence intention matching is vital for natural language understanding. Especially for Chinese sentence intention matching task, due to the ambiguity of Chinese words, semantic missing or semantic confusion are more likely to occur in the encoding process. Although the existing methods have enriched text representation through pre-trained word embedding to solve this problem, due to the particularity of Chinese text, different granularities of pre-trained word embedding will affect the semantic description of a piece of text. In this paper, we propose an effective approach that combines character-granularity and word-granularity features to perform sentence intention matching, and we utilize soft alignment attention to enhance the local information of sentences on the corresponding levels. The proposed method can capture sentence feature information from multiple perspectives and correlation information between different levels of sentences. By evaluating on BQ and LCQMC datasets, our model has achieved remarkable results, and demonstrates better or comparable performance with BERT-based models.

2019

Improving Neural Machine Translation by Achieving Knowledge Transfer with Sentence Alignment Learning
Xuewen Shi | Heyan Huang | Wenguan Wang | Ping Jian | Yi-Kun Tang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Neural Machine Translation (NMT) optimized by Maximum Likelihood Estimation (MLE) lacks the guarantee of translation adequacy. To alleviate this problem, we propose an NMT approach that heightens the adequacy in machine translation by transferring the semantic knowledge learned from bilingual sentence alignment. Specifically, we first design a discriminator that learns to estimate sentence aligning score over translation candidates, and then the learned semantic knowledge is transfered to the NMT model under an adversarial learning framework. We also propose a gated self-attention based encoder for sentence embedding. Furthermore, an N-pair training loss is introduced in our framework to aid the discriminator in better capturing lexical evidence in translation candidates. Experimental results show that our proposed method outperforms baseline NMT models on Chinese-to-English and English-to-German translation tasks. Further analysis also indicates the detailed semantic knowledge transfered from the discriminator to the NMT model.

Induction Networks for Few-Shot Text Classification
Ruiying Geng | Binhua Li | Yongbin Li | Xiaodan Zhu | Ping Jian | Jian Sun
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Text classification tends to struggle when data is deficient or when it needs to adapt to unseen classes. In such challenging scenarios, recent studies have used meta-learning to simulate the few-shot task, in which new queries are compared to a small support set at the sample-wise level. However, this sample-wise comparison may be severely disturbed by the various expressions in the same class. Therefore, we should be able to learn a general representation of each class in the support set and then compare it to new queries. In this paper, we propose a novel Induction Network to learn such a generalized class-wise representation, by innovatively leveraging the dynamic routing algorithm in meta-learning. In this way, we find the model is able to induce and generalize better. We evaluate the proposed model on a well-studied sentiment classification dataset (English) and a real-world dialogue intent classification dataset (Chinese). Experiment results show that on both datasets, the proposed model significantly outperforms the existing state-of-the-art approaches, proving the effectiveness of class-wise generalization in few-shot text classification.

2017

QLUT at SemEval-2017 Task 2: Word Similarity Based on Word Embedding and Knowledge Base
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Ping Jian | Shumin Shi | Heyan Huang
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper shows the details of our system submissions in the task 2 of SemEval 2017. We take part in the subtask 1 of this task, which is an English monolingual subtask. This task is designed to evaluate the semantic word similarity of two linguistic items. The results of runs are assessed by standard Pearson and Spearman correlation, contrast with official gold standard set. The best performance of our runs is 0.781 (Final). The techniques of our runs mainly make use of the word embeddings and the knowledge-based method. The results demonstrate that the combined method is effective for the computation of word similarity, while the word embeddings and the knowledge-based technique, respectively, needs more deeply improvement in details.

BIT at SemEval-2017 Task 1: Using Semantic Information Space to Evaluate Semantic Textual Similarity
Hao Wu | Heyan Huang | Ping Jian | Yuhang Guo | Chao Su
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper presents three systems for semantic textual similarity (STS) evaluation at SemEval-2017 STS task. One is an unsupervised system and the other two are supervised systems which simply employ the unsupervised one. All our systems mainly depend on the (SIS), which is constructed based on the semantic hierarchical taxonomy in WordNet, to compute non-overlapping information content (IC) of sentences. Our team ranked 2nd among 31 participating teams by the primary score of Pearson correlation coefficient (PCC) mean of 7 tracks and achieved the best performance on Track 1 (AR-AR) dataset.

2016

Discourse Relation Sense Classification Systems for CoNLL-2016 Shared Task
Ping Jian | Xiaohan She | Chenwei Zhang | Pengcheng Zhang | Jian Feng
Proceedings of the CoNLL-16 shared task

2014

Introduction to BIT Chinese Spelling Correction System at CLP 2014 Bake-off
Min Liu | Ping Jian | Heyan Huang
Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing

2011

Unsupervised Word Sense Disambiguation Using Neighborhood Knowledge
Heyan Huang | Zhizhuo Yang | Ping Jian
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

2009

Layer-Based Dependency Parsing
Ping Jian | Chengqing Zong
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

Co-authors

Yonghong Deng 2

Yuhang Guo (郭宇航) 1

Shumin Shi (史树敏) 1

Jiacheng Wang 1

Chenwei Zhang 1

Guoqiang Zhang 1

Pengcheng Zhang 1

Yingxue Zhang 1

Chengqing Zong 1

Venues