Tao Ji - ACL Anthology

Tao Ji

2025

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li | Jiajun Sun | Guodong Zheng | Xiaoran Fan | Yujiong Shen | Yi Lu | Zhiheng Xi | Yuming Yang | Wenming Tan | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2025

Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model’s over-susceptibility to image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable adversarial training method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.

ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Junjie Ye | Guanyu Li | SongYang Gao | Caishuang Huang | Yilong Wu | Sixian Li | Xiaoran Fan | Shihan Dou | Tao Ji | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 31st International Conference on Computational Linguistics

Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs’ tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.

The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Yufang Liu | Yao Du | Tao Ji | Jianing Wang | Yang Liu | Yuanbin Wu | Aimin Zhou | Mengdi Zhang | Xunliang Cai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning.

DVAGen: Dynamic Vocabulary Augmented Generation
Wei Du | Nuowei Liu | Jie Wang | Jiahao Kuang | Tao Ji | Xiaoling Wang | Yuanbin Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.

Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji | Bin Guo | Yuanbin Wu | Qipeng Guo | Lixing Shen | Zhan Chen | Xipeng Qiu | Qi Zhang | Tao Gui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (**MHA2MLA**), which includes two key components: for *partial-RoPE*, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores, for *low-rank approximation*, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.6% to 1%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 1% drop in LongBench performance. Our source code is publicly available at https://github.com/JT-Ushio/MHA2MLA.

2024

Generation with Dynamic Vocabulary
Yanting Liu | Tao Ji | Changzhi Sun | Yuanbin Wu | Xiaoling Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce a new dynamic vocabulary for language models. It can involve arbitrary text spans during generation. These text spans act as basic generation bricks, akin to tokens in the traditional static vocabularies. We show that, the ability to generate multi-tokens atomically improve both generation quality and efficiency (compared to the standard language model, the MAUVE metric is increased by 25%, the latency is decreased by 20 %). The dynamic vocabulary can be deployed in a plug-and-play way, thus is attractive for various downstream applications. For example, we demonstrate that dynamic vocabulary can be applied to different domains in a training-free manner. It also helps to generate reliable citations in question answering tasks (substantially enhancing citation results without compromising answer accuracy).

LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Yi Lu | Xin Zhou | Wei He | Jun Zhao | Tao Ji | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention’s quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM’s long context ability by unlocking multi-head attention’s untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently and fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task, verifying LongHeads’ efficacy in extending the usable context window for existing models.

The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks, while FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we furthermore construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. The code and dataset will be made available upon publication.

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
Yufang Liu | Tao Ji | Changzhi Sun | Yuanbin Wu | Aimin Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. To address this, we propose a counterfactual data augmentation method by creating negative samples with a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations for CLIP model, and we show the the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.

Length Generalization of Causal Transformers without Position Encoding
Jie Wang | Tao Ji | Yuanbin Wu | Hang Yan | Tao Gui | Qi Zhang | Xuanjing Huang | Xiaoling Wang
Findings of the Association for Computational Linguistics: ACL 2024

Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE’s generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads’ best temperature hyper-parameters, which substantially expands NoPE’s context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible

AntLM: Bridging Causal and Masked Language Models
Xinru Yu | Bin Guo | Shiwei Luo | Jie Wang | Tao Ji | Yuanbin Wu
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two mainstream learning paradigms based on Transformer networks, specifically the Decoder-only and Encoder-only architectures. The strengths of each paradigm in downstream tasks have shown a mix of advantages and disadvantages. In the past BabyLM Challenge 2023, although the MLM paradigm achieved the best average performance, the CLM paradigm demonstrated significantly faster convergence rates. For the BabyLM Challenge 2024, we propose a novel language modeling paradigm named AntLM, which integrates both CLM and MLM to leverage the advantages of these two classic paradigms. We chose the strict-small track and conducted experiments on two foundation models: BabyLlama, representing CLM, and LTG-BERT, representing MLM. During the training process for specific foundation models, we alternate between applying CLM or MLM training objectives and causal or bidirectional attention masks. Experimental results show that combining the two pretraining objectives leverages their strengths, enhancing overall training performance. Under the same epochs, AntLM_BabyLlama improves Macro-average by 1%, and AntLM_LTG-BERT achieves a 2.2% increase over the baselines.

2023

Rehearsal-free Continual Language Learning via Efficient Parameter Isolation
Zhicheng Wang | Yufang Liu | Tao Ji | Xiaoling Wang | Yuanbin Wu | Congcong Jiang | Ye Chao | Zhencong Han | Ling Wang | Xu Shao | Wenqiu Zeng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study the problem of defying catastrophic forgetting when learning a series of language processing tasks. Compared with previous methods, we emphasize the importance of not caching history tasks’ data, which makes the problem more challenging. Our proposed method applies the parameter isolation strategy. For each task, it allocates a small portion of private parameters and learns them with a shared pre-trained model. To load correct parameters at testing time, we introduce a simple yet effective non-parametric method. Experiments on continual language learning benchmarks show that our method is significantly better than all existing no-data-cache methods, and is comparable (or even better) than those using historical data.

Typology Guided Multilingual Position Representations: Case on Dependency Parsing
Tao Ji | Yuanbin Wu | Xiaoling Wang
Findings of the Association for Computational Linguistics: ACL 2023

Recent multilingual models benefit from strong unified semantic representation models. However, due to conflict linguistic regularities, ignoring language-specific features during multilingual learning may suffer from negative transfer. In this work, we analyze the relationbetween a language’s position space and its typological characterization, and suggest deploying different position spaces for different languages. We develop a position generation network which combines prior knowledge from typology features and existing position vectors. Experiments on the multilingual dependency parsing task show that the learned position vectors exhibit meaningful hidden structures, and they can help achieving the best multilingual parsing results.

2022

Zero-Shot Event Detection Based on Ordered Contrastive Learning and Prompt-Based Prediction
Senhui Zhang | Tao Ji | Wendi Ji | Xiaoling Wang
Findings of the Association for Computational Linguistics: NAACL 2022

Event detection is a classic natural language processing task. However, the constantly emerging new events make supervised methods not applicable to unseen types. Previous zero-shot event detection methods either require predefined event types as heuristic rules or resort to external semantic analyzing tools. To overcome this weakness, we propose an end-to-end framework named Zero-Shot Event Detection Based on Ordered Contrastive Learning and Prompt-Based Prediction (ZEOP). By creatively introducing multiple contrastive samples with ordered similarities, the encoder can learn event representations from both instance-level and class-level, which makes the distinctions between different unseen types more significant. Meanwhile, we utilize the prompt-based prediction to identify trigger words without relying on external resources. Experiments demonstrate that our model detects events more effectively and accurately than state-of-the-art methods.

Explore Unsupervised Structures in Pretrained Models for Relation Extraction
Xi Yang | Tao Ji | Yuanbin Wu
Findings of the Association for Computational Linguistics: EMNLP 2022

Syntactic trees have been widely applied in relation extraction (RE). However, since parsing qualities are not stable on different text domains and a pre-defined grammar may not well fit the target relation schema, the introduction of syntactic structures sometimes fails to improve RE performances consistently. In this work, we study RE models with various unsupervised structures mined from pre-trained language models (e.g., BERT). We show that, similar to syntactic trees, unsupervised structures are quite informative for RE task: they are able to obtain competitive (even the best) performance scores on benchmark RE datasets (ACE05, WebNLG, SciERC). We also conduct detailed analyses on their abilities of adapting new RE domains and influence of noise links in those structures. The results suggest that unsupervised structures are reasonable alternatives of commonly used syntactic structures in relation extraction models.

2021

Word Reordering for Zero-shot Cross-lingual Structured Prediction
Tao Ji | Yong Jiang | Tao Wang | Zhongqiang Huang | Fei Huang | Yuanbin Wu | Xiaoling Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Adapting word order from one language to another is a key problem in cross-lingual structured prediction. Current sentence encoders (e.g., RNN, Transformer with position embeddings) are usually word order sensitive. Even with uniform word form representations (MUSE, mBERT), word order discrepancies may hurt the adaptation of models. In this paper, we build structured prediction models with bag-of-words inputs, and introduce a new reordering module to organizing words following the source language order, which learns task-specific reordering strategies from a general-purpose order predictor model. Experiments on zero-shot cross-lingual dependency parsing, POS tagging, and morphological tagging show that our model can significantly improve target language performances, especially for languages that are distant from the source language.

A Unified Encoding of Structures in Transition Systems
Tao Ji | Yong Jiang | Tao Wang | Zhongqiang Huang | Fei Huang | Yuanbin Wu | Xiaoling Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transition systems usually contain various dynamic structures (e.g., stacks, buffers). An ideal transition-based model should encode these structures completely and efficiently. Previous works relying on templates or neural network structures either only encode partial structure information or suffer from computation efficiency. In this paper, we propose a novel attention-based encoder unifying representation of all structures in a transition system. Specifically, we separate two views of items on structures, namely structure-invariant view and structure-dependent view. With the help of parallel-friendly attention network, we are able to encoding transition states with O(1) additional complexity (with respect to basic feature extractors). Experiments on the PTB and UD show that our proposed method significantly improves the test speed and achieves the best transition-based model, and is comparable to state-of-the-art methods.

2019

Graph-based Dependency Parsing with Graph Neural Networks
Tao Ji | Yuanbin Wu | Man Lan
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We investigate the problem of efficiently incorporating high-order features into neural graph-based dependency parsing. Instead of explicitly extracting high-order features from intermediate parse trees, we develop a more powerful dependency tree node representation which captures high-order information concisely and efficiently. We use graph neural networks (GNNs) to learn the representations and discuss several new configurations of GNN’s updating and aggregation functions. Experiments on PTB show that our parser achieves the best UAS and LAS on PTB (96.0%, 94.3%) among systems without using any external resources.

2018

AntNLP at CoNLL 2018 Shared Task: A Graph-Based Parser for Universal Dependency Parsing
Tao Ji | Yufang Liu | Yijun Wang | Yuanbin Wu | Man Lan
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe the graph-based dependency parser in our system (AntNLP) submitted to the CoNLL 2018 UD Shared Task. We use bidirectional lstm to get the word representation, then a bi-affine pointer networks to compute scores of candidate dependency edges and the MST algorithm to get the final dependency tree. From the official testing results, our system gets 70.90 LAS F1 score (rank 9/26), 55.92 MLAS (10/26) and 60.91 BLEX (8/26).

2017

A Fast and Lightweight System for Multilingual Dependency Parsing
Tao Ji | Yuanbin Wu | Man Lan
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present a multilingual dependency parser with a bidirectional-LSTM (BiLSTM) feature extractor and a multi-layer perceptron (MLP) classifier. We trained our transition-based projective parser in UD version 2.0 datasets without any additional data. The parser is fast, lightweight and effective on big treebanks. In the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, the official results show that the macro-averaged LAS F1 score of our system Mengest is 61.33%.

Co-authors

Caishuang Huang 2

Zhongqiang Huang 2

Congcong Jiang 1

Xipeng Qiu (邱锡鹏) 1

Zhicheng Wang 1

Hang Yan (航颜) 1

Junjie Ye (叶俊杰) 1

Guodong Zheng 1

Venues