Jinzhong Ning

2026

E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition
Meng Zhang | Jinzhong Ning | Xiaolong Wu | Hongfei Lin | Yijia Zhang
Findings of the Association for Computational Linguistics: ACL 2026

Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state of the art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision.

2025

pdf bib abs

LLM-Driven Implicit Target Augmentation and Fine-Grained Contextual Modeling for Zero-Shot and Few-Shot Stance Detection
Yanxu Ji | Jinzhong Ning | Yijia Zhang | Zhi Liu | Hongfei Lin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Stance detection aims to identify the attitude expressed in text towards a specific target. Recent studies on zero-shot and few-shot stance detection focus primarily on learning generalized representations from explicit targets. However, these methods often neglect implicit yet semantically important targets and fail to adaptively adjust the relative contributions of text and target in light of contextual dependencies. To overcome these limitations, we propose a novel two-stage framework: First, a data augmentation framework named Hierarchical Collaborative Target Augmentation (HCTA) employs Large Language Models (LLMs) to identify and annotate implicit targets via Chain-of-Thought (CoT) prompting and multi-LLM voting, significantly enriching training data with latent semantic relations. Second, we introduce DyMCA, a Dynamic Multi-level Context-aware Attention Network, integrating a joint text-target encoding and a content-aware mechanism to dynamically adjust text-target contributions based on context. Experiments on the benchmark dataset demonstrate that our approach achieves state-of-the-art results, confirming the effectiveness of implicit target augmentation and fine-grained contextual modeling.

pdf bib abs

"中文语音实体关系三元组抽取任务(Chinese Speech Entity-Relation Triple Extraction Task, CSRTE)是第二十四届中国计算语言学大会中的一项技术评测,旨在从中文语音数据中自动识别并提取实体及其相互关系,构建结构化的语音关系三元组(头实体、关系、尾实体)。本任务的目标是提升中文语音关系三元组抽取的准确性与效率,增强模型在不同语境和复杂语音场景下的鲁棒性,实现从语音输入到文本三元组输出的全流程自动化处理。通过本次评测,有助于推动中文语音信息抽取技术的发展,促进语音与自然语言处理技术的深度融合,为智能应用提供更加丰富且精准的基础数据支持。此次评测共有257支队伍报名参赛,其中59支队伍提交了A榜成绩。成绩排名前15的队伍晋级A榜,并且表现突出的前朷支队伍提交了技术报告。"

2024

pdf bib abs

In recent years, with the vast and rapidly increasing amounts of spoken and textual data, Named Entity Recognition (NER) tasks have evolved into three distinct categories, i.e., text-based NER (TNER), Speech NER (SNER) and Multimodal NER (MNER). However, existing approaches typically require designing separate models for each task, overlooking the potential connections between tasks and limiting the versatility of NER methods. To mitigate these limitations, we introduce a new task named Integrated Multimodal NER (IMNER) to break the boundaries between different modal NER tasks, enabling a unified implementation of them. To achieve this, we first design a unified data format for inputs from different modalities. Then, leveraging the pre-trained MMSpeech model as the backbone, we propose an **I**ntegrated **M**ultimod**a**l **Ge**neration Framework (**IMAGE**), formulating the Chinese IMNER task as an entity-aware text generation task. Experimental results demonstrate the feasibility of our proposed IMAGE framework in the IMNER task. Our work in integrated multimodal learning in advancing the performance of NER may set up a new direction for future research in the field. Our source code is available at https://github.com/NingJinzhong/IMAGE4IMNER.

2023

pdf bib abs

OD-RTE: A One-Stage Object Detection Framework for Relational Triple Extraction
Jinzhong Ning | Zhihao Yang | Yuanyuan Sun | Zhizheng Wang | Hongfei Lin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The Relational Triple Extraction (RTE) task is a fundamental and essential information extraction task. Recently, the table-filling RTE methods have received lots of attention. Despite their success, they suffer from some inherent problems such as underutilizing regional information of triple. In this work, we treat the RTE task based on table-filling method as an Object Detection task and propose a one-stage Object Detection framework for Relational Triple Extraction (OD-RTE). In this framework, the vertices-based bounding box detection, coupled with auxiliary global relational triple region detection, ensuring that regional information of triple could be fully utilized. Besides, our proposed decoding scheme could extract all types of triples. In addition, the negative sampling strategy of relations in the training stage improves the training efficiency while alleviating the imbalance of positive and negative relations. The experimental results show that 1) OD-RTE achieves the state-of-the-art performance on two widely used datasets (i.e., NYT and WebNLG). 2) Compared with the best performing table-filling method, OD-RTE achieves faster training and inference speed with lower GPU memory usage. To facilitate future research in this area, the codes are publicly available at https://github.com/NingJinzhong/ODRTE.

2022

pdf bib abs

Chinese Named Entity Recognition (NER) has continued to attract research attention. However, most existing studies only explore the internal features of the Chinese language but neglect other lingual modal features. Actually, as another modal knowledge of the Chinese language, English contains rich prompts about entities that can potentially be applied to improve the performance of Chinese NER. Therefore, in this study, we explore the bilingual enhancement for Chinese NER and propose a unified bilingual interaction module called the Adapted Cross-Transformers with Global Sparse Attention (ACT-S) to capture the interaction of bilingual information. We utilize a model built upon several different ACT-Ss to integrate the rich English information into the Chinese representation. Moreover, our model can learn the interaction of information between bilinguals (inter-features) and the dependency information within Chinese (intra-features). Compared with existing Chinese NER methods, our proposed model can better handle entities with complex structures. The English text that enhances the model is automatically generated by machine translation, avoiding high labour costs. Experimental results on four well-known benchmark datasets demonstrate the effectiveness and robustness of our proposed model.

Co-authors

Zhi Liu 1

Bo Xu 1

Venues

Fix author