Shizhou Huang

2025

A Graph Interaction Framework on Relevance for Multimodal Named Entity Recognition with Multiple Images
Jiachen Zhao | Shizhou Huang | Xin Lin
Proceedings of the 31st International Conference on Computational Linguistics

Posts containing multiple images have significant research potential in Multimodal Named Entity Recognition. Previous methods determine whether images are related to named entities in the text through similarity computation, for example using CLIP. However, this approach is not effective in some cases and is not conducive to task transfer, especially in multi-image scenarios. To address this issue, we propose a graph interaction framework on relevance (GIFR) for Multimodal Named Entity Recognition with multiple images. Humans can distinguish whether an image is relevant to named entities, but this ability is difficult to model. Therefore, we propose using reinforcement learning based on human preference to integrate this human ability into the model so that it can determine whether an image-text pair is relevant, a judgment we refer to as relevance. To better leverage relevance, we construct a heterogeneous graph and introduce a graph transformer to enable information interaction. Experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance.

MRE-MI: A Multi-image Dataset for Multimodal Relation Extraction in Social Media Posts
Shizhou Huang | Bo Xu | Changqun Li | Yang Yu | Xin Alex Lin
Findings of the Association for Computational Linguistics: NAACL 2025

Despite recent advances in Multimodal Relation Extraction (MRE), existing datasets and approaches primarily focus on single-image scenarios, overlooking the prevalent real-world cases where relationships are expressed through multiple images alongside text. To address this limitation, we present MRE-MI, a novel human-annotated dataset that includes both multi-image and single-image instances for relation extraction. Beyond dataset creation, we establish comprehensive baselines and propose a simple model named Global and Local Relevance-Modulated Attention Model (GLRA) to address the new challenges in multi-image scenarios. Our extensive experiments reveal that incorporating multiple images substantially improves relation extraction in multi-image scenarios. Furthermore, GLRA achieves state-of-the-art results on MRE-MI, demonstrating its effectiveness. The datasets and source code can be found at https://github.com/JinFish/MRE-MI.

2024

Hypernetwork-Assisted Parameter-Efficient Fine-Tuning with Meta-Knowledge Distillation for Domain Knowledge Disentanglement
Changqun Li | Linlin Wang | Xin Lin | Shizhou Huang | Liang He
Findings of the Association for Computational Linguistics: NAACL 2024

Domain adaptation from labeled source domains to the target domain is important in practical summarization scenarios. However, the key challenge is domain knowledge disentanglement. In this work, we explore how to disentangle domain-invariant knowledge from source domains while learning specific knowledge of the target domain. Specifically, we propose a hypernetwork-assisted encoder-decoder architecture with parameter-efficient fine-tuning. It leverages a hypernetwork instruction learning module to generate domain-specific parameters from the encoded inputs accompanied by task-related instruction. Further, to better disentangle and transfer knowledge from source domains to the target domain, we introduce a meta-knowledge distillation strategy to build a meta-teacher model that captures domain-invariant knowledge across multiple domains and use it to transfer knowledge to students. Experiments on three dialogue summarization datasets show the effectiveness of the proposed model. Human evaluations also show the superiority of our model with regard to the summary generation quality.

MGCL: Multi-Granularity Clue Learning for Emotion-Cause Pair Extraction via Cross-Grained Knowledge Distillation
Yang Yu | Xin Alex Lin | Changqun Li | Shizhou Huang | Liang He
Findings of the Association for Computational Linguistics: EMNLP 2024

Emotion-cause pair extraction (ECPE) aims to identify emotion clauses and their corresponding cause clauses within a document. Traditional methods often rely on coarse-grained clause-level annotations, which can overlook valuable fine-grained clues. To address this issue, we propose Multi-Granularity Clue Learning (MGCL), a novel approach designed to capture fine-grained emotion-cause clues from a weakly-supervised perspective efficiently. In MGCL, a teacher model is leveraged to give sub-clause clues without needing fine-grained annotated labels and guides a student model to identify clause-level emotion-cause pairs. Furthermore, we explore domain-invariant extra-clause clues under the teacher model’s advice to enhance the learning process. Experimental results on the benchmark dataset demonstrate that our method achieves state-of-the-art performance while offering improved interpretability.

MNER-MI: A Multi-image Dataset for Multimodal Named Entity Recognition in Social Media
Shizhou Huang | Bo Xu | Changqun Li | Jiabo Ye | Xin Lin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recently, multimodal named entity recognition (MNER) has emerged as a vital research area within named entity recognition. However, current MNER datasets and methods are predominantly based on text and a single accompanying image, leaving a significant research gap in MNER scenarios involving multiple images. To address the critical research gap and enhance the scope of MNER for real-world applications, we propose a novel human-annotated MNER dataset with multiple images called MNER-MI. Additionally, we construct a dataset named MNER-MI-Plus, derived from MNER-MI, to ensure its generality and applicability. Based on these datasets, we establish a comprehensive set of strong and representative baselines and we further propose a simple temporal prompt model with multiple images to address the new challenges in multi-image scenarios. We have conducted extensive experiments to demonstrate that considering multiple images provides a significant improvement over a single image and can offer substantial benefits for MNER. Furthermore, our proposed method achieves state-of-the-art results on both MNER-MI and MNER-MI-Plus, demonstrating its effectiveness. The datasets and source code can be found at https://github.com/JinFish/MNER-MI.

2022

Different Data, Different Modalities! Reinforced Data Splitting for Effective Multimodal Information Extraction from Social Media Posts
Bo Xu | Shizhou Huang | Ming Du | Hongya Wang | Hui Song | Chaofeng Sha | Yanghua Xiao
Proceedings of the 29th International Conference on Computational Linguistics

Recently, multimodal information extraction from social media posts has gained increasing attention in the natural language processing community. Despite their success, current approaches overestimate the significance of images. In this paper, we argue that different social media posts should consider different modalities for multimodal information extraction. Multimodal models cannot always outperform unimodal models. Some posts are more suitable for the multimodal model, while others are more suitable for the unimodal model. Therefore, we propose a general data splitting strategy to divide the social media posts into two sets so that these two sets can achieve better performance under the information extraction models of the corresponding modalities. Specifically, for an information extraction task, we first propose a data discriminator that divides social media posts into a multimodal and a unimodal set. Then we feed these sets into the corresponding models. Finally, we combine the results of these two models to obtain the final extraction results. Due to the lack of explicit knowledge, we use reinforcement learning to train the data discriminator. Experiments on two different multimodal information extraction tasks demonstrate the effectiveness of our method. The source code of this paper can be found at https://github.com/xubodhu/RDS.
