2025
Enhancing Multi-modal Relation Extraction with Reinforcement Learning Guided Graph Diffusion Framework
Rui Yang | Rajiv Gupta
Proceedings of the 31st International Conference on Computational Linguistics
With the massive growth of multi-modal information such as text, images, and other data, how to analyze and align these data has become an important problem. In this work, we introduce a new framework based on Reinforcement Learning Guided Graph Diffusion to address the complexity of multi-modal graphs and to enhance interpretability, making the alignment of multi-modal information easier to understand. Our approach leverages pre-trained models to encode multi-modal data into scene graphs and combines them into a cross-modal graph (CMG). We design a reinforcement learning agent that filters nodes and modifies edges based on its observation of the graph state, dynamically adjusting the graph structure to provide coarse-grained refinement. We then iteratively optimize edge weights and node selection to achieve fine-grained adjustment. We conduct extensive experiments on multi-modal relation extraction datasets and show that our model significantly outperforms existing multi-modal methods such as MEGA and MKGFormer. We also conduct an ablation study to demonstrate the importance of each key component, showing that performance drops significantly when any of them is removed. Our method uses reinforcement learning to better mine latent cross-modal relevance, and its adjustments based on graph structure make it more interpretable.
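The sketch below illustrates the coarse-to-fine refinement loop the abstract describes: build a cross-modal graph, let an agent prune low-value nodes (coarse-grained), then iteratively re-weight edges (fine-grained). It is not the paper's implementation; the toy node features, the heuristic policy_score stub standing in for the learned RL policy, and the similarity-driven edge update are all illustrative assumptions.

```python
# Minimal sketch of RL-guided cross-modal graph refinement.
# All names (build_cmg, policy_score, coarse_refine, fine_refine)
# are hypothetical; the real agent's decisions come from a trained
# policy with task reward, not the fixed heuristic used here.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def build_cmg():
    """Build a toy cross-modal graph (CMG): text and image scene-graph
    nodes joined by candidate alignment edges with initial weights."""
    G = nx.Graph()
    for i in range(4):
        G.add_node(f"txt_{i}", modality="text", feat=rng.normal(size=8))
    for j in range(4):
        G.add_node(f"img_{j}", modality="image", feat=rng.normal(size=8))
    for i in range(4):
        for j in range(4):
            G.add_edge(f"txt_{i}", f"img_{j}", weight=float(rng.uniform()))
    return G

def policy_score(G, node):
    """Stub for the agent's value of keeping a node given the graph
    state; here simply the total weight of its incident edges."""
    return sum(d["weight"] for _, _, d in G.edges(node, data=True))

def coarse_refine(G, keep_ratio=0.75):
    """Coarse-grained step: the agent filters out low-value nodes,
    changing the graph structure."""
    scored = sorted(G.nodes, key=lambda n: policy_score(G, n))
    for node in scored[: int(len(scored) * (1 - keep_ratio))]:
        G.remove_node(node)

def fine_refine(G, steps=10, lr=0.1):
    """Fine-grained step: iteratively relax edge weights toward the
    feature similarity of the cross-modal node pair they connect."""
    for _ in range(steps):
        for u, v, d in G.edges(data=True):
            fu, fv = G.nodes[u]["feat"], G.nodes[v]["feat"]
            sim = float(fu @ fv) / (np.linalg.norm(fu) * np.linalg.norm(fv))
            d["weight"] += lr * (sim - d["weight"])

G = build_cmg()
coarse_refine(G)
fine_refine(G)
# Inspect the strongest surviving cross-modal alignments.
print(sorted(G.edges(data="weight"), key=lambda e: -e[2])[:3])
```

In this toy version the two stages mirror the abstract's split: node filtering is a discrete structural action, while edge re-weighting is a continuous iterative adjustment over the retained subgraph.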