Jingren Zhou


2023

pdf bib
ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models
Chenliang Li | He Chen | Ming Yan | Weizhou Shen | Haiyang Xu | Zhikai Wu | Zhicheng Zhang | Wenmeng Zhou | Yingda Chen | Chen Cheng | Hongzhu Shi | Ji Zhang | Fei Huang | Jingren Zhou
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equips LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with a customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent online demo, library are now publicly available.

2022

pdf bib
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li | Haiyang Xu | Junfeng Tian | Wei Wang | Ming Yan | Bin Bi | Jiabo Ye | He Chen | Guohai Xu | Zheng Cao | Ji Zhang | Songfang Huang | Fei Huang | Jingren Zhou | Luo Si
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large-scale pre-trained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from inefficiency and linguistic signal overwhelmed by long visual sequences in cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections.mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind

2021

pdf bib
Sketch and Refine: Towards Faithful and Informative Table-to-Text Generation
Peng Wang | Junyang Lin | An Yang | Chang Zhou | Yichang Zhang | Jingren Zhou | Hongxia Yang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Learning Relation Alignment for Calibrated Cross-modal Retrieval
Shuhuai Ren | Junyang Lin | Guangxiang Zhao | Rui Men | An Yang | Jingren Zhou | Xu Sun | Hongxia Yang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions. The neglect of such relation consistency impairs the contextualized representation of image-text pairs and hinders the model performance and the interpretability. In this paper, we first propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. In response, we present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions from the two modalities mutually via inter-modal alignment. The IAIS regularizer boosts the performance of prevailing models on Flickr30k and MS COCO datasets by a considerable margin, which demonstrates the superiority of our approach.