ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and exhibit planning behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equip LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with a customizable engine design that supports model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, we propose a comprehensive framework spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant for the ModelScope Community built on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library\footnote{https://github.com/modelscope/modelscope-agent} and online demo\footnote{https://modelscope.cn/studios/damo/ModelScopeGPT/summary} are now publicly available.


Introduction
Large language models (OpenAI, 2022, 2023; Touvron et al., 2023; Chowdhery et al., 2022) have gradually become common AI assistants that demonstrate great potential in comprehending human intentions, performing complex reasoning tasks, and enabling content creation. Despite the rapid advancements of open-source LLMs, e.g., LLaMA (Touvron et al., 2023) and ChatGLM (THUDM, 2023), they still remain limited in performing complex tasks, such as following user instructions to use external tools and capturing up-to-date information.
To further unleash the power of LLMs for real-world practical applications, a rising trend of current research (Schick et al., 2023; Shen et al., 2023; Yang et al., 2023; Qin et al., 2023; Patil et al., 2023) has begun to equip LLMs with tool-use abilities towards building an AI agent. These include HuggingGPT (Shen et al., 2023), Visual-ChatGPT (Wu et al., 2023) and Gorilla (Patil et al., 2023) for connecting with HuggingFace models, and ToolAlpaca (Tang et al., 2023) and ToolLLaMA (Qin et al., 2023) for using massive common APIs such as weather forecasts and search engines. These methods either directly rely on closed-source counterparts like ChatGPT or focus on certain types of API tools. Recently, there have also been public releases of AI agents, such as Auto-GPT, LangChain and Transformers Agent (Huggingface, 2023), which enable LLMs, such as ChatGPT or GPT-4, to use tools and solve complex AI tasks. However, these agents are mainly built with closed-source LLMs, and how to build a customizable agent system with open-source LLMs remains largely unexplored.
In this work, we present ModelScope-Agent, a general and customizable agent system for real-world applications, based on open-source LLMs as controllers. ModelScope is a public ML community that seeks to bring together the most advanced machine learning models from the AI community and streamline the process of leveraging AI models in real-world applications. ModelScope-Agent provides a flexible and user-friendly system library, with a customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. It features an LLM-centric system design, which includes open-source LLMs as the core controller, which further interacts with a tool-use module and a memory module to accomplish complex tasks. At the core of ModelScope-Agent, the library supports flexible selection of and training on various open-source LLMs, such as LLaMA (Touvron et al., 2023), ChatGLM (THUDM, 2023), ChatPLUG (Tian et al., 2023) and other customized LLMs in ModelScope. For tool use, ModelScope-Agent provides a default tool library, which supports diverse AI model APIs across the NLP, CV, Audio and Multi-modal fields, as well as massive common APIs such as search engines. It also supports registering new self-defined API plugins and automatic API retrieval from the large tool library. It is easy for users to customize their most appropriate LLMs, local API tools and functions to develop real-world applications. Moreover, a memory module is also introduced to better store and manage the system message, user history, in-context examples, tool messages and localized knowledge.
To enable the open-source LLMs to better control the whole agent system, we further propose a comprehensive framework of tool-use data collection, customized model training, evaluation and deployment. Notably, we release a comprehensive tool-enhanced dataset, MSAgent-Bench, which consists of 598k dialogues with various API categories, multi-turn API calls, API-oriented QA, and API-agnostic instructions in both English and Chinese. A simple training strategy of Weighted LM, which enhances the training of the generation of API names and parameters, is used to better ensure the correctness of API calls. Besides, an evaluation framework is also supported in our library to examine the tool-use abilities of the trained models in different aspects. Furthermore, we apply ModelScope-Agent in a real-world application of the ModelScope Community, namely ModelScopeGPT, which is able to connect open-source LLMs with more than 1000 public AI models and access localized community knowledge in ModelScope for community QA.
To summarize, ModelScope-Agent is a general and customizable agent system designed for developers to harness the power of open-source LLMs. The library targets the following goals:
• Agent based on Open-Source LLMs: the controller of ModelScope-Agent can be flexibly selected from open-source LLMs that are optimized through our agent training framework.
• Support and Customization of Diverse Tools: dozens of diverse model APIs and common APIs are provided by default. The library supports registering new self-defined APIs and automatic API retrieval from the toolset.
• Customization of Applications: ModelScope-Agent can be flexibly applied in various industry applications. The agent and training framework are documented, describing their usage, construction and optimization.
ModelScope-Agent is in continual development by the engineers at ModelScope and is released under the Apache 2.0 license. Full documentation is available through the project website.

The ModelScope Agent
ModelScope-Agent is designed to facilitate developers in building customizable agent systems based on open-source LLMs. The overall system architecture is shown in Figure 1. It includes open-source LLMs as the controller, and a tool-use module and a memory module to interact with. Given a human instruction, the Agent, which adopts the selected LLM as the controller, will automatically plan tasks, selectively use tools, leverage knowledge in memory, and finally provide helpful responses to users.

LLMs as Brain
LLMs serve as the brain of the agent, responsible for planning and decomposing user requests, selectively calling tools, performing retrieval, and integrating all the information from previous steps to generate the final response. In order to make it easier for users to customize the agent with their own LLMs, we have added support for various open-source LLMs by default, such as LLaMA, ChatGLM and ChatPLUG, which have been optimized through our tool learning pipeline. The details of the training strategy and tool-use datasets can be found in Section 3. ModelScope-Agent has integrated the LLM inference pipeline of the ModelScope community, so that replacing the controller LLM is a simple configuration change. Furthermore, ModelScope-Agent also provides a standard way to integrate new LLMs. Users can add their own LLMs by integrating the LLM pipeline into ModelScope. After that, the agent can select the new LLMs for training and inference.
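The pluggable-controller idea above can be sketched in plain Python. The class and registry names below (BaseLLM, register_llm, build_llm) are illustrative stand-ins, not the actual modelscope_agent API:

```python
# Minimal sketch of a swappable LLM controller: every model exposes the same
# generate() interface, so the agent can switch models by name.

class BaseLLM:
    """Uniform interface the agent's controller expects."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class EchoLLM(BaseLLM):
    # Stand-in for a real open-source LLM (e.g. LLaMA, ChatGLM, ChatPLUG).
    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"

LLM_REGISTRY = {}

def register_llm(name, cls):
    LLM_REGISTRY[name] = cls

def build_llm(name, **kwargs) -> BaseLLM:
    # Swapping controllers becomes a config change once models share the interface.
    return LLM_REGISTRY[name](**kwargs)

register_llm("echo-llm", EchoLLM)
llm = build_llm("echo-llm")
print(llm.generate("hello"))  # [echo] hello
```

A newly integrated model only needs to be registered under a name; the rest of the agent code is untouched.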

Tool Use
Tool Library. The tool library is used to configure and manage the various collections of APIs used in the agent. ModelScope-Agent can support a wide range of both common APIs, such as search APIs, and AI model APIs across NLP, CV, Audio and Multi-modal models in ModelScope and HuggingFace. Each tool API consists of the API name, description, parameters and request functions. Users can easily choose and configure proper APIs in the library to build their own agents. The default APIs supported in the library can be found in Appendix A.1.
# tool default config file "default_file"
tool_cfg = Config.from_file(default_file)

Register and Customize New Tool. The agent allows users to register and customize new tools, while also supporting quick integration of newly registered tools into the agent, enabling LLMs to selectively use the additional self-defined tools for specific applications. This can be simply done by inheriting from a base class, namely Tool, and defining a new CustomTool with the API-related schema of API name, description, parameters, and request functions. More details about CustomTool can be found in Appendix A.2.

from modelscope_agent.tools import Tool

class CustomTool(Tool):
    # logic added here; refer to the example in Appendix A.2
    ...

tool_list = {'custom-tool': CustomTool()}

Tool Retrieval and Execution. Due to the large number of tool APIs in the tool library, a tool retrieval module is further introduced to recommend appropriate APIs for each instruction prompt. Specifically, we use a dense vector retrieval method based on a unified multilingual text-embedding API. We vectorize both the text descriptions of the APIs and the instruction prompt using the text-embedding API. The top-3 most relevant APIs with the highest vector product scores are selected for tool use. As a result, the schema information of the retrieved APIs will be concatenated with other system prompts in the subsequent memory module and sent to the LLMs as input. With the concatenated instruction prompt, the LLMs will plan and generate the API request, which will be executed by the agent. The agent will then return the results to the LLMs for continuous generation.
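The retrieval step described above can be sketched as follows. A toy bag-of-words embedding stands in for the multilingual text-embedding API, and all function names and API descriptions are illustrative:

```python
# Dense-retrieval sketch: embed API descriptions and the instruction, then
# select the top-3 APIs by inner-product score.

def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_tools(instruction, api_descriptions, top_k=3):
    vocab = sorted({w for d in api_descriptions.values() for w in d.lower().split()})
    q = embed(instruction, vocab)
    scored = sorted(api_descriptions,
                    key=lambda name: dot(q, embed(api_descriptions[name], vocab)),
                    reverse=True)
    return scored[:top_k]

apis = {
    "text-to-image": "generate an image from a text description",
    "text-to-speech": "read text aloud as speech audio",
    "weather": "query the weather forecast for a city",
    "search": "search the web for a query",
}
print(retrieve_tools("draw an image of a cat from text", apis))
```

In the real system the toy embedding would be replaced by calls to the text-embedding API, but the ranking-by-inner-product logic is the same.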

Memory Control
The memory module is used to retrieve and assemble a series of contextual information as input to the LLMs. It consists of a knowledge retrieval submodule and a prompt generator submodule, which are responsible for external knowledge retrieval and instruction prompt generation, respectively.
Knowledge Retrieval. It enables the agent to access up-to-date and localized information related to the query prompt, thereby augmenting the LLMs with dynamic and domain-specific knowledge. We follow the same dense vector retrieval method as the previous tool retrieval module, and support large-scale knowledge retrieval from a localized document corpus. Similarly, it allows users to customize the module by switching to other open-source retrieval frameworks.

Prompt Generator
The prompt generator is used to assemble all available contextual information, such as the system prompt, API schema, retrieved knowledge, conversation history, and few-shot examples. According to the type of user query and the maximum length of the LLM, the user can selectively choose the proper contextual information and assemble the required input to the LLM. In our agent, the prompt generator needs to be defined before the agent is constructed.
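A minimal sketch of such a prompt generator, assuming a simple character-based length budget and drop-oldest-history truncation (both illustrative choices, not the library's actual policy):

```python
# Assemble system prompt, API schema, knowledge, and history into one prompt,
# dropping the oldest history turns until the result fits the LLM's budget.

def generate_prompt(system, api_schema, knowledge, history, query, max_len=200):
    while True:
        parts = [system, api_schema] + knowledge + history + [f"User: {query}"]
        prompt = "\n".join(p for p in parts if p)
        if len(prompt) <= max_len or not history:
            return prompt
        history = history[1:]  # drop the oldest turn and retry

prompt = generate_prompt("You are a helpful agent.", "API: text-to-speech(text)",
                         ["ModelScope doc snippet"], ["User: hi", "Agent: hello"],
                         "Read this aloud.", max_len=200)
print(prompt)
```

A production version would count tokens rather than characters, but the assemble-then-truncate structure is the essential idea.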

Agent Pipeline
In summary, we build the agent by combining all the modules: the LLM controller, the tool-use module, and the memory module. With agent.run, the agent can efficiently execute and complete an instruction in a one-step generation. First, the agent retrieves query-related tools through tool retrieval and combines the retrieved API schema with other contextual prompts in the memory module to construct a new instruction prompt. Then, the agent sends this new prompt to the LLM, which plans whether and which API to call and generates an API request. Next, the agent will execute the selected API with the extracted API parameters and return the API results to the LLM, which will continue to plan whether to call other APIs. If another API call is needed, the process is repeated; otherwise, the LLM generates the final response and the agent returns the final result to the user.

agent = AgentExecutor(llm, tool_cfg, additional_tool_list=tool_list)
agent.run("Draw a logo image of agent")
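The run loop above can be sketched with a scripted stand-in controller. The dictionaries and function names below are illustrative, not the real AgentExecutor internals:

```python
# Agent control loop: the LLM either emits an API request or a final answer;
# tool results are appended to the context until the LLM finishes.

def run_agent(llm_step, tools, instruction, max_steps=5):
    context = [instruction]
    for _ in range(max_steps):
        action = llm_step(context)              # controller plans the next action
        if action["type"] == "final":
            return action["text"]
        result = tools[action["api"]](**action["params"])  # execute the API
        context.append(f"API {action['api']} returned: {result}")
    return "max steps reached"

# Scripted controller: first call the image tool, then answer.
def scripted_llm(context):
    if len(context) == 1:
        return {"type": "call", "api": "text_to_image",
                "params": {"prompt": "agent logo"}}
    return {"type": "final", "text": f"Done. ({context[-1]})"}

tools = {"text_to_image": lambda prompt: f"<image for '{prompt}'>"}
print(run_agent(scripted_llm, tools, "Draw a logo image of agent"))
```

Replacing the scripted controller with a real LLM that parses API schemas from the prompt yields the behavior described in this section.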

Dataset
To facilitate building an agent with the ability to use tools while upholding an optimal level of user engagement, we release a comprehensive tool dataset, MSAgent-Bench, utilizing ChatGPT synthetic data and existing instruction-following datasets. The released dataset encompasses 598k dialogues. Table 1 outlines the key differences between the released dataset and other publicly available tool learning datasets, while the data distribution of our dataset is illustrated in Figure 2. As demonstrated in the Table and Figure, we have made certain efforts to construct a comprehensive dataset that enables the effective training of an agent:

Multilingual: We collect instances in both Chinese and English, ensuring that the trained agent is capable of functioning in both languages.

Various API Categories: Our dataset supports common APIs that have been registered by users or applied through online API platforms, as well as model APIs that can call neural models.

Multi-Turn Dialog: In real-life scenarios, agents may need to request more specific clarification from users to complete a task, or receive additional instructions after completing a previous task. Our dataset accounts for these scenarios and supports multi-turn user-agent interactions when using tools.

API-Oriented QA: An effective agent should possess knowledge of APIs. Our dataset incorporates API document QA tasks and task planning tasks which require agents to offer appropriate suggestions to users on how to use various APIs to solve complex tasks.

API-Agnostic Instructions: To enhance the agent's ability to follow common instructions and increase user engagement, we have incorporated both Chinese and English API-agnostic instructions within our dataset. These instructions place greater emphasis on the agent's inherent capabilities rather than reliance on API invocation.
The data was collected by prompting ChatGPT (gpt-3.5-turbo) to generate instructions, API requests, and answers based on the API calling results; more details can be found in Appendix D.
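The exact on-disk schema of MSAgent-Bench is not specified in the text, but a hypothetical record illustrating a multi-turn API-call dialogue of the kind described above might look like:

```python
# Hypothetical MSAgent-Bench-style record; all field names are illustrative,
# not the released format.
record = {
    "lang": "en",
    "category": "model_api",  # e.g. common_api / model_api / api_qa / api_agnostic
    "dialogue": [
        {"role": "user", "content": "Turn this sentence into speech."},
        {"role": "assistant", "content": "<API> text-to-speech </API>"},
        {"role": "api", "content": "audio file generated"},
        {"role": "assistant", "content": "Here is the audio you asked for."},
    ],
}
print(len(record["dialogue"]))  # 4
```

The interleaved user / assistant / api roles capture the multi-turn tool-use interactions the dataset is built to teach.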

Model Training
We use MSAgent-Bench to fine-tune multiple open-source LLMs, including LLaMA (Touvron et al., 2023), Qwen (QwenLM, 2023), ChatPLUG (Tian et al., 2023), etc. We train all the open-source LLMs in a multi-round conversation mode and concatenate all the prompts and answers. Compared to common instruction tuning data, the tool learning samples focus more heavily on the accuracy of tool selection and API parameter prediction.

Evaluation
Our evaluation system, MSAgent-Eval, comprises two modules: an automatic evaluation framework that comprehensively evaluates the API usability of the agents and a human evaluation framework implemented by an agent arena that reflects the preferences of human users.

Automatic Evaluation Framework
In automatic evaluation, we mainly focus on evaluating the agent's ability to generate accurate API requests and proper answers according to the API calling results. Specifically, we use the action exact match score (Action EM), which measures whether the agent uses the correct API as the reference gold API, and the ROUGE-L score, which measures the similarity between the generated response and the gold answer. Additionally, we introduce a novel metric called Argument F1 for fully evaluating the quality of API requests. To compute Argument F1, we categorize the arguments in the agent's API request into two cases, namely Half match (HM) and Full match (FM), representing the correct argument with the wrong value and the correct argument with the correct value, respectively. Suppose the gold argument number in the API is |A| and the number of arguments in the agent's API request is |A*|. We compute the new recall and precision as:

R = (0.5 × #HM + #FM) / |A|,    (1)
P = (0.5 × #HM + #FM) / |A*|,   (2)

and the final Argument F1 is computed as:

F1 = 2 × R × P / (R + P).    (3)

A sample code for the automated evaluation of agents is provided below:

from tool_agent_finetune import evaluation
EM, F1, ROUGE = evaluation(refs, preds)

Expert annotators were engaged to annotate the evaluation instances, with the task of providing diverse instructions, manually documenting correct API calling requests, and writing appropriate responses. The statistics of our currently assembled test data are in Appendix B.1, and the automatic evaluation scores of our trained agents can be found in Appendix B.2.
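The Argument F1 defined above can be computed as follows. The helper is a sketch based on the definitions in the text, not the released evaluation code:

```python
# Argument F1: half matches (right argument name, wrong value) count 0.5,
# full matches count 1, scored against gold (|A|) and predicted (|A*|) sizes.

def argument_f1(gold_args: dict, pred_args: dict) -> float:
    hm = sum(1 for k, v in pred_args.items()
             if k in gold_args and gold_args[k] != v)   # half match
    fm = sum(1 for k, v in pred_args.items()
             if k in gold_args and gold_args[k] == v)   # full match
    score = 0.5 * hm + fm
    recall = score / len(gold_args) if gold_args else 0.0
    precision = score / len(pred_args) if pred_args else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

gold = {"city": "Beijing", "unit": "celsius"}
pred = {"city": "Beijing", "unit": "kelvin"}
print(argument_f1(gold, pred))  # R = P = (0.5*1 + 1)/2 = 0.75, so F1 = 0.75
```

Arguments absent from the gold request count against precision only, since they inflate |A*| without contributing to the score.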

Human Evaluation with Agent Arena
Inspired by the Arena for ChatBots (Zheng et al., 2023), we have built an accessible Agent Arena that allows users to furnish instructions to two anonymous agents, based on the provided APIs. Subsequently, users have the opportunity to vote on which agent performs better in tackling the instruction with the given APIs. In accordance with the framework presented by Zheng et al. (2023), we adopt a system of Elo ratings and leaderboard maintenance for the participating agents.
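A standard Elo update of the kind used in such arenas can be sketched as follows. The K-factor and initial ratings are conventional choices, not values reported in the paper:

```python
# One Elo update after a pairwise vote between agents a and b.

def elo_update(r_a, r_b, winner, k=32):
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))     # win probability of a
    score_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1000, 1000, "a"))  # (1016.0, 984.0)
```

Applying this update over the stream of user votes and sorting by rating yields the arena leaderboard.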

Usage Example of ModelScopeGPT
In this section, we showcase a successful application of the ModelScope Community, ModelScopeGPT, based on our ModelScope-Agent. To make the pipeline more practical, we have included API retrieval and knowledge retrieval tools to automatically select proper APIs and access the local ModelScope knowledge. As shown in Figure 3a, ModelScopeGPT can support API calls in multi-turn conversations and generate correct API call parameters using information from previous conversations. More cases can be found in Appendix C. As a result, ModelScopeGPT received more than 170k requests from 40k user visits within one month of its release.

ModelScope Intelligent Assistant
Register and Use New Tools. Another key feature of an agent is its generalization capability to unseen APIs. This allows users to quickly register their own APIs and customize their specific applications. Therefore, we test the generalization ability of ModelScopeGPT by applying it to an Alibaba Cloud application scenario. As shown in Figure 3b, we first found an API for renewing an ECS instance on Alibaba Cloud. Then, we registered the API schema defined in the tool library to the agent. Finally, we entered the prompt "Please help me renew an ECS..." in the demo. The agent generated a request through planning, selected the appropriate API, called the API to renew the instance successfully, and provided a reply to inform the user that the renewal was completed. This test demonstrates that the open-source LLM optimized on the released API dataset has a strong generalization ability towards unseen APIs.

B.2 Automatic Evaluation Results

The test dataset consisted of 360 conversations with 2059 text snippets as the references to be compared with the agent predictions, which comprise 798 API requests and 1261 plain text answers according to the previous calling results. We compare the models trained in our proposed framework. The automatic evaluation results are shown in Table 3. Based on the findings obtained from our experimentation, it is evident that ChatGPT with in-context learning yielded inferior results compared to the other models that were subjected to fine-tuning. Furthermore, LLaMA underperformed when compared to other fine-tuned models. Our error study revealed that the lower performance of ChatGPT and LLaMA could be attributed to a large proportion of Chinese test cases in our test set. The models (ChatPLUG, MSAgent-7B) that performed better were those that predominantly focused on Chinese data. Our investigation revealed that ChatGPT and LLaMA exhibited limitations in user intent recognition, which ultimately led to their suboptimal performance on Action EM. Among the models examined, MSAgent-7B displayed the most favorable performance, which could be attributed to the superior performance of its base model.

B.3 Weighted LM
We give an example of the training strategy Weighted LM. As shown in Figure 4, tokens with different colors have different loss weights. For the user input prompt, we set the loss weight to 0, so that the model does not calculate the loss for the prompt. For the API-agnostic text of the assistant, we keep the loss weight at 1. Finally, for the important text of the API calling, such as the API name, parameters, URL, etc., we set the loss weight to 2, which can improve the generation accuracy of API calling.
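The weighting scheme can be sketched as a per-token weighted cross-entropy. The function below is illustrative and operates on plain Python lists rather than the actual trainer:

```python
# Weighted LM loss: per-token cross-entropy scaled by a weight of
# 0 (user prompt), 1 (plain assistant text), or 2 (API name/parameters/URL).
import math

def weighted_lm_loss(token_probs, weights):
    # token_probs: model probability assigned to each gold token.
    assert len(token_probs) == len(weights)
    total_w = sum(weights)
    loss = sum(-w * math.log(p) for p, w in zip(token_probs, weights))
    return loss / total_w if total_w else 0.0

probs   = [0.9, 0.9, 0.5, 0.5]   # two prompt tokens, two API-calling tokens
weights = [0,   0,   2,   2]     # prompt tokens contribute no loss
print(round(weighted_lm_loss(probs, weights), 4))  # 0.6931
```

Zero weights remove the prompt from the objective entirely, while the factor of 2 makes errors on API-calling tokens twice as costly as errors on ordinary assistant text.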

C Cases
In this section, we show qualitative results of the ModelScopeGPT implementation based on ModelScope-Agent.
Single-step Tool Use. As shown in Figures 5 and 6, the instructions expect the model to generate a video and chat about an image, respectively. These instructions can be completed with a single step of tool use.

Multi-step Tool Use
As shown in Figure 7, the instruction expects the model to write the promotional copy first, then read it, and finally generate a video. These instructions require the model to have the ability of multi-step tool use. In the Chinese case, our model accurately completed the three-step tool use.
Multi-turn Tool Use. As shown in Figure 8, the instruction requires the model to have the ability to hold a multi-turn conversation and use the conversation history. Our model can accurately call the API and capture the content of the previous conversation to generate API parameters.

D Data Collection Procedure
We collected our dataset by using prompt engineering to simulate the agent scenarios with two ChatGPTs (gpt-3.5-turbo). One of the ChatGPTs was prompted to act as the user, while the other was assigned to act as the agent. In order to expand the domains and functionalities of the APIs presented in the training data beyond the existing real APIs, we also included a number of synthetic APIs that were generated by ChatGPT. When these synthetic APIs were incorporated into the dialogues, we prompted another ChatGPT to serve as the API and return the relevant calling outcomes.
The data collection procedure is shown in Figure 10. Initially, a set of random in-context demonstrations were provided to ChatGPT for generating an instruction. This instruction could either be a regular one or one that requires solving with APIs, depending on the demonstrations provided. Subsequently, ChatGPT was prompted to act as an agent by first thinking about which action to undertake. If no API calls were deemed necessary, or if user clarification was needed, the agent would respond with a follow-up response to the user. Otherwise, the agent would send an API request to the API gallery. After receiving the result of the API call, the agent would assess the situation and decide on the next action. This iterative "user-agent-API" loop would continue until the agent determined that it was appropriate to terminate the conversation with the final answer. After acquiring the raw dataset, we applied filtering mechanisms to eliminate instances in which ChatGPT generated API requests containing hallucinated API names and parameters that were absent from the retrieved APIs. Additionally, we excluded instances in which ChatGPT generated illegal API requests, thus resulting in a refined and finalized dataset.
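The "user-agent-API" loop and the filtering step can be sketched as follows, with trivial stand-ins for the three ChatGPT roles (all names and the sample dialogue are illustrative):

```python
# Simulated data collection: a user role, an agent role, and an API role take
# turns until the agent emits a final answer; a filter drops dialogues whose
# API requests hallucinate unknown API names.

def simulate_dialogue(user_sim, agent_sim, api_sim, max_turns=6):
    dialogue = [("user", user_sim())]
    for _ in range(max_turns):
        action = agent_sim(dialogue)               # agent decides: call API or answer
        dialogue.append(("agent", action))
        if action["type"] == "final":
            break
        dialogue.append(("api", api_sim(action)))  # simulated API result
    return dialogue

def is_valid(dialogue, known_apis):
    # Filtering step: every API request must name a known API.
    return all(turn["api"] in known_apis
               for role, turn in dialogue
               if role == "agent" and turn["type"] == "call")

user_sim = lambda: "What is the weather in Beijing tomorrow?"

def agent_sim(dialogue):
    if dialogue[-1][0] == "user":
        return {"type": "call", "api": "weather", "params": {"city": "Beijing"}}
    return {"type": "final", "text": "It will be sunny in Beijing tomorrow."}

api_sim = lambda action: "sunny"

dialogue = simulate_dialogue(user_sim, agent_sim, api_sim)
print(is_valid(dialogue, {"weather"}))  # True
```

Swapping the stub functions for prompted gpt-3.5-turbo calls recovers the collection pipeline described above.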
As introduced in Section 3.1, we collect instances across different languages and topics; the detailed statistics of our collected data are shown in Table 4.

E.2 Agent & Tool Learning
The utilization of large language models (LLMs) as controllers to construct agent systems has emerged as a prominent research area. Several related works employ prompt engineering techniques on closed-source LLMs, such as ChatGPT (OpenAI, 2022) and Claude, to enable their application in specific domains. For instance, Visual-ChatGPT (Wu et al., 2023) and HuggingGPT (Shen et al., 2023) make HuggingFace model callings accessible to OpenAI LLMs. SayCan (Ahn et al., 2022) and Inner Monologue (Huang et al., 2023) integrate LLMs with robots to achieve robotic systems. Notably, recent works such as LangChain and Auto-GPT encompass a wide range of tools, including common APIs and neural models, and enhance long-term reasoning and human-agent interaction while solving tasks, which demonstrates the immense potential for building a generalized agent. Numerous endeavors have also been made to enable open-source LLMs to utilize tools. For instance, Gorilla (Patil et al., 2023) and GPT4Tools (Yang et al., 2023) generate training data using self-instruction techniques to train open-source LLMs to effectively utilize neural models. ToolAlpaca (Tang et al., 2023) and ToolLLaMA (Qin et al., 2023) train LLaMA using common APIs, with the distinction that ToolAlpaca employs synthetic APIs from LLMs, whereas ToolLLaMA utilizes real APIs.
Overall, compared to the above-mentioned methods, ModelScope-Agent differs in the following aspects. Firstly, our method includes a universal training framework that supports user-customized agent learning for open-source models to meet industrial needs. Secondly, ModelScope-Agent can support various APIs in different fields, including model APIs and common APIs, while previous works only support certain specific APIs.

F Future Work
In the future, we will evolve to support more sophisticated agent architectures, such as ReAct and code interpreters. In the meantime, we will continuously improve the capabilities required by open-source LLMs as agents. ModelScope-Agent relies on the ModelScope community and will adapt to more new open-source LLMs in the future, providing more applications developed based on ModelScope-Agent, such as a personal-assistant agent, story agent, motion agent, and so on.

Figure 2: The instance types and distribution of our collected MSAgent-Bench.

To improve the accuracy of tool selection and API parameter prediction, we propose a simple training strategy, Weighted LM, which enhances the training of the generation of API names and parameters, while zeroing out the loss of tokens from the user prompt and the tool execution. More details can be found in Appendix B.3.

kwargs = dict(model=model, ...)
trainer: EpochBasedTrainer = build_trainer(name=args.trainer, default_args=kwargs)
trainer.train()

Figure 3: Demo cases of ModelScopeGPT based on ModelScope-Agent.
Based on ModelScope-Agent, we have developed an intelligent assistant for the ModelScope Community, namely ModelScopeGPT. It uses LLMs as a controller to connect dozens of domain-specific AI models in the ModelScope open-source community, covering NLP, CV, Audio, and Multi-Modal fields.
ModelScope-Agent aims to facilitate building AI Agent applications and research based on open-source LLMs by providing a general and customizable agent framework covering flexible system design, data collection, model training, evaluation and usage examples in real-world applications. It provides an open-source, community-driven library for AI Agent learning and best practices for building an agent system with open-source LLMs. We hope ModelScope-Agent can help pave the way towards a new era of AI Agents.

Figure 4: Example of the training strategy for weighted LM. Tokens of different colors have different loss weights.

Figure 5: Single-step tool-use instructions, text-to-video cases. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 6: Single-step tool-use instructions, image-chat cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 7: Multi-step tool-use instructions. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 8: Multi-turn tool-use instructions, text-to-speech and text-to-image cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 9: Multi-turn tool-use instructions, text-to-speech and text-to-image cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 10: The data collection procedure.

Table 1: The statistics of MSAgent-Bench and other existing tool learning datasets.

Table 3: Automatic evaluation results. * represents that we do not fine-tune ChatGPT but use in-context learning with 2 demonstrations.

Table 4: The statistics of our collected dataset.

...as multi-modal tasks. It also falls short on tasks that require up-to-date information, which is beyond the pretraining data. Using tools or external APIs can help overcome these limitations and harness the power of LLMs to facilitate seamless connections with downstream applications. In ModelScope-Agent, we provide the whole customizable framework and best practices for building an agent system, which enables open-source LLMs to use tools and external APIs.