DetGPT: Detect What You Need via Reasoning

In recent years, the field of computer vision has seen significant advancements thanks to the development of large language models (LLMs). These models have enabled more effective and sophisticated interactions between humans and machines, paving the way for novel techniques that blur the lines between human and machine intelligence. In this paper, we introduce a new paradigm for object detection that we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions, allowing for a higher level of interactivity. Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors to perform reasoning within the context of the user's instructions and the visual scene. This enables DetGPT to automatically locate the object of interest based on the user's expressed desires, even if the object is not explicitly mentioned. For instance, if a user expresses a desire for a cold beverage, DetGPT can analyze the image, identify a fridge, and use its knowledge of typical fridge contents to locate the beverage. This flexibility makes our system applicable across a wide range of fields, from robotics and automation to autonomous driving. Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines. We hope that our proposed paradigm and approach will provide inspiration to the community and open the door to more interactive and versatile object detection systems. Our project page is available at detgpt.github.io.

As highlighted by recent studies (Shah et al., 2023; Brohan et al., 2022; Fang et al., 2020), since intelligent robots rely heavily on interactions with humans, the field of embodied AI/robotics is set to experience a significant transformation. With the emergence of human-like intelligence in LLMs and VLMs, robots will be able to interpret human instructions and reason over visual scenes, enabling them to execute corresponding actions. This breakthrough will lead to the creation of intelligent robots that are more helpful to humans, and opens up possibilities for various fields.
However, it is important to note that while VLMs have made remarkable progress in generating high-quality image descriptions, this alone is insufficient for robots to interact with the physical world effectively. To achieve this, robots must be able to accurately identify and localize objects within visual scenes, which is a vital prerequisite for performing actions such as "moving" and "grasping" objects. This goal of "localizing objects" is closely linked to object detection, one of the most fundamental and extensively studied research areas in computer vision. Conventional object detection systems, such as Faster R-CNN (Ren et al., 2015), RetinaNet (Lin et al., 2017), and YOLO (Redmon et al., 2016), can only detect a fixed number of object categories, which restricts their practicality. Recently, a series of open-vocabulary detection (OVD) systems have emerged as the new trend (Gu et al., 2021; Li et al., 2022; Yao et al., 2022; Liu et al., 2023b). Specifically, these models adopt a contrastive learning approach to align object-level visual features with the textual class embeddings extracted from a pretrained text encoder (e.g., BERT (Devlin et al., 2019)). In this way, these models are able to detect a much wider range of objects during inference.
Despite the success achieved by OVD systems, they still require humans to provide specific object names, which is neither user-friendly nor realistic. Firstly, human users tend to provide high-level instructions, which may not explicitly contain the object of interest. Secondly, the constraints of human knowledge often hinder users from providing object names. For example, the user may wish to identify fruits with a high vitamin K content but lack the necessary expertise to determine which fruits fulfill this requirement. Finally, the range of object categories that humans can supply is intrinsically finite and non-exhaustive. As an illustration, when attempting to detect "objects posing hazards to autonomous vehicles," humans may only be able to enumerate a limited number of scenarios, such as compromised visibility or intricate pedestrian traffic patterns. In summary, it would be desirable if the detection model were able to interpret human instructions, employ its own knowledge to identify all objects of interest via reasoning, and finally localize them.
To this end, we propose a new research task: reasoning-based object detection. In essence, humans provide abstract queries via natural language, and the model discerns and reasons about which objects in the image may fulfill the query, subsequently detecting them. We make preliminary explorations in this direction. Specifically, we fine-tune a VLM (e.g., MiniGPT-4 (Zhu et al., 2023)) built on LLMs (e.g., Vicuna (Chiang et al., 2023)) to perform reasoning and predict objects of interest based on user queries (instructions) and input images. We then provide the object names to an open-vocabulary detector for specific location prediction. To facilitate future research in the direction of reasoning-based object detection, we curate a benchmark named RD-Bench containing 20,000 images and around 120,000 query-answer pairs, which will be open-sourced for the research community.

Vision Language Models
Given the success of language models, much subsequent research has explored vision-language interaction, resulting in the development of various multi-modal models. The development of multi-modal models has followed a number of trends in language model research. Inspired by BERT-like encoder models, most multi-modal models before 2021 (Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020; Li et al., 2021; Ding et al., 2021; Li et al., 2020; Ding et al., 2022, 2023) are encoder-only Transformers, which are good at cross-modal understanding tasks. However, the transition from encoder-only models to decoder-based models in language model research inspired a similar shift in multi-modal learning, including encoder-decoder models like VL-T5 (Cho et al., 2021), OFA (Wang et al., 2022a), and DaVinci (Diao et al., 2023c), and decoder-only models like GPT-4 (OpenAI, 2023). Most recently, we have witnessed the potential of multi-modal learning due to the powerful language abilities of LLaMA. Recent works include LLaVA (Liu et al., 2023a) and MiniGPT-4 (Zhu et al., 2023). Unlike these works, our DetGPT focuses on localizing objects of interest based on user instructions, allowing for greater control over objects through language.

Object Detection
Object detection is one of the most fundamental tasks in computer vision, aiming to localize objects in images. Traditional object detectors have a fixed number of classification heads, which makes them only capable of predicting the classes on which they are trained (Girshick, 2015; Ren et al., 2015; Lin et al., 2017; Yao et al., 2021a; Duan et al., 2019; Yao et al., 2021b; Zhu et al., 2020; Carion et al., 2020). Recently, open-vocabulary object detection has attracted a lot of attention (Gu et al., 2021; Li et al., 2022; Liu et al., 2023b; Yao et al., 2022). The main philosophy is to utilize contrastive training between object visual features and their class embeddings. In this way, object detectors are able to recognize objects that are unseen during training based on their semantics. Despite the success of open-vocabulary object detectors, their ability is still limited in the sense that they can only perform prediction given specific object phrases. In contrast, our DetGPT enables reasoning and localizing objects of interest given any high-level user instruction.

Problem Statement
Recent vision language models (VLMs) backed by LLMs have shown promising results in visual understanding based on the visual scene and natural language input. However, they lack the ability for fine-grained visual understanding and precise localization, which makes them difficult to apply to real-world scenarios involving embodied AI, such as robotics and autonomous driving. On the other hand, object detection is a crucial task in computer vision, which enables models to analyze images in a fine-grained manner and predict precise object locations. Unfortunately, existing detection methods either can only predict a fixed number of object classes, or need exact object names for detection.
To address these limitations, we propose a new task termed reasoning-based object detection. In this task, users provide abstract queries using natural language, and the model analyzes both the image and the user input, reasons about which objects in the image may fulfill the user's goal, and finally detects their locations in the image. For example, as shown in Figure 2, when a user requests "I want to have a cold beverage," the model first analyzes the image of a kitchen and determines that there is no "cold beverage" directly visible. Then, it identifies a refrigerator in the image and, based on the common-sense knowledge stored in the LLM, infers that the beverage is likely stored inside the refrigerator, which it finally localizes.
The proposed reasoning-based object detection task and DetGPT open up a world of possibilities for human-machine interaction and have the potential to greatly improve the capabilities of general-purpose robots.

Multi-Modal Query-Answer Instruction Data Generation
The traditional way of labelling a dataset requires substantial human labor. Recently, LLMs such as ChatGPT have demonstrated superior generation capability, which can be used to replace human labeling with automatically generated annotations (Schick and Schütze, 2021; Ye et al., 2022a,b; Meng et al., 2022; Gao et al., 2023; Ye et al., 2023). However, such text-only LLMs are not able to interpret visual inputs, which hinders their practicality for data generation based on images. Motivated by LLaVA (Liu et al., 2023a), we leverage the images of pre-existing datasets (Lin et al., 2014; Shao et al., 2019a) and employ two types of textual annotations to bridge the gap between visual and textual representations: (1) image captions, which depict the visual content from different viewpoints; and (2) object categories, which list the objects present in the image. Specifically, we adopt the COCO (Lin et al., 2014) and Objects365 (Shao et al., 2019a) datasets for constructing RD-Bench. Based on the given captions and objects, we design query-answer prompts to instruct ChatGPT to generate the following: (1) a more detailed description of the scene, which gives ChatGPT itself a better sense of the visual scene; (2) query-answer pairs, each consisting of a user query (instruction) and the corresponding answer that contains both the reasoning process and the names of objects in the image that match the query. For each image, we generate one detailed description followed by several instruction-answer pairs.
We reorganize the annotations such that each image is associated with its corresponding query-answer pairs. The detailed system prompt for our cross-modal object detection task is shown in Table 9 in the Appendix. To enable better annotation generation, we further manually design two in-context examples for querying ChatGPT, which are shown in Table 10 and Table 11 in the Appendix. Examples of generated detailed descriptions and query-answer pairs are shown in Table 12.
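To make this pipeline concrete, the sketch below shows how one image's textual annotations could be packed into a ChatGPT request. This is a minimal sketch under stated assumptions: `SYSTEM_PROMPT` and `IN_CONTEXT_EXAMPLES` are placeholders standing in for the actual contents of Tables 9-11, not the released generation code.

```python
from typing import Dict, List

SYSTEM_PROMPT = "You are a visual assistant ..."   # full text in Table 9 (placeholder here)
IN_CONTEXT_EXAMPLES: List[Dict[str, str]] = []     # two manual examples, Tables 10-11 (placeholder)

def build_generation_request(captions: List[str], objects: List[str]) -> List[Dict[str, str]]:
    """Pack one image's captions and object list into a chat request asking
    for a detailed description plus several query-answer pairs."""
    annotation = (
        "Captions:\n" + "\n".join(captions)
        + "\nObjects: " + ", ".join(objects)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *IN_CONTEXT_EXAMPLES,  # prepend the two in-context examples
        {"role": "user", "content": annotation},
    ]
```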

Model Design
As an initial attempt towards reasoning-based object detection, we propose a two-stage approach that first leverages the VLM to interpret the image and, via reasoning, generate the object names/phrases that match the user's instructions; we then leverage an open-vocabulary object detector to localize the relevant objects given the VLM's output. Specifically, for the VLM, we employ a pre-trained visual encoder to extract image features, followed by a cross-modal alignment function that maps the image features to the text domain. Then, we utilize a pre-trained LLM as the knowledge brain to interpret both the image features and human instructions, perform reasoning, and determine target objects that help fulfill the given user query. Our framework is illustrated in Figure 1.
Inspired by (Zhu et al., 2023), we employ the visual encoder of BLIP-2 (Li et al., 2023) as the vision encoder and utilize Vicuna (Chiang et al., 2023) or Robin (Diao et al., 2023a) as the language model to interpret both visual and text features. For the open-vocabulary detector, we leverage Grounding-DINO (Liu et al., 2023b) to localize the target objects in the image. Following MiniGPT-4 (Zhu et al., 2023), we train a linear projection layer from scratch for cross-modal alignment, which has proven effective in bridging the gap between the vision and language modalities.
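A rough sketch of this two-stage flow follows; the interface names (`VLM`, `OpenVocabDetector`, `generate`, `detect`) are illustrative abstractions, not the released implementation.

```python
from typing import Callable, List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

class VLM(Protocol):
    def generate(self, image, prompt: str) -> str: ...

class OpenVocabDetector(Protocol):
    def detect(self, image, text_queries: List[str]) -> List[Tuple[str, Box]]: ...

def reasoning_based_detection(image, query: str, vlm: VLM,
                              detector: OpenVocabDetector,
                              parse_fn: Callable[[str], List[str]]):
    # Stage 1: the VLM reasons over the image and instruction, then names
    # the objects that would fulfill the query.
    answer = vlm.generate(image, prompt=query)
    object_names = parse_fn(answer)  # see the extraction sketch under "Training and Inference"
    # Stage 2: an open-vocabulary detector (e.g., Grounding-DINO) localizes
    # the named objects in the image.
    return answer, detector.detect(image, text_queries=object_names)
```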
Challenge. One straightforward way to implement our proposed framework is to combine off-the-shelf VLMs with open-vocabulary object detectors without further training. However, we observe that even though carefully chosen prompting can make VLMs output objects in a specific pattern, they tend to output redundant objects that are either not shown in the image or unrelated to the user's instruction (shown in Figure 4).

Training and Inference
Step 1. Image-Text Pretraining. We follow (Zhu et al., 2023) and leverage a combined dataset of SBU, LAION, and Conceptual Captions to conduct image-text pretraining. We minimize the language modeling loss:

$$\mathcal{L} = -\sum_{i}\sum_{j}\sum_{t=1}^{L} \log p_{F}\!\left(y^{i,j}_{t} \mid I_i,\, y^{i,j}_{<t}\right),$$

where $F$ represents the multi-modal model, $I_i$ represents the $i$-th image, and $y^{i,j}_t$ denotes the $t$-th token of the $j$-th answer that belongs to the $i$-th image. $L$ is the length of the answer. During this process, only the linear projection layer is trainable, while all other components are kept frozen. Since the visual features from BLIP-2 (Li et al., 2023) are already well aligned with textual features, tuning just the linear layer suffices for aligning with a new language model such as Vicuna.
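A minimal PyTorch sketch of this setup is shown below. The feature dimensions are illustrative, not the exact sizes used in the paper; the point is that the loss is a standard per-token cross-entropy and only the projection layer receives gradients.

```python
import torch
import torch.nn.functional as F

# Only the projection layer is optimized; the vision encoder and the LLM are
# kept frozen. Dimensions below are illustrative placeholders.
projection = torch.nn.Linear(768, 4096)
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

def lm_loss(logits: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """Sum of -log p(y_t | I, y_<t) over answer tokens.

    logits: (T, V) next-token predictions from the frozen LLM, conditioned on
    the projected image features and the answer prefix; answer_ids: (T,).
    """
    return F.cross_entropy(logits, answer_ids, reduction="sum")
```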
Step 2. Instruction Tuning. After the first step, although the model is able to generate detailed descriptions based on the image, it is not able to interpret human instructions well and derive objects that fulfill the user's goal. Therefore, we leverage a subset of our curated RD-Bench to fine-tune our model. Similar to the first step, the language modeling loss is optimized, and only the linear projection layer is tunable. We demonstrate that the instruction tuning phase is a crucial step that empowers the VLM to comprehend fine-grained image features and derive the object of interest given the user's instruction.

Example from Figure 4, for the query "I want a cold beverage." DetGPT answers: "In the given image, there is a refrigerator which can be used to store cold beverages. Therefore the answer is [refrigerator]." MiniGPT-4, given the same appended user prompt, produces a long scene description ending with "Therefore the answer is [fridge, drinks, refrigerator, countertops, dining table, chairs]."
Instruction tuning empowers the model to identify the objects of interest in the image. However, we find that the output format of the model often varies, which poses difficulty in extracting the relevant object names/phrases. Therefore, we design a user prompt that helps the model output the objects strictly in a given format (shown in Table 1), which makes our model more stable. The final input sequence used to train the model is "###Human: ⟨Img⟩ ⟨ImageHere⟩ ⟨/Img⟩ ⟨TextHere⟩ ⟨User_Prompt⟩", where ⟨ImageHere⟩ corresponds to the input image and ⟨TextHere⟩ to the user instruction.

User Prompt
Answer me with several sentences. End the answer by listing out target objects to my question strictly as follows: ⟨Therefore the answer is: [object_names]⟩.
Table 1: User Prompt. We found that prompting is necessary for listing the names of objects of interest in a consistent format, which makes DetGPT more stable.
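As a sketch of how the training sequence might be assembled: the prompt text below follows Table 1, `⟨ImageHere⟩` is written as `<ImageHere>` for plain ASCII, and the exact token spelling is an assumption rather than the released format.

```python
USER_PROMPT = (
    "Answer me with several sentences. End the answer by listing out target "
    "objects to my question strictly as follows: "
    "<Therefore the answer is: [object_names]>."
)

def build_input(user_instruction: str) -> str:
    # <ImageHere> marks the slot that is replaced by the projected visual
    # embeddings at run time; the user prompt is appended to every query.
    return f"###Human: <Img><ImageHere></Img> {user_instruction} {USER_PROMPT}"
```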
Inference. During inference, we first provide the model with a system prompt (shown in the Appendix), which we find helpful in stabilizing the model's output. Then, we append the user prompt after the user's query. After obtaining the generated answer from the VLM, we extract the object names/phrases from it by matching the specific output format, i.e., the object names following "Therefore the answer is: ". Finally, we send the names/phrases and the image to the object detector for localization.
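This extraction step can be a simple pattern match on the enforced output format; a minimal sketch (the regex and function name are illustrative):

```python
import re
from typing import List

_ANSWER_RE = re.compile(r"Therefore the answer is:?\s*\[([^\]]*)\]", re.IGNORECASE)

def extract_object_names(answer: str) -> List[str]:
    """Pull the bracketed object list the user prompt forces the VLM to emit."""
    match = _ANSWER_RE.search(answer)
    if match is None:
        return []  # the model drifted from the format; nothing to detect
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

# e.g. extract_object_names("... Therefore the answer is: [fridge, drinks]")
# -> ["fridge", "drinks"]
```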

Demonstration
We present the visualization results of DetGPT in Figure 5 and evaluate its capabilities. Interestingly, DetGPT exhibits the following appealing features: 1) it is proficient in common-sense reasoning based on the user's abstract query and the image; 2) it can utilize the rich knowledge stored in LLMs that goes beyond human common sense; 3) thanks to the abundant knowledge stored in the VLM, DetGPT generalizes to a broad range of objects that do not appear during instruction tuning.

Experiments
We conduct first-stage training on paired image-text data to achieve vision-language alignment. Afterwards, we conduct instruction tuning and evaluation on our curated RD-Bench. Specifically, we randomly sample a subset of 5,000 images originally from the COCO dataset along with their query-answer pairs, which amounts to around 30,000 query-answer pairs for instruction tuning. For evaluation, we sample (1) 1,000 images originally from the COCO dataset to evaluate DetGPT's in-domain (ID) performance, and (2) 1,000 images from the Objects365 dataset, which are not seen by the model during training, to test its out-of-domain (OOD) performance.
Training Details. For first-stage training, the learning rate is set to $1 \times 10^{-4}$, the batch size is set to 128, and the model is trained for 40,000 steps. For instruction tuning, the learning rate is $3.5 \times 10^{-5}$, the batch size is set to 32, and the model is trained for 40,000 steps. We use AdamW as the optimizer with a cosine learning rate scheduler. We use 8 A40 GPUs to conduct all experiments.
Evaluation. We conduct evaluation using the conventional metric for object detection, i.e., mean average precision (mAP), which quantifies how well the predicted bounding boxes overlap with the ground-truth ones. Specifically, precision and recall are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$

where $TP$ represents true positives (correctly detected objects), $FP$ represents false positives (incorrectly detected objects), and $FN$ represents false negatives (missed detections). Then, average precision (AP) and mean average precision (mAP) are defined as:

$$\text{AP} = \frac{1}{|R|}\sum_{r \in R} \text{Precision}(r), \qquad \text{mAP} = \frac{1}{N}\sum_{i=1}^{N} \text{AP}_i,$$

where $R$ represents the set of recall values, so AP can be interpreted as the area under the precision-recall curve; $N$ is the total number of classes, and $\text{AP}_i$ represents the average precision for class $i$.
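As a worked sketch of these quantities (the discretization of the precision-recall curve is one illustrative choice among several standard ones):

```python
from typing import List, Tuple

def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def average_precision(pr_points: List[Tuple[float, float]]) -> float:
    """Area under the precision-recall curve, approximated by summing
    precision over the recall increments of the sampled (precision, recall)
    points."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in sorted(pr_points, key=lambda p: p[1]):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class: List[float]) -> float:
    return sum(ap_per_class) / len(ap_per_class)
```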
Rather than deriving the mAP for all the objects in the image, we calculate it only over the objects that fulfill the user's query. Therefore, detecting objects that are irrelevant to the user query will be counted as false positives and decrease the final evaluation metric.
There are two major issues during evaluation: (1) the language model's output cannot be guaranteed to exactly match the object names in the benchmark, even when they share the same meaning; (2) there may be hierarchies among the categories, e.g., a stuffed animal is also a toy. To address these problems, we first leverage FastText (Joulin et al., 2016) to calculate the similarities between the LLM-predicted objects and all the class names in the benchmark (COCO and Objects365 have 80 and 365 classes, respectively). Then, we take the top-1, top-5, and top-10 class names for each LLM-predicted object and check whether the ground-truth class of the object is included. This approach makes our evaluation more robust for reasoning-based object detection, since words that share similar meanings, as well as those in hierarchical relationships, tend to have higher similarities in their word embeddings.
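A sketch of this matching step using the FastText Python bindings; the model file name is an assumption, and the cosine-similarity ranking is our reading of the procedure rather than the released evaluation code.

```python
import numpy as np
import fasttext  # assumes a pretrained vector model on disk

ft = fasttext.load_model("cc.en.300.bin")  # illustrative model file

def matches_top_k(predicted: str, class_names: list, gt_class: str, k: int) -> bool:
    """Check whether the ground-truth class is among the k benchmark classes
    whose FastText embeddings are most similar to the LLM-predicted phrase."""
    def unit_vec(text: str) -> np.ndarray:
        v = ft.get_sentence_vector(text)  # handles multi-word phrases
        return v / (np.linalg.norm(v) + 1e-8)

    pred_vec = unit_vec(predicted)
    ranked = sorted(class_names, key=lambda c: -float(pred_vec @ unit_vec(c)))
    return gt_class in ranked[:k]
```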
Main Results. As the first attempt at our proposed reasoning-based object detection task, we conduct exhaustive experiments on different variants of DetGPT using our curated RD-Bench. Specifically, we study the effects of instruction tuning, the system prompt, and the user prompt. As shown in Table 2, instruction tuning contributes the largest performance gain, and prompts further enhance the model's performance.
Varying the Language Model. We conduct experiments on different language model families with various model sizes. Specifically, we adopt Flan-T5 (Chung et al., 2022), Vicuna (Chiang et al., 2023), and Robin (Diao et al., 2023a). As demonstrated in Table 3, the Flan-T5 model family achieves inferior performance compared with the Vicuna and Robin model families. This result verifies that the quality of the language model is crucial for the promising performance of DetGPT.
Examples from Figure 5 (Q: user query; A: DetGPT's reasoning):

Q: "Find food that can relieve high blood pressure."
A: "In the image, there are various fruits and vegetables present on the produce stand, including bananas, apples, and carrots. These fruits and vegetables are known to be high in potassium, which can help to lower blood pressure."

Q: "Find potential danger for humans."
A: "In the image, there is a manhole cover on the ground, which can pose a potential danger to humans if they accidentally step on it or fall into it."

Q: "I want to have a cold beverage."
A: "In the given image, there is a refrigerator which can be used to store cold beverages."

Q: "Why is the boy so sad?"
A: "The boy in the image appears to be sad because his toy car has been damaged or broken. It is likely that he was playing with the toy car before it was damaged, and now he is upset because he cannot play with it anymore."

Q: "Find an item inappropriate for children."
A: "There are several items in the image that may be inappropriate for children, such as cigarettes. Cigarettes can be harmful to children's health and can lead to addiction."

Q: "How to enter the desert city 'Gerudo Town', where only women are allowed and men are prohibited?"
A: "The image depicts a scene from the video game 'The Legend of Zelda: Breath of the Wild.' In this game, there is a city called Gerudo Town, where only women are allowed and men are prohibited."

Q: "Find what's interesting about the image."
A: "The image shows a bird perched on a window sill, looking out onto a busy city street. The bird appears to be a pigeon, which is a common sight in urban areas. The bird's presence in the image is interesting because it provides a glimpse into the life of a wild animal living in an urban environment."

Q: "Find items appropriate for a romantic dinner."
A: "There are several items that are appropriate for a romantic dinner, including wine glasses, candles, flowers, and nice clothing. These items can create a romantic atmosphere and make the dinner more special."

Q: "How to enter this city for Link?"
A: "In the game 'The Legend of Zelda: Breath of the Wild,' there is a city called Gerudo Town, where only women are allowed and men are prohibited. To enter this city, you need to find a way to disguise yourself as a woman. One way to do this is to wear a Gerudo outfit, which can be found in various locations throughout the game."

Figure 5: Demonstration of the reasoning process and generated bounding boxes of our DetGPT. Due to space limitations, we do not show the system prompt, the user prompt, or the final sentence "⟨Therefore the answer is: [object_names]⟩" for outputting the object names.

Instruction Tuning with Prompting. As shown in Table 4, we observe that instruction tuning achieves better results if the queries are augmented with prompts during training. This implies that it is desirable to keep the input format consistent between instruction tuning and inference.
The Impact of Reasoning. Before outputting the objects of interest, our DetGPT first performs reasoning by describing the image content and then uses commonsense knowledge to decide which objects help fulfill the user's query. In Table 5, we analyze the impact of this reasoning process on detection accuracy. Specifically, if we train the model to directly output objects of interest without reasoning, a significant performance drop is observed. This verifies that reasoning is not only a desirable feature of DetGPT but also a key factor that helps it accurately derive the objects of interest based on human instructions.
The Impact of Instruction Tuning Size. From Figure 6, we observe that promising performance can already be achieved with around 10,000 samples for instruction tuning. This verifies that the knowledge stored in the base model is the key to DetGPT's strong performance, and only a small number of samples is needed to empower the model to follow human instructions and output the objects of interest in a standard format.

Conclusion
We propose a new task termed reasoning-based object detection, in which the model needs to interpret high-level human instructions, reason over the visual scene, and finally localize the objects of interest. To facilitate future research on this task, we curate RD-Bench, a dataset that can be used for training and evaluation. Then, we design a two-stage detection pipeline, named DetGPT, which demonstrates strong ability on open-ended tasks and achieves promising performance on our proposed benchmark. We hope our work will pave the way for more interactive and user-friendly object detection systems and inspire later works on embodied AI, autonomous driving, and robotics.
Examples from Figure 7 (Q: user query; A: the multi-modal model's output):

Q: "Find the elderly person."
A: "The image depicts an elderly person wearing a face mask while walking down the street with a cane. A young man is helping the elderly person."

Q: "Find the toy plane."
A: "In the image, there is a pile of toy planes."

Figure 7: Demonstration of failure cases. Top: even though the multi-modal model is able to understand the visual scene and find the elderly lady, the object detector localizes both the young man and the elderly person and labels them both as elderly persons. This may be because the detector cannot distinguish a "young man" from an "elderly person". Bottom: there is only one toy plane in the image, but the multi-modal model recognizes "a pile of toy planes". This may be caused by the multi-modal model's lack of fine-grained visual recognition capability.

Limitation
As the first attempt towards a reasoning-based object detection system, despite the promising results, DetGPT still has some limitations (shown in Figure 7). Due to the two-stage nature of DetGPT, the weaknesses of both the open-vocabulary detector and the multi-modal model become the bottleneck. For example, we observe that in some cases, even though the multi-modal model is able to find the relevant objects in the image, the open-vocabulary detector is not able to localize them, which may be because the training data of the object detector does not encompass such visual concepts. In some other cases, the multi-modal model is not able to find all relevant objects in the image, possibly due to a lack of fine-grained visual recognition ability. These limitations motivate new research in this direction and demand more advanced solutions.

Ethic Statement and Broader Impact
The proposed task of reasoning-based object detection and the proposed method, DetGPT, have the potential for significant broader impact across a variety of fields. By enabling an embodied agent to automatically locate objects of interest based on human instructions, DetGPT has the potential to improve the efficiency and effectiveness of tasks such as grasping in robotics and object recognition in autonomous driving. This could lead to safer and more reliable autonomous systems in these fields.
Furthermore, the introduction of RD-Bench as a curated and open-source benchmark for instruction tuning and evaluation can facilitate further research and development in this area, potentially leading to more advanced and versatile applications of reasoning-based object detection.
Overall, the proposed task and method demonstrate a step towards more sophisticated and intuitive interactions between humans and machines.
We do not foresee any ethical concern regarding our paper.

A Data Generation Using ChatGPT
We built our RD-Bench by utilizing images from established datasets like COCO and Objects365, together with the powerful LLM ChatGPT. However, as ChatGPT is designed to handle textual inputs exclusively, we employed caption and bounding box annotations to connect the visual and textual modalities. In the case of Objects365, where caption annotations were not available, we employed LLaVA to generate them with the prompt "Describe the image in detail". To showcase the data generation process, we provide both the system prompt and the in-context examples that were presented to ChatGPT.
System Prompt. Table 9 displays the system prompt we utilized. In this prompt, we set ChatGPT's role as a visual assistant that generates query-answer pairs based on the given image.
To ensure a comprehensive and diverse range of instructions, we delineate four specific types of queries to be included during data generation: 1) goal-oriented queries, which associate relevant objects with high-level user queries; 2) detection of all visible objects in the image, similar to conventional object detection tasks; 3) detection of specific objects based on their categories; and 4) attribute-related queries, aimed at localizing objects with specific attributes such as color and shape.
In-Context Examples. To enable better generation quality, we append two manually written examples after the system prompt as in-context examples, which are shown in Table 10 and Table 11. Specifically, we first list the captions and the objects, which are the inputs to ChatGPT. We then provide a detailed description and the query-answer pairs, which should be the outputs from ChatGPT.

Figure 1: Framework of DetGPT. The VLM, consisting of a vision encoder and an LLM, interprets the user instruction, reasons over the visual scene, and finds objects matching the user instruction. Then, the open-vocabulary detector localizes these objects based on the VLM's output.

Figure 2: Illustration of the reasoning-based object detection task. The detection system is able to interpret human instructions, reason about the visual scene with common-sense knowledge, and finally localize the objects of interest.

Figure 3: Comparison with other SOTA VLMs. Our DetGPT is able to find and localize the object of interest.

Figure 4: MiniGPT-4 vs. DetGPT with the appended user prompt. The reasoning process and the final sentences containing target objects are both shown for clarity. MiniGPT-4 generates redundant objects, while DetGPT accurately recognizes the object of interest.

Figure 6: Average mAP for various sizes of the instruction tuning dataset. Only a small number of samples is needed to reach promising performance.

Table 2: Test results on RD-Bench. Instruction tuning and prompting enable DetGPT to achieve promising performance on both in-domain and out-of-domain tasks.

Table 3: Performance with different language models.

Table 4: Adding prompts during instruction tuning.

Table 5: Effect of adding the reasoning process.