3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

3D visual grounding aims to localize the target object in a 3D point cloud given a free-form language description. Typically, the sentences describing the target object provide information about its relative relations to other objects and its position within the whole scene. In this work, we propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net), which can effectively capture the relative spatial relationships between objects and enhance object attributes. Specifically, 1) we propose a 3D Relative Position Multi-head Attention (3DRP-MA) module to analyze relative relations from different directions in the context of object pairs, which helps the model focus on the specific object relations mentioned in the sentence. 2) We design a soft-labeling strategy to alleviate the spatial ambiguity caused by redundant points, which further stabilizes and enhances the learning process through a constant and discriminative distribution. Extensive experiments conducted on three benchmarks (i.e., ScanRefer and Nr3D/Sr3D) demonstrate that our method outperforms the state-of-the-art methods in general. The source code will be released on GitHub.


Introduction
Visual grounding aims to localize the desired objects based on a given natural language description. With the rapid development and wide application of 3D vision (Xia et al., 2018; Savva et al., 2019; Zhu et al., 2020; Wang et al., 2019) in recent years, the 3D visual grounding task has received more and more attention. Compared to the well-studied 2D visual grounding (Yang et al., 2019; Kamath et al., 2021; Yang et al., 2022; Li and Sigal, 2021; Deng et al., 2021; Plummer et al., 2015; Kazemzadeh et al., 2014), the input sparse point clouds in the 3D visual grounding task are more irregular and more complex in terms of spatial positional relationships, which makes it much more challenging to locate the target object.

Figure 1: 3D visual grounding is the task of grounding a description in a 3D scene. In the sentences, all the words indicating the relative positions of the target object are bolded. Notice that relative position relations between objects are crucial for distinguishing the target object, and the relative position-related descriptions in 3D space are complex (e.g., "above", "on the left", "in front of", and "next to", etc.).
In the field of 3D visual grounding, previous methods can mainly be categorized into two groups: the two-stage approaches (Chen et al., 2020; Achlioptas et al., 2020; Zhao et al., 2021b; Yuan et al., 2021; Huang et al., 2022; Cai et al., 2022; Huang et al., 2021; Wang et al., 2023) and the one-stage approaches (Luo et al., 2022). The former follow the detection-and-rank paradigm, and thanks to the flexibility of this architecture, they mainly explore the benefits of different object relation modeling methods for discriminating the target object. The latter fuse visual-text features to predict the bounding boxes of the target objects directly, and enhance the object attribute representation by removing the unreliable proposal generation phase.
However, both groups of methods still have limitations. For two-stage methods, model performance is highly dependent on the quality of the object proposals. Due to the sparsity and irregularity of the input 3D point cloud, sparse proposals may leave out the target object, while dense proposals bring redundant computational costs and make the matching stage too complicated to distinguish the target object. As for one-stage methods, although the existing approach (Luo et al., 2022) achieves better performance, it cannot capture the relative spatial relationships between objects, so it often fails on samples that rely on relative relation reasoning. As shown in Fig. 1, the majority of sentences in 3D visual grounding contain relative spatial relation descriptions. Furthermore, due to the spatial complexity of 3D scenes, there are various relative position-related descriptions from different orientations. To further illustrate that relative position is a general and fundamental issue in 3D visual grounding, we analyze the frequency of relative position words in ScanRefer and Nr3D/Sr3D; the results show that at least 90% of the sentences describe the relative position of objects, and most of them contain multiple spatial relations. Detailed statistics can be found in the supplementary materials.
To alleviate the above problems, we propose a one-stage 3D visual grounding framework, named 3D Relative Position-aware Network (3DRP-Net). Our 3DRP-Net combines and enhances the advantages of two-stage approaches for relation modeling and one-stage approaches for proposal-free detection while avoiding the shortcomings of both. For relation modeling, we devise a novel 3D Relative Position Multi-head Attention (3DRP-MA) module, which captures object relations along multiple directions and fully considers the interaction between the relative position and the object pair, which is ignored in previous two-stage methods (Yuan et al., 2021; Zhao et al., 2021b; Huang et al., 2021). Specifically, we first extract features from the point cloud and the description, and select key points. Then, the language and visual features interact while considering the relative relations between objects: we introduce learnable relative position encodings into the different heads of multi-head attention to capture object pair relations from different orientations. Moreover, in sentences, the relative relations between objects are usually described as "Object 1-Relation-Object 2", such as "tv is on the tv cabinet" and "curtain is hanging on the window" in Fig. 1. The relation is meaningful only in the context of the object pair; thus our relative position encoding interacts with the object pair's features to better capture and focus on the mentioned relations.
Besides, as discussed in (Qi et al., 2019), point clouds only capture the surfaces of objects, so 3D object centers are likely to be far away from any point. To accurately reflect the location of objects and learn comprehensive object relation knowledge, we sample multiple key points for each object. However, redundant key points may lead to ambiguity. To achieve disambiguation while promoting a more stable and discriminative learning process, we propose a soft-labeling strategy that uses a constant and discriminative distribution as the target label instead of relying on unstable and polarized hard labels or IoU scores.
Our main contributions can be summarized as follows: • We propose a novel single-stage 3D visual grounding model, called 3D Relative Position-aware Network (3DRP-Net), which for the first time captures relative position relationships in the context of object pairs for better spatial relation reasoning.
• We design a 3D Relative Position Multi-head Attention (3DRP-MA) module for simultaneously modeling spatial relations from different orientations of 3D space. Besides, we devise a soft-labeling strategy to alleviate ambiguity while further enhancing the discriminative ability of the optimal key point and stabilizing the learning process.
• Extensive experiments demonstrate the effectiveness of our method. Our 3DRP-Net achieves state-of-the-art performance on three mainstream benchmark datasets: ScanRefer, Nr3D, and Sr3D.
2 Related Work

3D Visual Grounding
Recent works in 3D visual grounding can be summarized in two categories, two-stage and one-stage methods, which we briefly review in the following. Two-stage Methods. Two-stage approaches follow the detection-and-rank scheme. In the first stage, 3D object proposals are generated by a pre-trained 3D object detector (Chen et al., 2020) or from the ground truth (Achlioptas et al., 2020).
In the second stage, the best-matching proposals are selected by leveraging the language description. Advanced two-stage methods achieve good performance by better modeling the relationships among objects. Referit3D (Achlioptas et al., 2020) and TGNN (Huang et al., 2021) make use of graph neural networks (Scarselli et al., 2008) to model the relationships between objects. 3DVG-Transformer (Zhao et al., 2021b) utilizes attention mechanisms (Vaswani et al., 2017) to enable interactions between proposals, and its similarity matrix can be adjusted based on the relative Euclidean distances between each pair of proposals.
One-stage Methods. One-stage approaches avoid the unstable and time-consuming object proposal generation stage of the detection-and-rank paradigm. The visual features extracted by the backbone are directly and densely fused with the language features, and the fused features are leveraged to predict the bounding boxes and referring scores. 3D-SPS (Luo et al., 2022) first addresses the 3D visual grounding problem with a one-stage strategy: it filters out the key points of language-relevant objects and performs inter-modal interaction to progressively down-sample the key points.
Our work utilizes the advanced one-stage framework and introduces a novel relative relation module to effectively capture the intricate relations between objects, enabling our model to achieve superior performance.

Position Encoding in Attention
The attention mechanism is the primary component of the transformer (Vaswani et al., 2017). Since the attention mechanism is order-independent, information about the position should be injected for each token. In general, there are two mainstream encoding methods: absolute and relative position encoding. Absolute Position Encoding. The original transformer (Vaswani et al., 2017) considers absolute positions, and the encodings are generated based on sinusoids of varying frequency. Recent 3D object detection studies also use absolute position encodings. In Group-free (Liu et al., 2021b), the encodings are learned from the center and size of the predicted bounding box, while a Fourier function is used in 3DETR (Misra et al., 2021). Relative Position Encoding. Recently, advanced works in natural language processing (He et al., 2020; Raffel et al., 2020; Shaw et al., 2018) and image understanding (Liu et al., 2021a; Hu et al., 2018, 2019) generate position encodings based on the relative distance between tokens. Relative relation representations are important for tasks where the relative ordering or distance matters.
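For reference, the sinusoidal absolute encoding from the original transformer can be sketched in a few lines of NumPy. This is purely illustrative background (the 3D methods above replace it with learned or Fourier encodings), and the function name is ours:

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Absolute sinusoidal position encoding (Vaswani et al., 2017)."""
    positions = np.arange(num_positions)[:, None]                   # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs[None, :]                             # (P, dim/2)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)   # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return enc
```

Because the encoding depends only on the absolute token index, it cannot by itself express the pairwise relations that the relative schemes below target.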
Our method extends relative position encoding to 3D Euclidean space and enhances relative relation reasoning ability in 3D visual grounding.

Method
This section introduces the proposed 3D Relative Position-aware Network (3DRP-Net) for 3D visual grounding. In Sec. 3.1, we present an overview of our method. In Sec. 3.2, we dive into the technical details of the 3D Relative Position Multi-head Attention (3DRP-MA) module and how to comprehensively and efficiently exploit spatial position relations in the context of object pairs. In Sec. 3.3 and Sec. 3.4, we introduce our soft-labeling strategy and the training objective of our method.

Overview
The 3D visual grounding task aims to find the object most relevant to a given textual query.So there are two inputs in the 3D visual grounding task.One is the 3D point cloud which is represented by the 3D coordinates and auxiliary features (RGB values and normal vectors in our setting) of N points.Another input is a free-form natural language description with L words.
The overall architecture of our 3DRP-Net is illustrated in Fig. 2. Firstly, we adopt the pre-trained PointNet++ (Qi et al., 2017) to sample S seed points and K key points from the input 3D point cloud and extract the C-dimensional enriched point features. For the language input, we use a pre-trained language encoder (Radford et al., 2021) to encode the L-word sentence into D-dimensional word features. Secondly, a stack of transformer layers is applied for multi-modal fusion: the features of the key points interact with the language and seed points to gather the scene and language information for detection and localization. Our new 3D relative position multi-head attention in each layer enables the model to understand vital relative relations among objects in the context of each object pair. Eventually, we use two standard multi-layer perceptrons to regress the bounding box and predict the referring confidence score based on the feature of each key point. As shown in Fig. 2, in the training phase, we generate the target labels of the referring scores based on the IoUs of the predicted boxes. During inference, we only select the key point with the highest referring score to regress the target bounding box.

3D Relative Position Multi-head Attention
When describing an object in 3D space, relations between objects are essential to distinguish objects of the same class. Given the spatial complexity of 3D space and the potentially misleading similar relative positions between different object pairs, a precise and thorough comprehension of relative position relationships is crucial for 3D visual grounding. However, existing 3D visual grounding methods fail to effectively address complex spatial reasoning challenges, thereby compromising their performance. To address this limitation, we propose a novel 3D relative position multi-head attention to model object relations in the context of corresponding object pairs within an advanced one-stage framework.

Relative Position Attention
Before detailing our relative position attention, we briefly review the original attention mechanism in (Vaswani et al., 2017). Given an input sequence x = {x_1, ..., x_n} of n elements, where x_i ∈ R^{d_x}, attention produces an output sequence z = {z_1, ..., z_n} of the same length, where z_i ∈ R^{d_z}. For single-head attention, the output can be formulated as:

z_i = Σ_j a_{i,j} (x_j W^V),  a_{i,j} = softmax_j((x_i W^Q)(x_j W^K)^T / √d_z),  (1)

where W^Q, W^K, W^V ∈ R^{d_x×d_z} are the projection matrices and a_{i,j} is the attention weight from element i to element j.
Based on the original attention mechanism, we propose a novel relative position attention that incorporates relative position encoding between elements. Since the semantic meaning of a relative relation "Object 1-Relation-Object 2" is highly dependent on the object pair involved, it is essential for the position encoding to fully interact with the object features in order to accurately capture the specific relative relations mentioned in the description. To this end, the attention weight a_{i,j} in our proposed relative position attention is calculated as:

a_{i,j} = softmax_j(((x_i W^Q)(x_j W^K)^T + (x_i W^Q)(r^k_{p(d_ij)})^T + r^q_{p(d_ji)}(x_j W^K)^T) / √d_z),  (2)

where d_ij represents the relative distance from element i to element j, while d_ji is the opposite; p(d) ∈ [0, 2k) is an index function that maps continuous distances to discrete values, as detailed in Eq. 4; and r^k_{p(·)}, r^q_{p(·)} ∈ R^{(2k+1)×d_z} are the learnable relative position encodings. Considering a typical object relation expression "Object 1-Relation-Object 2", our attention weight can be understood as a sum of three attention scores over the object pair and the relation: Object 1-to-Object 2, Object 1-to-Relation, and Relation-to-Object 2.
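The three-term attention weight can be sketched concretely as follows. This is a minimal NumPy illustration rather than the paper's implementation: the bucket matrix `idx[i, j] = p(d_ij)` and the encoding tables `r_k`, `r_q` stand in for the learned quantities described above, and all names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_attention(x, Wq, Wk, Wv, r_k, r_q, idx):
    """x: (n, dx); Wq/Wk/Wv: (dx, dz); r_k, r_q: (2k+1, dz);
    idx: (n, n) integer buckets with idx[i, j] = p(d_ij)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    dz = q.shape[-1]
    logits = q @ k.T                                          # Object1-to-Object2
    logits = logits + np.einsum('id,ijd->ij', q, r_k[idx])    # Object1-to-Relation
    logits = logits + np.einsum('ijd,jd->ij', r_q[idx.T], k)  # Relation-to-Object2 (uses d_ji)
    a = softmax(logits / np.sqrt(dz), axis=-1)
    return a @ v, a
```

Note that `idx.T` supplies the opposite-direction bucket p(d_ji), so the second position term attends from the relation to the key element, as in the decomposition above.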

Piecewise Index Function
The points in a 3D point cloud are unevenly distributed in Euclidean space, and the relative distances between them are continuous. To enhance the relative spatial information and reduce computation costs, we propose to map the continuous 3D relative distances into discrete integers in a finite set. Inspired by (Wu et al., 2021), we use the following piecewise index function:

p(d) = [d], if |d| ≤ α;  p(d) = sign(d) · min(β, [α + (β − α) · ln(|d|/α) / ln(β/α)]), otherwise,  (4)

where [·] is a round operation and sign(·) represents the sign of a number, i.e., returning 1 for positive input, -1 for negative, and 0 otherwise.

Eq. 4 performs a fine mapping within the α range; the further beyond α, the coarser the mapping becomes, and distances beyond β are all mapped to the same value. In the 3D understanding field, many studies (Zhao et al., 2021a; Misra et al., 2021) have demonstrated that neighboring points are much more important than distant ones. Therefore, mapping from continuous space to discrete values by Eq. 4 does not lead to much semantic information loss while significantly reducing computational costs.
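A small NumPy sketch of such a piecewise index function follows. The parameter values (α = 2, β = 8) are illustrative defaults, not the paper's settings, and the exact logarithmic form is our reading of the (Wu et al., 2021)-style function:

```python
import numpy as np

def piecewise_index(d, alpha=2.0, beta=8.0):
    """Map continuous relative distances to discrete buckets:
    exact rounding within [-alpha, alpha], logarithmic coarsening
    beyond, clipped to +/- beta."""
    d = np.asarray(d, dtype=float)
    out = np.round(d)
    far = np.abs(d) > alpha
    scaled = alpha + (beta - alpha) * np.log(np.abs(d[far]) / alpha) / np.log(beta / alpha)
    out[far] = np.sign(d[far]) * np.minimum(beta, np.round(scaled))
    return out.astype(int)
```

Nearby distances keep their fine-grained buckets, while all distances past β collapse into a single bucket, bounding the size of the relative position encoding table.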

Multi-head Attention for 3D Position
So far, our relative position attention module can handle the interaction between object features and relative position information in continuous space. However, points in 3D space have much more complicated spatial relations than pixels in 2D images or words in 1D sentences. As shown in Table 4, relying on a single relative distance metric leads to an insufficient and partial capture of inter-object relations, which makes it difficult to distinguish the target object when multiple spatial relations are described in the language expression. Therefore, we capture object relations from multiple directions. Specifically, we encode the relative distances along the x, y, and z coordinates and under the Euclidean metric, denoted as D_x, D_y, D_z, and D_e, respectively. These four relative position metrics cover most of the object relations in language descriptions (e.g., D_x for "left, right", D_y for "front, behind", D_z for "top, bottom", and D_e for "near, far"). Based on the architecture of multi-head attention, each relative position encoding is injected into the relative position attention module of a separate head. Such a 3DRP-MA allows the model to jointly attend to information from different relative relations in 3D space.
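The four relative distance tensors can be computed from the key-point coordinates with simple broadcasting; a minimal sketch (function name ours):

```python
import numpy as np

def relative_distance_metrics(xyz):
    """xyz: (n, 3) key-point coordinates. Returns the signed per-axis
    offsets D_x, D_y, D_z and the Euclidean distances D_e, each of
    shape (n, n), with D[i, j] measured from point i to point j."""
    diff = xyz[None, :, :] - xyz[:, None, :]   # diff[i, j] = xyz[j] - xyz[i]
    d_x, d_y, d_z = diff[..., 0], diff[..., 1], diff[..., 2]
    d_e = np.linalg.norm(diff, axis=-1)
    return d_x, d_y, d_z, d_e
```

Each metric would then be discretized by the index function of Eq. 4 and routed to its own attention head, so one head specializes in "left/right", another in "front/behind", and so on.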

Soft-labeling Strategy
Because the object centers are often not contained in the given point cloud, we select multiple key points for each object to better reflect its location. Therefore, as shown in Fig. 3, there will be many accurately predicted boxes achieving a high Intersection over Union (IoU) with the target object. Previous methods (Chen et al., 2020; Zhao et al., 2021b; Luo et al., 2022) use one-hot or multi-hot labels to supervise the referring score: the key points whose predicted boxes have the top N_s highest IoUs are set to 1, and the others are set to 0, which encourages the model to select the highest-IoU proposals. However, this simple hard-labeling strategy causes two problems. Firstly, proposals with similar and high IoUs may be labeled differently as 1 and 0, which can cause an unstable training phase. Secondly, it becomes difficult to distinguish between optimal and sub-optimal proposals, affecting the model's ability to identify the most accurate proposal.
To tackle these issues, we introduce a soft-labeling strategy to smooth the label distribution and encourage the model to effectively distinguish the optimal proposal. To be specific, the soft-labeling function can be calculated as:

ŝ_i = exp(−i² / (2σ²)),  (5)

where i ∈ {0, ..., N_s} represents the rank of the i-th highest IoU, and σ, set to [N_s/3], controls the smoothness of the distribution. The target label of the key point whose predicted box's IoU is the i-th highest and greater than 0.25 is set to ŝ_i, and the labels of all other key points are set to 0. Although this strategy is simple, it serves several purposes at once, and the insight it provides is non-trivial.
For discriminative ability, the soft labels enhance the difference between the optimal and sub-optimal proposals, which forces the model to accurately identify the best key point for regressing the detection box. In contrast, when hard labels or IoU scores are used as the target labels, there is little difference between optimal and sub-optimal proposals from the perspective of the learning objective. For stability, compared to hard labels, our soft labels cover a broader range of accurate proposals with a smoother label distribution, and excluding the proposals with low IoU further stabilizes the learning process. Additionally, compared to directly using IoU scores, the constant distribution of the soft labels provides a more stable loss across different samples. For example, consider two samples with vastly different target objects, such as a large bed and a small chair: the bed sample would have significantly more key points selected, resulting in more proposals for the target object. Using IoU scores as labels would ultimately lead to a much larger loss for the bed sample than for the chair sample, which is clearly unreasonable.
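A minimal sketch of the strategy, assuming a Gaussian-over-rank form for the (elided) Eq. 5; the 0.25 IoU threshold follows the text, and the helper name is ours:

```python
import numpy as np

def soft_labels(ious, n_s, iou_thresh=0.25):
    """ious: (K,) IoU of each key point's predicted box with the GT box.
    Assigns s_i = exp(-i^2 / (2 sigma^2)) to the key point with the i-th
    highest IoU (i = 0..n_s) if its IoU exceeds the threshold, else 0."""
    sigma = max(round(n_s / 3), 1)
    labels = np.zeros(len(ious), dtype=float)
    order = np.argsort(-np.asarray(ious))   # indices by descending IoU
    for rank, j in enumerate(order[:n_s + 1]):
        if ious[j] > iou_thresh:
            labels[j] = np.exp(-rank**2 / (2 * sigma**2))
    return labels
```

Because the label depends on the rank rather than the raw IoU value, the label mass per sample is roughly constant, which is exactly the stability property argued for above.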

Training and Inference
We apply a multi-task loss function to train our 3DRP-Net in an end-to-end manner. Referring Loss. The referring loss L_ref is calculated between the target labels Ŝ discussed in Sec. 3.3 and the predicted referring scores S of the K key points with a focal loss (Lin et al., 2017). Keypoint Sampling Loss. Following the loss used in (Luo et al., 2022), we apply the key-point sampling loss L_ks to make sure the selected key points are relevant to an object whose category is mentioned in the description. Detection Loss. To supervise the predicted bounding boxes, we use the detection loss L_det as an auxiliary loss. Following (Luo et al., 2022), L_det consists of a semantic classification loss, an objectness binary classification loss, a center offset regression loss, and a bounding box regression loss. Language Classification Loss. Similar to (Chen et al., 2020), we introduce the language classification loss L_text to enhance the language encoder.
Finally, the overall loss function in the training process can be summarized as

L = α_1 L_ref + α_2 L_ks + α_3 L_det + α_4 L_text,

where the balancing factors α_1, α_2, α_3, α_4 are set by default to 0.05, 0.8, 5, and 0.1, respectively, and L_ref and L_det are applied on all decoder stages following the setting in (Qi et al., 2019).
For ScanRefer (Chen et al., 2020), following previous work, we use Acc@mIoU as the evaluation metric, where m ∈ {0.25, 0.5}. This metric represents the ratio of predicted bounding boxes whose Intersection over Union (IoU) with the ground-truth (GT) bounding box is larger than m. For Sr3D and Nr3D (Achlioptas et al., 2020), the ground-truth bounding boxes are available, and the model only needs to identify the described object among all the bounding boxes. Therefore, the evaluation metric for these two datasets is accuracy, i.e., the percentage of correctly selected target objects.
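For axis-aligned boxes, the Acc@mIoU metric can be sketched as follows; the `(xmin, ymin, zmin, xmax, ymax, zmax)` box format is an assumption for illustration:

```python
import numpy as np

def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    inter = np.prod(np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return inter / union

def acc_at_miou(pred_boxes, gt_boxes, m=0.5):
    """Fraction of predicted boxes whose IoU with the GT box exceeds m."""
    hits = [aabb_iou(p, g) > m for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```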

Quantitative Comparison
We compare our 3DRP-Net with other state-of-the-art methods on these three 3D visual grounding benchmarks. ScanRefer. Table 1 shows the performance on ScanRefer. 3DRP-Net outperforms the best two-stage method by +4.20 at Acc@0.25 and +4.40 at Acc@0.5, and exceeds the best one-stage method by +2.45 at Acc@0.25 and +2.47 at Acc@0.5. Even when compared to 3DJCG, which utilizes the extra Scan2Cap (Chen et al., 2021) dataset to assist its training, our 3DRP-Net still shows superiority in all metrics. Specifically, on the "Multiple" subset, 3DRP-Net achieves +2.66 and +2.34 gains over the advanced one-stage model in terms of Acc@0.25 and Acc@0.5, which validates that the proposed 3DRP-MA module is powerful for modeling complex relative position relations in 3D space and significantly contributes to distinguishing the described target object from multiple interfering objects.
Table 1: Comparisons with state-of-the-art methods on ScanRefer. We highlight the best performance in bold.

Nr3D/Sr3D. Note that the task of Nr3D/Sr3D is different from ScanRefer: it aims to identify the described target object from all the given ground-truth bounding boxes. Therefore, the soft-labeling strategy and the key-point sampling module are removed, and we only verify the effectiveness of 3DRP-MA on these two datasets. Besides, the data augmentation methods in ViL3DRel (Chen et al., 2022) are also used in our training phase for a fair comparison. The accuracy of our method, together with other state-of-the-art methods, is reported in Table 2. 3DRP-Net achieves an overall accuracy of 65.9% and 74.1% on Nr3D and Sr3D, respectively, outperforming all existing methods by a large margin. On the more challenging "Hard" subset, 3DRP-Net improves the accuracy by +2.3% on Nr3D and +1.6% on Sr3D, again demonstrating that our method is beneficial for distinguishing objects by capturing relative spatial relations.

Ablation Study
We conduct ablation studies to investigate the contribution of each component.All the ablation study results are reported on the ScanRefer validation set.
Relation Modeling Module. We compare our proposed 3DRP-MA with the relation modules of other 3D visual grounding methods. For fair comparisons, we also provide the distances along the x, y, z coordinates and in Euclidean space to the other relation modules. The results are provided in Table 3; comparing rows 1, 2, and 6, our 3DRP-MA is far superior to the relation modules in 3DVG-Trans and 3DJCG, and the performance improvement mainly comes from the subsets that rely on relative relationship reasoning for localization, namely the "One-Rel" and "Multi-Rel" subsets. Pair-aware Relation Attention. The typical description of a spatial relation can be expressed as "Object 1-Relation-Object 2". Our pair-aware relation attention can be considered as the sum of two scores: Object 1-to-Relation (O1-R) and Relation-to-Object 2 (R-O2). To further verify the superiority of capturing the relation in the context of an object pair, we ablate the two scores, and the results are illustrated in Table 4. From rows 1, 2, and 5, both the O1-R and R-O2 terms benefit the 3D visual grounding task by capturing the relative relations, and the joint use of O1-R and R-O2 provides a more comprehensive understanding of spatial relation descriptions and leads to the best performance. 3DRP-MA in Each Layer. We study the effect of each 3DRP-MA module in the transformer layer. SA_1, CA, and SA_2 respectively denote whether to replace the self-attention before interacting with seed points, the cross-attention between key points and seed points, and the self-attention before interacting with language. Rows 3 to 5 in Table 4 add each 3DRP-MA in turn, and the performance gradually improves to 50.10% and 38.90%. Soft-labeling Strategy. Table 5 presents the performance of different labeling strategies. In hard-labeling, N_s represents the number of key points whose IoU is in the top N_s and greater than 0.25, which are labeled as 1. In soft-labeling, N_s is the hyperparameter in Eq. 5, which controls the number of soft labels. To further demonstrate that our proposed strategy improves stability and discrimination, we also use IoU scores as labels. The "original" setting directly uses IoUs as labels, while the "linear" setting stretches the IoUs linearly to the range of 0 to 1 to enhance discrimination. Compared to the hard-labeling and IoU-based methods, our soft-labeling strategy improves discrimination and stability. The "original" IoU method lacks discriminative power and stability due to the unbalanced loss across samples, and even linear scaling to enhance discriminative power cannot eliminate this instability. Our method alleviates these problems with a discriminative constant distribution and shows comprehensive superiority in Table 5.

Conclusion
In this paper, we propose a relation-aware one-stage model for 3D visual grounding, referred to as the 3D Relative Position-aware Network (3DRP-Net). 3DRP-Net contains novel 3DRP-MA modules to exploit complex 3D relative relations within point clouds. Besides, we devise a soft-labeling strategy to achieve disambiguation while promoting a stable and discriminative learning process. Comprehensive experiments reveal that our 3DRP-Net outperforms other methods.

Limitations
The datasets for the 3D visual grounding task all stem from the original ScanNet dataset, which brings generalization to other scene types into question. More diverse benchmarks are important for the further development of the field of 3D visual grounding.

which makes it very difficult to identify the cabinet in the scene.
• Challenging auxiliary objects. The 3D visual grounding task often requires relations between the target object and auxiliary objects to assist localization, and challenging auxiliary objects may result in an incorrect prediction. As shown in case 5 of Figure 5, the target table is on "the left of the bed", but the left and right sides of a bed are difficult to distinguish: the direction of the bed must be identified from the position of the pillows. This reasoning process is too complex for our model, and our prediction actually found the table on the right side of the bed. In case 6, the auxiliary object is the "chair of the cubicles", which is challenging for the model to recognize.

B Statistics of Relative Position Words
To further illustrate that the relative position relation is a general and fundamental issue in the 3D visual grounding task, we count some common words representing relative spatial relations in three 3D visual grounding datasets (i.e., ScanRefer (Chen et al., 2020), Nr3D (Achlioptas et al., 2020), and Sr3D (Achlioptas et al., 2020)).

C Implementation Details
We adopt the pre-trained PointNet++ (Qi et al., 2017) and the language encoder of CLIP (Radford et al., 2021) to extract the features from the point clouds and language descriptions, respectively, while the rest of the network is trained from scratch.

In the ablation study, we further divided the "Multiple" set of ScanRefer into "Non-Rel/One-Rel/Multi-Rel" subsets according to the number of relational descriptions in the sentences. Specifically, we follow the statistical method in Sec. B to count common words representing relative spatial relations.

D Prior Methods for Comparison
In order to validate the effectiveness of the proposed 3DRP-Net, Sec. 4.2 comprehensively compares it to many previous state-of-the-art methods: 1) ReferIt3DNet (Achlioptas et al., 2020), 2) Scan-

Figure 2: 3DRP-Net is a transformer-based one-stage 3D VG model which takes a 3D point cloud and a description as inputs and outputs the bounding box of the object most relevant to the input expression. In the stacked transformer layers, the 3DRP-MA captures the relative relations between points from a 3D perspective. Specifically, the two self-attentions based on 3DRP-MA capture the relative relations between objects, while the cross-attention between key points and seed points enhances the global position information.

Figure 3: Comparison of various labeling strategies.

Figure 4: The visualization results of some success cases. The blue/green/red colors indicate the ground-truth/correct/incorrect boxes.

Figure 5: The visualization results of some failure cases. The ground-truth boxes are labeled in blue and the incorrectly predicted boxes are marked in red.
The statistics over ScanRefer, Nr3D (Achlioptas et al., 2020), and Sr3D (Achlioptas et al., 2020) are shown in Figures 6 and 7. From Figure 6, in ScanRefer, at least 97% of descriptions contain relative position relations, and more than 63% of sentences use multiple relative position relations to indicate the target object. Besides, about 90% of sentences in Nr3D use relative position words, and almost all the samples in Sr3D require relative position relations between objects for localization. As shown in Figure 7, in ScanRefer and Nr3D, which collected human utterances as descriptions, most of the commonly used relative position words appear in the sentences. This further demonstrates the importance of modeling relative position relations from different perspectives.

Figure 6: Ratio of sentences containing the specific number of relative position words in three 3D visual grounding datasets.

Table 2: Comparisons with state-of-the-art methods on Nr3D and Sr3D. We highlight the best performance in bold.

Table 3: Ablation studies on relative position encoding and different relation modeling modules. None-Rel/One-Rel/Multi-Rel represent subsets that contain zero/one/multiple relation descriptions in the original Multiple set of ScanRefer, and the relative percentage improvements compared to the different settings are marked in green.

Table 4: Ablation studies on 3DRP-MA in each transformer layer and on pair-aware relation attention.

Table 5: Ablation studies on the labeling strategies.