4.2. Experimental Results
Performance evaluation was conducted using the standard metric of grounding accuracy. A prediction is considered successful if the Intersection over Union (IoU) between the predicted 3D bounding box and the ground-truth bounding box exceeds a specified threshold $\tau$. The IoU is computed from the volumetric overlap between the two bounding boxes, as defined in Equation (11). In this study, Acc@$\tau$, a standard metric widely adopted in 3D visual grounding benchmarks [1,2], is employed as the primary metric for comprehensive performance comparison, as defined in Equation (12). For evaluation, the threshold $\tau$ was set to 0.25 and 0.5.
Here, $N$ denotes the total number of queries, and $\mathbb{1}[\cdot]$ represents an indicator function that returns 1 when the condition is true.
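For convenience, the two metrics can be restated in their standard form, consistent with Equations (11) and (12) as referenced above; the symbols $B_{\mathrm{pred}}$ and $B_{\mathrm{gt}}$ for the predicted and ground-truth boxes are introduced here for illustration:

```latex
\mathrm{IoU}\left(B_{\mathrm{pred}}, B_{\mathrm{gt}}\right)
  = \frac{\left| B_{\mathrm{pred}} \cap B_{\mathrm{gt}} \right|}
         {\left| B_{\mathrm{pred}} \cup B_{\mathrm{gt}} \right|},
\qquad
\mathrm{Acc@}\tau
  = \frac{1}{N} \sum_{i=1}^{N}
    \mathbb{1}\!\left[ \mathrm{IoU}\left(B_{\mathrm{pred}}^{(i)}, B_{\mathrm{gt}}^{(i)}\right) > \tau \right].
```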
In the Nr3D dataset, performance is further analyzed by categorizing queries according to their difficulty level—Easy (with one distractor object) and Hard (with multiple distractors)—as well as by whether viewpoint information is required for interpretation, namely View-Dependent (requiring a specific viewpoint) and View-Independent (interpretable from any viewpoint). In the ScanRefer dataset, evaluation is divided based on whether the scene contains multiple instances of the same object class: Unique (a single instance) and Multiple (multiple instances). This categorization focuses on assessing the model’s discriminative capability between visually similar objects.
To assess the effectiveness of the proposed method, performance is compared across four model configurations.
First, Ori Baseline 72B represents the original SeeGround performance using the 72B-parameter VLM, serving as the upper-bound reference. Second, Ori Baseline 7B corresponds to the performance obtained by running the original SeeGround logic with a 7B VLM, reflecting the impact of reducing model size. Both of these results are cited directly from the original SeeGround paper. Third, Reproduced 7B (Baseline) denotes our reproduced SeeGround pipeline implemented in our experimental environment using the lightweight 7B VLM, and serves as the primary baseline for direct comparison. Finally, Ours 7B+Depth refers to our final model, in which the proposed RDT module is integrated into the baseline. Thus, the performance gap between Reproduced 7B and Ours 7B+Depth directly reflects the contribution of the proposed methodology.
Table 1 and
Table 2 present the quantitative evaluation results on the Nr3D and ScanRefer datasets, respectively. The proposed model (Ours 7B+Depth) demonstrated meaningful performance variations depending on the dataset characteristics compared to the direct baseline (Reproduced 7B).
For the Nr3D dataset (Table 1), our model achieved a consistent performance improvement of 3.54% in overall Acc@25. This improvement was observed across all subsets (Easy, Hard, Dep., Indep.), with particularly notable gains of 3.68% in the Hard subset, where spatial ambiguity is high, and 3.24% in the View-Dependent subset. These results confirm that RDT provides the VLM with additional depth-related contextual cues, enabling more accurate reasoning in complex and fine-grained grounding queries.
In contrast, the results on the ScanRefer dataset (Table 2) were mixed. For the Unique@25 category (scenes containing a single target instance), performance improved significantly by 6.74%, whereas in the Multiple@25 category (scenes containing multiple similar objects), accuracy dropped by 1.70%. This suggests that while RDT acts as a helpful cue in unambiguous or straightforward cases, it may introduce confusion in complex Multiple scenarios, where the lightweight 7B model struggles to integrate multiple textual cues effectively. A similar trend was observed in Acc@50, where overall performance decreased by 0.71%, implying that as the IoU threshold increases—requiring finer localization—RDT-induced ambiguity may lead to mispredictions.
From a computational perspective, generating depth maps and RDTs introduced negligible overhead to the overall inference time. The computational costs, including GPU memory usage and wall-clock inference time for both the Nr3D and ScanRefer datasets, are summarized in Table 3.
As shown in Table 3, the depth module requires an additional 2.4 GB of VRAM, bringing the total peak usage to 22.56 GB. Crucially, this remains within the memory capacity of a standard consumer-grade GPU (24 GB), ensuring accessibility without the need for high-end data center hardware (e.g., an A100). In terms of latency, depth estimation adds an amortized overhead of approximately 40 ms per query, since depth maps are cached per rendered view. Because the pipeline is predominantly bottlenecked by the VLM’s auto-regressive token generation, total inference time increases by only about 2% to 5%, which is negligible in practical applications.
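To make the amortization concrete, the following is a minimal sketch of per-view caching, under the assumption that depth is estimated once per rendered view and reused across queries; the `estimate_depth` stub, its simulated latency, and the view names are illustrative, not the actual pipeline code.

```python
import time
from functools import lru_cache

def estimate_depth(view_id: str) -> str:
    """Stand-in for the monocular depth model (simulated 200 ms run)."""
    time.sleep(0.2)
    return f"depth_map({view_id})"

@lru_cache(maxsize=None)
def depth_map_for_view(view_id: str) -> str:
    """Depth is computed once per rendered view; later queries hit the cache."""
    return estimate_depth(view_id)

# Five queries share one rendered view, so the 200 ms cost is paid only once.
queries = [("scene0025_00_view3", f"query {i}") for i in range(5)]

start = time.perf_counter()
for view_id, _query in queries:
    depth_map_for_view(view_id)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print(f"amortized depth overhead: {elapsed_ms / len(queries):.1f} ms/query")
```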
To provide a comprehensive evaluation, we compare our method with state-of-the-art zero-shot 3D visual grounding approaches, including VLM-Grounder, SORT3D, SeqVLM, and View-on-Graph (VoG). The results are summarized in
Table 4. It is important to note that these competing methods rely on large-scale foundation models such as GPT-4 (closed-source) or Qwen2-VL-72B, which require significant computational resources. In contrast, our method employs the lightweight Qwen2-VL-7B model and is designed for consumer-grade GPU environments (e.g., a single RTX 3090). Despite the significant disparity in model size (7B vs. 72B/GPT-4), our approach demonstrates competitive performance, particularly on the ScanRefer benchmark, where it surpasses the baseline and narrows the gap with larger models.
To complement the quantitative evaluation and to further explain the reasoning process of the proposed methodology, a qualitative analysis was conducted.
Figure 4 and
Figure 5 present representative results from the Nr3D and ScanRefer benchmarks, respectively, highlighting cases in which the baseline model (Reproduced 7B) failed while the proposed model (Ours 7B+Depth) succeeded. Each example includes the prompted image containing object ID markers, the corresponding depth map, and the VLM’s inference result. This analysis visually demonstrates how the proposed Relational Depth Text (RDT) resolves spatial ambiguity and exerts a decisive influence on the VLM’s reasoning process.
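As an illustration of how the marked image, its depth map, and the RDT text can be combined into a single query to the VLM, the following is a minimal sketch of a grounding prompt builder; the template wording, field names, and function name are our assumptions, not SeeGround’s exact prompt.

```python
# Illustrative prompt assembly: the marked rendered image and its depth map
# are passed to the VLM alongside this text. The wording is an assumption.

def build_grounding_prompt(query: str, rdt_text: str, object_ids: list) -> str:
    id_list = ", ".join(str(i) for i in object_ids)
    return (
        "You are given a rendered view of a 3D scene with numbered object "
        f"markers ({id_list}) and its corresponding depth map.\n"
        f"Depth context: {rdt_text}\n"
        f"Query: {query}\n"
        "Answer with the ID of the single object that best matches the query."
    )

print(build_grounding_prompt(
    query="Find the window next to the white desk.",
    rdt_text="From the current viewpoint, object 9 is visible.",
    object_ids=[9, 10, 12],
))
```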
The first case in
Figure 4 involves a relatively simple spatial relation query: “Find the window next to the white desk.” In the rendered 2D view, two visually similar windows (IDs 10 and 12) are positioned side by side near the reference object, the white desk (ID 9), creating apparent ambiguity. Without access to depth information, the baseline model misinterpreted the “next to” relationship and incorrectly selected object 10.
In contrast, the proposed model correctly identified the target window (ID 12). The generated RDT in this case, “From the current viewpoint, object 9 is visible,” does not provide explicit depth comparison information. Since the query lacks explicit depth-related keywords such as “near” or “behind,” a fallback text was generated. Nevertheless, this text played an essential role as a reasoning anchor, guiding the VLM to focus its reasoning on the reference object (object 9). The phrase “object 9 is visible” implicitly instructs the model to “reconsider spatial relations relative to object 9 rather than other distractors.” As a result, the VLM re-evaluated the 3D spatial relationships among objects 9, 10, and 12, ultimately selecting the correct window (ID 12).
In conclusion, this case demonstrates that even when RDT does not provide explicit depth comparison, it effectively redirects the model’s reasoning focus toward the key anchor object, thereby resolving fine-grained spatial ambiguities that the baseline model could not handle.
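To make this fallback behavior concrete, the following is a minimal sketch of keyword-gated RDT generation as described above; the keyword list and function signature are illustrative assumptions rather than the module’s actual implementation.

```python
# Sketch of the fallback described above: when the query contains no
# depth-related keyword, emit only a visibility anchor for the reference
# object instead of a closest/farthest comparison.

DEPTH_KEYWORDS = ("near", "nearest", "close", "closest", "far", "farthest",
                  "behind", "in front of", "deep")

def generate_rdt(query, anchor_id, closest_id=None, farthest_id=None):
    has_depth_cue = any(kw in query.lower() for kw in DEPTH_KEYWORDS)
    if has_depth_cue and closest_id is not None and farthest_id is not None:
        return (f"From the current viewpoint, object {closest_id} is the "
                f"closest and object {farthest_id} is the farthest.")
    # Fallback: no explicit depth comparison, only a visibility anchor.
    return f"From the current viewpoint, object {anchor_id} is visible."

print(generate_rdt("Find the window next to the white desk.", anchor_id=9))
```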
The second example in
Figure 4 highlights the limitations of 2D viewpoints and underscores the importance of 3D spatial reasoning. The query is: “Find the rectangular copier to the left of the trash bin.” From the rendered 2D view alone, it is not visually intuitive whether the copier (ID 21) is actually positioned to the left of the trash bin (ID 7). The baseline model, which relied solely on 3D coordinate information from the Object Lookup Table (OLT), evaluated multiple potential “left” candidates but incorrectly selected another visually prominent object (ID 5).
In contrast, the proposed model successfully leveraged an explicit RDT cue: “From the current viewpoint, object 21 is the closest and object 27 is the farthest.” This additional relational description provided two crucial constraints: (1) the object must be to the left of the trash bin in 3D space, and (2) it must also be the closest object from the current viewpoint. By integrating these two spatial conditions, the VLM effectively narrowed down the candidate set.
As shown in the depth map, object 21 appears in the red-shaded region, indicating that it is indeed the closest object in the current viewpoint. Consequently, the model accurately identified object 21 as the correct target satisfying both conditions.
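For illustration, closest/farthest cues of this kind can be derived by ranking objects by their mean depth inside per-object instance masks; the following minimal sketch shows one such derivation, in which the synthetic arrays and helper name are assumptions, not the actual implementation.

```python
import numpy as np

def closest_and_farthest(depth, masks):
    """Rank visible objects by mean depth within their instance masks."""
    mean_depth = {oid: float(depth[m].mean())
                  for oid, m in masks.items() if m.any()}
    return (min(mean_depth, key=mean_depth.get),
            max(mean_depth, key=mean_depth.get))

# Toy example: object 21 occupies a shallow region, object 27 a deep one.
depth = np.array([[1.0, 1.2],
                  [4.8, 5.0]])
masks = {21: np.array([[True, True], [False, False]]),
         27: np.array([[False, False], [True, True]])}

c, f = closest_and_farthest(depth, masks)
print(f"From the current viewpoint, object {c} is the closest "
      f"and object {f} is the farthest.")
```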
This case clearly demonstrates that the proposed RDT plays a decisive role in overcoming 2D projection ambiguity, enabling the VLM to disambiguate multiple 3D candidates correctly and leading to more precise grounding in complex spatial reasoning scenarios.

The final example in
Figure 4 presents a query that requires a comprehensive understanding of the entire scene composition. In this case, the baseline model incorrectly selected an object (ID 16) that is not even visible in the rendered image. This error illustrates how the lightweight 7B model, when faced with complex queries, fails to utilize the visual context—one of the key design principles of the original SeeGround—and instead relies solely on text-based coordinate information from the OLT. As a result, the viewpoint-dependent constraint expressed in the query (“the pool table is visible to the right”) was completely ignored.
In contrast, the proposed model successfully resolved this issue with the aid of the generated RDT: “From the current viewpoint, object 6 is the closest and object 15 is the farthest.” This sentence functioned as a viewpoint anchor, effectively fixing the VLM’s reasoning process to the current rendered perspective.
To interpret this RDT, the VLM first needed to locate object 15 within the image. This process inherently forces the model to attend to the visual content, preventing the neglect of visual cues that occurred in the baseline model. Once grounded to the correct viewpoint, the VLM was able to correctly interpret the visual constraint (“the pool table is visible”) and naturally exclude invisible candidates such as object 16. Furthermore, the phrase “object 15 is the farthest” helped narrow the search space to the deepest region of the scene, where the correct answer (object 18) was located. Finally, within this constrained search space, the VLM evaluated the remaining conditions and correctly identified object 18 as the target. In summary, this case demonstrates that RDT prevents the lightweight model from disregarding visual context in complex query scenarios, effectively restoring and enhancing the core reasoning capability originally intended in SeeGround’s design.
Figure 5 presents qualitative examples from the ScanRefer dataset, which exhibit patterns distinct from those observed in Nr3D. Because ScanRefer queries often rely more on unique visual attributes or simple adjacency relations between objects, the effect of RDT tends to appear more indirect than in Nr3D.

The first example in
Figure 5 corresponds to the query “Find the kitchen cabinet under the sink (ID 50).” In the rendered 2D view, the correct cabinet (ID 10) and the incorrect cabinet selected by the baseline (ID 19) are positioned side by side beneath the sink, creating an apparent visual ambiguity. The baseline model lacked additional cues to distinguish between these two visually similar candidates and thus produced an incorrect prediction (ID 19).
In contrast, the proposed model received an additional RDT cue: “From the current viewpoint, object 12 is the closest and object 10 is the farthest.” This text provides new spatial information not explicitly mentioned in the query—specifically, that the correct object (ID 10) is the farthest from the current viewpoint. Given the initial query condition “under the sink,” the VLM evaluated both candidates (10 and 19) that satisfied this constraint, but then applied the “farthest” criterion to determine which object best matched the overall context. Through this process, the VLM obtained a decisive basis to distinguish between the two visually similar candidates and successfully identified the correct object (ID 10).
In summary, this example demonstrates that RDT can provide supplementary spatial cues not explicitly stated in the query, thereby resolving ambiguity among visually similar objects and ultimately improving grounding accuracy in complex scenes.
The second example in
Figure 5 illustrates how the proposed method compensates for the VLM’s limited visual analysis capability. The query explicitly specifies a visual attribute—“Find the table with many colors.” In the rendered image, only one table (ID 11) clearly exhibits multiple colors, yet the baseline model incorrectly selected table 15.
This failure can be attributed to the model’s overreliance on OLT information and the limited perceptual capacity of the lightweight 7B VLM. The baseline VLM likely began by searching the OLT for the keyword “table” and prioritized object 15 as the most probable match. Ideally, it should then have verified the “many colors” attribute through visual inspection of the image; however, due to its restricted vision-language reasoning capability, it failed to evaluate this condition accurately and instead retained its initial OLT-based inference.
In contrast, the proposed model successfully resolved this issue with the aid of the RDT: “From the current viewpoint, object 16 is the closest and object 11 is the farthest.” This statement serves as an explicit spatial cue, directing the VLM to attend to the farthest object (ID 11) within the scene. By shifting the model’s attention toward this spatial region, the RDT implicitly encouraged the VLM to re-evaluate the visual content of the distant object, leading it to recognize the multicolored surface of table 11 and correctly identify it as the answer.
Although the VLM initially considered object 15 the most probable candidate based on the OLT, the RDT enabled it to recognize object 11 as a plausible candidate that had previously been overlooked. The VLM then compared the visual attributes of object 11 against the query phrase “many colors” and confirmed that both cues—depth positioning and visual characteristics—were consistent, ultimately reaffirming object 11 as the correct answer by integrating spatial and visual evidence.
In summary, this example demonstrates that RDT enhances the VLM’s limited visual reasoning by providing spatially grounded attention cues: it redirects attention toward the correct candidate, prompts reevaluation of the key visual features specified in the query, and thereby enables accurate predictions even when direct visual discrimination is challenging.
The final example in
Figure 5 illustrates how RDT helps the VLM eliminate incorrect candidates during the reasoning process. The query, “Find the large curtain touching the table,” requires understanding of spatial relationships. The baseline model, when faced with multiple curtain candidates (IDs 7 and 9), incorrectly selected object 7, which appeared more visually prominent and closer in the scene, due to a misinterpretation of its spatial relationship with the table (ID 0).
In contrast, the proposed model utilized the RDT: “From the current viewpoint, object 7 is the closest and object 0 is the farthest.” This statement provides contradictory spatial information compared to the baseline’s reasoning, indicating that the incorrectly selected object (ID 7) is the closest to the viewpoint, while the table (ID 0), which serves as the reference object in the query, is the farthest.
Armed with this information, the VLM recognized that the configuration “the closest object (7) is touching the farthest object (0)” is spatially implausible. This logical inconsistency prompted the model to discard object 7 as a valid candidate and to re-evaluate the remaining options, eventually identifying object 9, which more plausibly satisfies the condition “touching the table.”
In summary, this case demonstrates that RDT can guide reasoning not only by confirming correct candidates but also by refuting incorrect ones. By providing counter-evidence that exposes spatial contradictions, RDT helps the VLM reassess its reasoning path and converge on a more consistent interpretation of the 3D scene.
Through the qualitative analysis, several distinct mechanisms were observed by which RDT enhances the VLM’s 3D spatial reasoning capability. In Nr3D, where queries often involve complex depth relationships, RDT acted as a direct viewpoint anchor, supplying decisive cues that grounded the model’s reasoning in the correct spatial frame. In contrast, in ScanRefer, where queries tend to rely on simpler visual or relational cues, RDT influenced the VLM in more indirect ways—such as focusing attention on overlooked candidates or eliminating implausible ones.
Overall, these findings demonstrate that RDT adapts flexibly to different reasoning contexts, enhancing grounding robustness without any additional model training and thereby improving the overall reliability of the original SeeGround framework.