Article

Enhancing SeeGround with Relational Depth Text for 3D Visual Grounding

1 Graduate School of Automotive Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
2 Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 652; https://doi.org/10.3390/app16020652
Submission received: 6 November 2025 / Revised: 29 December 2025 / Accepted: 7 January 2026 / Published: 8 January 2026
(This article belongs to the Special Issue Advances in Computer Graphics and 3D Technologies)

Abstract

Three-dimensional visual grounding is a core technology that identifies specific objects within complex 3D scenes based on natural language instructions, enhancing human–machine interactions in robotics and augmented reality domains. Traditional approaches have focused on supervised learning, which relies on annotated data; however, zero-shot methodologies are emerging due to the high costs of data construction and limitations in generalization. SeeGround achieves state-of-the-art performance by integrating 2D rendered images and spatial text descriptions. Nevertheless, SeeGround exhibits vulnerabilities in clearly discerning relative depth relationships owing to its implicit depth representations in 2D views. This study proposes the relational depth text (RDT) technique to overcome these limitations, utilizing a Monocular Depth Estimation model to extract depth maps from rendered 2D images and applying the K-Nearest Neighbors algorithm to convert inter-object relative depth relations into natural language descriptions, thereby incorporating them into Vision–Language Model (VLM) prompts. This method distinguishes itself by augmenting spatial reasoning capabilities while preserving SeeGround’s existing pipeline, demonstrating a 3.54% improvement in the Acc@0.25 metric on the Nr3D dataset in a 7B VLM environment that is approximately 10.3 times lighter than the original model, along with a 6.74% increase in Unique cases on the ScanRefer dataset, albeit with a 1.70% decline in Multiple cases. The proposed technique enhances the robustness of grounding through viewpoint anchoring and candidate discrimination in complex query scenarios, and is expected to improve efficiency in practical applications through future multi-view fusion and conditional execution optimizations.

1. Introduction

The task of locating a specific object in a complex three-dimensional environment based on a natural language command is referred to as 3D Visual Grounding (3DVG) [1,2]. This technology has become a key component for enabling human–machine interaction in fields such as robotics and augmented reality (AR). Early studies trained models in a supervised manner by combining annotated 3D point clouds or voxel-based data with language queries. However, these approaches suffered from limited scalability due to the high cost of data annotation and the tendency to overfit to specific domains. To address these issues, recent research has proposed zero-shot approaches that perform 3DVG without 3D-specific supervised training by leveraging pre-trained 2D Vision–Language Models (VLMs). Among them, SeeGround [3] achieved state-of-the-art (SOTA) performance by introducing a pipeline that converts indoor 3D scenes into a format interpretable by VLMs. Nevertheless, SeeGround’s spatial reasoning still relies on pre-labeled object locations and handles depth relations indirectly through image rendering, which limits its ability to model relative depth relationships accurately.
Queries such as “the nearest door” or “the monitor behind the desk” require a precise understanding of contextual and relational information to be correctly localized. Although SeeGround attempts to mitigate this problem by dynamically selecting viewpoints that best match the query, its reliance on 2D rendered images still leaves depth relations only implicitly represented. Consequently, the VLM lacks explicit cues for perceiving three-dimensional spatial relationships, which constitutes a fundamental limitation of the approach.
To address the absence of explicit depth awareness in SeeGround, this paper proposes a novel methodology utilizing Relational Depth Text (RDT). The key idea is to augment the VLM prompt with depth cues by extracting depth information from rendered 2D views and converting it into textual descriptions. This enables the VLM to interpret relative spatial relationships between objects more accurately, thereby inferring their correct locations. A simplified illustration of this concept is presented in Figure 1. When our method was integrated into the lightweight SeeGround framework, it consistently improved performance on the Nr3D dataset, achieving a 3.54% increase in Acc@0.25, which demonstrates the effectiveness of the proposed approach. In contrast, on the ScanRefer dataset, the overall Acc@0.25 improved only slightly, by 0.42%, and the gains were inconsistent across categories. Specifically, accuracy for “Unique” cases, where a single object is referenced, improved by 6.74%, whereas “Multiple” cases, which require identifying one object among several similar ones, showed a 1.70% decrease. This indicates that the proposed relational depth text significantly enhances VLM performance when clear spatial cues are present, but can introduce confusion in ambiguous scenes with densely clustered similar objects.

2. Related Works

2.1. Three-Dimensional Visual Grounding

Research on 3D Visual Grounding (3DVG) has evolved from supervised learning approaches to zero-shot methodologies. Early studies primarily focused on constructing 3D–language datasets and training models based on them. ScanRefer [1] introduced a benchmark for object localization using 3D point clouds and natural language descriptions, while ReferIt3D [2] proposed the Nr3D/Sr3D datasets for fine-grained scenarios that distinguish between instances within the same object class, along with ReferIt3DNet, a graph neural network designed for modeling inter-object relationships. Subsequently, 3DVG-Transformer [4] incorporated the Transformer architecture to enhance the modeling of complex spatial relationships among objects.
However, these supervised learning approaches rely heavily on large-scale, 3D-annotated data and thus struggle to generalize to unseen vocabularies or novel scenes.
To overcome these limitations, recent works have shifted toward zero-shot approaches that leverage pre-trained large-scale models. LLM-Grounder [5] and ZSVG3D [6] employ Large Language Models (LLMs) as reasoning engines. LLM-Grounder decomposes linguistic queries into “Target” and “Landmark” components, then utilizes external 3D modules as tools to synthesize the final grounding result. ZSVG3D further extends this idea by enabling the LLM itself to generate executable visual programs that directly interpret 3D spatial relationships.
However, both methods convert 3D scene information into textual form (object labels, coordinates) before feeding it to the LLM, which inevitably results in the loss of visual information such as color and texture. To address this, VLM-Grounder [7] proposed an approach that employs a Vision–Language Model (VLM) as the agent, directly processing 2D image sequences instead of 3D point clouds.
Following this trend, recent works have sought to further refine the VLM’s perception of 3D space. SeqVLM [8] enhances this by generating proposal-guided multi-view sequences to provide the VLM with more focused visual contexts, while View-on-Graph [9] adopts a different strategy by converting 3D spatial information into scene graphs, thereby enabling the VLM to perform explicit structural reasoning without being overwhelmed by entangled visual cues.

2.2. Bridging 2D Images with 3D Spaces

To address the data format discrepancy between 3D point clouds and 2D-based Vision–Language Models (VLMs), several multi-view approaches have been proposed that leverage multiple 2D images to represent 3D space. MVT [10] introduced an approach that processes multiple rotated views of a 3D scene simultaneously to learn representations robust to viewpoint changes. ViewRefer [11] further advanced this idea by employing a Large Language Model (LLM) to expand a single textual query into multiple view-specific sentences and by introducing learnable multi-view prototypes that assign weights to each view according to its importance. More recently, ViewSRD [12] addressed linguistic complexity by proposing a Simple Relation Decoupling (SRD) module that decomposes complex multi-anchor queries into multiple single-anchor sub-queries using an LLM.
While these multi-view approaches provide rich visual information, they inevitably introduce computational overhead and redundant information due to the need to process multiple images.
In contrast, some studies attempt to infer 3D information from a single 2D image. Mono3DVG-TR [13] eliminates the need for 3D point clouds by using only a single RGB image as input and employing an embedded depth predictor to estimate 3D spatial information and predict 3D bounding boxes. DEGround [14], which operates on agent-view RGB-D images, explicitly incorporates depth maps as input to fuse 2D visual features and 3D geometric information from the early stages.
Our research aligns with these single-image-based approaches in that it utilizes a single rendered image from SeeGround and the depth information inferred from it. However, unlike prior works that primarily use depth data as a geometric feature for reconstructing 3D coordinates, our method uniquely translates inter-object depth relationships into natural language descriptions that the VLM can directly interpret. This key distinction allows our approach to integrate relational depth reasoning within the vision–language inference process itself.

2.3. Spatial Relationship and Viewpoint Perception in 3D Visual Grounding

In 3D Grounding, interpreting spatial relationships and viewpoint dependencies among objects is a critical research topic. To address this challenge, several studies have attempted to enhance spatial reasoning capability by directly modifying the model architecture. For instance, ViL3DRel [15] introduced a spatial self-attention layer within the Transformer that explicitly encodes the relative distance and direction between object pairs, employing a teacher–student learning strategy to improve relational understanding.
Alternatively, VPP-Net [16] focused on the ambiguity of viewpoint-dependent expressions, such as “the chair on the right,” where meaning changes depending on the speaker’s perspective. This work proposed first predicting the viewpoint implied by the language expression and then rotating the entire 3D scene accordingly to establish a consistent spatial reference frame.
Meanwhile, SORT3D [17] presented a zero-shot framework that combines the language understanding capability of LLMs with a rule-based spatial reasoning toolbox. In this approach, the LLM decomposes complex linguistic queries into function call sequences, while external tools handle actual spatial computations. These studies address spatial reasoning either through architectural modification, explicit scene transformation, or integration with external modules.
In contrast, our approach does not modify the pre-trained VLM. Instead, we extract depth relations from a single 2D rendered image and convert them into natural language descriptions, which are then incorporated into the prompt. This simple yet effective strategy allows the VLM to leverage its inherent language and contextual reasoning abilities to interpret spatial relationships autonomously.

3. Proposed Method

3.1. Overview of SeeGround

The proposed method was applied to the lightweight version of the baseline SeeGround framework to validate its effectiveness. SeeGround is a zero-shot framework that performs 3D visual grounding (3DVG) by leveraging the prior knowledge of large-scale 2D Vision–Language Models (VLMs) without requiring any additional 3D annotations. The core idea of SeeGround is to transform the geometric and spatial information of a 3D scene into a combination of 2D images and textual descriptions that a VLM can interpret. Figure 2 illustrates this overall process.
In the original SeeGround pipeline, a single offline preprocessing stage is performed for each indoor 3D scene S before inference. Specifically, an Open-Vocabulary 3D Detection (OVDet) model is employed to extract all N objects in the scene, obtaining their corresponding 3D bounding boxes $\mathrm{bbox}_i$ and semantic labels $\mathrm{sem}_i$. To facilitate efficient access and management of these object-level features, the extracted information is organized into an Object Lookup Table (OLT), as shown in Equation (1) below.
$\mathrm{OLT} = \{(\mathrm{bbox}_i,\ \mathrm{sem}_i)\}_{i=1}^{N} = \mathrm{OVDet}(S)$ (1)
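For illustration, the OLT can be viewed as a per-scene table keyed by object ID. The following Python sketch shows one possible organization; the class and field names are our own and do not reflect SeeGround’s actual data layout.

```python
from dataclasses import dataclass, field

@dataclass
class OLTEntry:
    """One detected object: an axis-aligned 3D box and a semantic label."""
    obj_id: int
    bbox: tuple      # (cx, cy, cz, dx, dy, dz): center and size
    sem_label: str   # open-vocabulary class name, e.g., "chair"

@dataclass
class ObjectLookupTable:
    """Scene-level table built once, offline, from the OVDet output."""
    entries: dict[int, OLTEntry] = field(default_factory=dict)

    def by_label(self, label: str) -> list[OLTEntry]:
        """Return every object whose semantic label matches a query keyword."""
        return [e for e in self.entries.values() if e.sem_label == label]
```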
In the inference stage of SeeGround, two Vision–Language Models (VLMs) are employed, both based on the Qwen2-VL-72B model. The first ($\mathrm{VLM}_{\mathrm{parse}}$) analyzes the natural language query to identify the Target object and the Anchor object, while the second ($\mathrm{VLM}_{\mathrm{ground}}$) predicts the ID of the target object using the query $Q$, the Prompted Image $I_m$, and the Spatial Text $T$ as inputs.
First, the VLM receives the natural language query $Q$ and performs linguistic analysis. The extracted keywords are compared with object labels in the Object Lookup Table (OLT) using a text-matching algorithm that accounts for spelling variations and word stems. Through this filtering process, the VLM identifies the Target $O_t$ and Anchor $O_a$ objects that serve as the reference points for grounding. This process can be formally expressed as Equation (2).
$(O_t,\ O_a) = \mathrm{VLM}_{\mathrm{parse}}(Q)$ (2)
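The stem- and spelling-tolerant text matching described above can be approximated with standard library tools, as in the sketch below; the naive suffix stemmer and the difflib-based fuzzy matching are illustrative stand-ins for the matcher actually used.

```python
import difflib

def stem(word: str) -> str:
    """Naive stemmer: strip a few common English plural suffixes."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def match_keyword_to_labels(keyword: str, olt_labels: list[str]) -> list[str]:
    """Fuzzy-match a parsed query keyword against OLT semantic labels."""
    stems = {stem(lbl.lower()): lbl for lbl in olt_labels}
    # difflib tolerates small spelling variations ("cabnet" -> "cabinet").
    hits = difflib.get_close_matches(stem(keyword.lower()), list(stems),
                                     n=3, cutoff=0.8)
    return [stems[h] for h in hits]
```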
Next, SeeGround employs the Perspective Adaptation Module (PAM) and the Fusion Alignment Module (FAM) to generate the final inputs for the VLM.
The PAM dynamically determines the 2D viewpoint that best reflects the spatial context of the given query. Specifically, the 3D center coordinate of the identified Anchor object is set as the gaze point (at), while the viewpoint (eye) is positioned relative to the scene’s global center. Using these two vectors, PyTorch3D (version 0.8.0) [18] is utilized to compute the camera’s extrinsic parameters, namely the rotation matrix $R_c$ and translation vector $T_c$.
Subsequently, the 3D point cloud $S$ is rendered into a query-aligned 2D image $I$, completing the perspective adaptation process. This transformation is formally represented by Equation (3).
$I = \mathrm{Render}(S,\ R_c,\ T_c)$ (3)
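A minimal sketch of this camera setup with PyTorch3D’s look_at_view_transform is given below; the camera_from_anchor helper and the fixed distance parameter are assumptions introduced for illustration, not SeeGround’s exact placement rule.

```python
import torch
from pytorch3d.renderer import look_at_view_transform

def camera_from_anchor(anchor_center: torch.Tensor,
                       scene_center: torch.Tensor,
                       distance: float = 3.0):
    """Aim the camera at the anchor object's 3D center (the `at` point),
    with the eye placed `distance` units away along the direction from
    the anchor toward the scene's global center."""
    direction = scene_center - anchor_center
    direction = direction / direction.norm().clamp(min=1e-6)
    eye = anchor_center + distance * direction
    # R: (1, 3, 3) rotation and T: (1, 3) translation -- the extrinsics R_c, T_c.
    R, T = look_at_view_transform(eye=eye[None], at=anchor_center[None])
    return R, T
```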
Next, the Fusion Alignment Module (FAM) integrates the rendered image $I$ with the 3D spatial information stored in the Object Lookup Table (OLT) to produce the Spatial Text ($T$) and the Prompted Image ($I_m$) that will serve as inputs to the VLM.
Specifically, the center of the top face of each object’s 3D bounding box in the OLT is projected onto the 2D rendered image $I$. At each projected location, the unique object ID is overlaid in textual form, generating a visually annotated image. Through this process, the original query $Q$, the visually prompted image $I_m$, and the spatial text $T$ are integrated to construct the final prompt fed into the VLM.
Finally, the VLM utilizes this comprehensive information to predict the most probable object ID ($\hat{o}$) referred to by the query, as expressed in Equation (4). The predicted ID ($\hat{o}$) is then mapped back to the corresponding 3D bounding box via the OLT lookup, yielding the final grounded object in 3D space.
$\hat{o} = \mathrm{VLM}_{\mathrm{ground}}(Q,\ I_m,\ T)$ (4)
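The projection-and-overlay step of the FAM can be sketched as follows, assuming a PyTorch3D camera and a PIL image; the annotate_ids helper and the 6-tuple bbox layout are hypothetical simplifications of the actual implementation.

```python
import torch
from PIL import Image, ImageDraw
from pytorch3d.renderer import PerspectiveCameras

def annotate_ids(image: Image.Image, camera: PerspectiveCameras,
                 olt: dict[int, dict]) -> Image.Image:
    """Overlay each object's ID at the projected center of its bbox top face."""
    draw = ImageDraw.Draw(image)
    for obj_id, entry in olt.items():
        cx, cy, cz, dx, dy, dz = entry["bbox"]       # center and size
        top_center = torch.tensor([[cx, cy, cz + dz / 2.0]])
        # Map the 3D point to pixel coordinates in the rendered view.
        uv = camera.transform_points_screen(
            top_center, image_size=(image.height, image.width))[0]
        draw.text((float(uv[0]), float(uv[1])), str(obj_id), fill="red")
    return image
```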

3.2. Limitations of SeeGround

Existing approaches for handling 3D spatial understanding exhibit apparent limitations. In agent-based methods [5,6,7], Large Language Models (LLMs) depend on external modules for geometric computation, which introduces computational overhead and the risk of error propagation. Meanwhile, learning-based methods [10,11,12] often suffer from distortions in spatial relationships and increased uncertainty when processing multiple views, due to inconsistencies across different viewpoints.
In contrast, SeeGround was designed to mitigate these issues by converting 3D spatial information into 2D images and textual representations through its Perspective Adaptation Module (PAM) and Fusion Alignment Module (FAM), enabling the Vision–Language Model (VLM) to perform reasoning more effectively. While this approach allows the VLM to interpret horizontal (left–right) and vertical (up–down) relationships on the 2D image plane using OLT-based object information, it fundamentally struggles to connect numerical Z-coordinate differences expressed in text to intuitive depth cues such as relative distance or front–back relationships.
For example, when given a query such as “identify the object that is closer to the camera between two side-by-side objects in the rendered image,” the VLM cannot reliably infer the subtle 3D ordering using only the numerical Z-coordinate values provided in the textual prompt. In essence, the VLM can recognize multiple objects and read their coordinates within a 2D image, but lacks explicit cues for determining which object is positioned closer or farther in 3D space. This ambiguity becomes a major source of grounding errors in complex indoor environments, particularly when objects occlude one another or are arranged along the depth axis.
For instance, when SeeGround was applied to the ScanRefer dataset query “a red square chair. It is under a round table,” the model failed to identify the correct square chair positioned beneath the round table, instead selecting an incorrect chair due to depth-axis ambiguity inherent in the 2D rendering.

3.3. The Proposed Method

The proposed method extracts depth information from a rendered 2D image via monocular depth estimation and converts it into concise textual descriptions that express relative distance relationships between objects. These depth-based relational sentences are then provided to the VLM prompt as explicit cues to enhance spatial understanding.
Humans, when perceiving visual scenes, cannot obtain exact spatial coordinates of objects; yet, they intuitively grasp relative distances and front–back relationships through an inherent sense of spatial depth. In contrast, a VLM must rely solely on 2D rendered views and textual 3D coordinates, making it challenging to comprehend the actual depth difference between two visually adjacent objects. To convey a human-like spatial sense to the VLM, we propose augmenting the conventional Spatial Text with additional depth-aware natural language descriptions, enabling the model to better understand inter-object relationships.
Our initial idea was to directly provide the depth map extracted from the rendered 2D image as a visual input to the VLM. However, as prior works [7,13,14] indicate, depth information is typically processed through specialized encoders or fusion modules, rather than being directly interpreted by VLMs. Hence, this approach would likely be inefficient and require additional training.
To overcome this, we refined our idea to exploit the VLM’s inherent language comprehension ability. Ultimately, we propose the concept of Relational Depth Text (RDT), which converts extracted depth information into compact natural-language relational statements that can be utilized without additional model training, providing explicit depth cues to the VLM for improved spatial reasoning.
To generate the proposed Relational Depth Text (RDT), obtaining accurate depth information is a crucial step. Initially, we considered utilizing the 3D point cloud data—the same input used in SeeGround—to extract depth information. However, raw point clouds from datasets such as ScanNet [19] often contain substantial noise and exhibit non-uniform density, resulting in inaccurate or inconsistent Z-buffer depth values when directly projected.
To address this issue, we adopt the Monocular Depth Estimation model Depth Anything V2 [20]. This model has demonstrated superior robustness and fine-grained detail reconstruction compared to prior approaches. Depth Anything V2 follows a teacher–student training paradigm: first, a large-scale teacher model is trained exclusively on high-precision synthetic images. This teacher is then used to generate pseudo-labels for unlabeled real-world images, which are subsequently employed to train the student model. Through this process, the resulting model acquires the ability to infer dense and coherent depth maps by understanding the semantic context within 2D images. Trained on diverse indoor and outdoor scenes, it exhibits strong generalization capability across different environments.
In our method, rather than estimating absolute metric depth, we utilize the pre-trained Depth-Anything-V2-Large (ViT-L) model, which predicts relative depth relationships, with a focus on front-to-back ordering between objects. The 2D rendered image $I$, generated through SeeGround’s Perspective Adaptation Module (PAM), is fed into the Depth Anything V2 model $F_{DA2}$, producing a dense depth map $D$ for the given viewpoint, as formulated in Equation (5).
$D = F_{DA2}(I)$ (5)
This depth map serves as the raw data source for generating relational depth text in the subsequent stage.
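Assuming the Hugging Face release of the checkpoint, the depth extraction of Equation (5) can be run via the transformers depth-estimation pipeline, as sketched below; this is one convenient inference path, not necessarily the exact one used in our experiments.

```python
from PIL import Image
from transformers import pipeline

# Depth-Anything-V2-Large via the Hugging Face depth-estimation pipeline;
# the checkpoint name assumes the transformers-compatible hub release.
depth_estimator = pipeline(task="depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Large-hf")

def estimate_depth(rendered_view: Image.Image):
    """Equation (5): dense relative depth map D for a rendered view I."""
    out = depth_estimator(rendered_view)
    # "predicted_depth" is a torch tensor; larger values indicate closer pixels.
    return out["predicted_depth"].squeeze().cpu().numpy()
```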
The overall pipeline of the proposed methodology is illustrated in Figure 3, while the RDT generation and integration process is depicted across Inference Stages 2 and 3 in the same figure. This process is designed to generate textual cues that assist the VLM in understanding 3D spatial relationships, and it consists of three main stages: (1) Depth Information Extraction, (2) Relational Text Generation, and (3) Integration and Final Inference.
First, the query-aligned 2D RGB image $I \in \mathbb{R}^{H \times W \times 3}$, rendered through the Perspective Adaptation Module (PAM), is fed into the pre-trained Depth Anything V2 model to generate a dense depth map $D \in \mathbb{R}^{H \times W}$ corresponding to that view. Using the alpha mask $\alpha$ obtained during rendering, background pixels that do not belong to the 3D scene are assigned an infinite depth value. This operation effectively removes irrelevant depth information and produces a refined depth map $D_{\mathrm{clean}}$. The pixel-wise computation for this refinement process is defined as follows:
$D_{\mathrm{clean}}(p) = \begin{cases} D(p) & \text{if } \alpha(p) = 1 \\ \infty & \text{if } \alpha(p) = 0 \end{cases}$ (6)
However, the raw depth values predicted by Depth Anything V2 are affine-invariant (larger values indicate closer objects) and lack a consistent metric scale. To ensure reproducibility and address the ambiguity of depth units across different scenes, we apply min-max normalization within the valid region of the current view frustum. As implemented in our inference pipeline, the normalized relative depth score $D_{\mathrm{norm}}$ is computed as
$D_{\mathrm{norm}}(p) = \dfrac{d_{\max} - D_{\mathrm{clean}}(p)}{d_{\max} - d_{\min}} \times 100$ (7)
where $d_{\min}$ and $d_{\max}$ denote the minimum and maximum finite depth values in $D_{\mathrm{clean}}$, respectively. This transformation converts the arbitrary depth logits into a standardized “Relative Depth Score” ranging from 0 (closest) to 100 (farthest). Consequently, a difference of “10 units” consistently signifies a 10% discrepancy in relative depth within the scene, providing a robust metric for the VLM to interpret spatial relationships.
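Equations (6) and (7) amount to a masking step followed by min-max rescaling. A minimal NumPy sketch, which keeps background pixels at infinity so that they are excluded from later statistics, is given below.

```python
import numpy as np

def refine_and_normalize(depth: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Equations (6)-(7): mask background pixels, then rescale to [0, 100].

    `depth` is the raw Depth Anything V2 output (larger = closer) and
    `alpha` the rendering mask (1 = scene pixel, 0 = background).
    """
    d_clean = np.where(alpha == 1, depth, np.inf)        # Equation (6)
    mask = np.isfinite(d_clean)
    d_min, d_max = d_clean[mask].min(), d_clean[mask].max()
    d_norm = np.full_like(d_clean, np.inf)
    # Equation (7): 0 = closest, 100 = farthest relative depth score;
    # background stays at infinity so later minima ignore it.
    d_norm[mask] = (d_max - d_clean[mask]) / (d_max - d_min) * 100.0
    return d_norm
```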
Next, based on the refined depth map, the relational text is generated. During the original FAM process, each object $o_i$ is assigned a marker coordinate $m_i = (u_i, v_i)$ representing its location on the 2D image. We define a $5 \times 5$ patch region $P(m_i)$ centered on each marker coordinate $m_i$, and within this region, we compute the minimum valid (non-infinite) depth value to represent the object’s representative depth $d_i$.
$d_i = \min_{p \in P(m_i)} D_{\mathrm{norm}}(p)$ (8)
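Equation (8) can be implemented as a patch-minimum lookup; the boundary clamping at image edges in the sketch below is an assumption on our part.

```python
import numpy as np

def representative_depth(d_norm: np.ndarray, marker: tuple[int, int],
                         patch: int = 5) -> float:
    """Equation (8): minimum valid depth score in a patch x patch window
    centered on the object's 2D marker coordinate (u, v)."""
    u, v = marker                       # u = column (x), v = row (y)
    h = patch // 2
    region = d_norm[max(v - h, 0): v + h + 1, max(u - h, 0): u + h + 1]
    finite = region[np.isfinite(region)]
    return float(finite.min()) if finite.size else float("inf")
```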
This approach, using statistical values from a patch region rather than a single pixel, helps obtain robust depth estimates that are less sensitive to rendering artifacts or noise. Once representative depth values for all objects have been extracted, the proposed Spatial Relationship Detection (SRD) module describes inter-object depth relationships using an algorithm inspired by the k-Nearest Neighbors (k-NN) concept [21]. Taking the target object $o_t$ identified from the query as a reference, the depth difference between the target and each other visible object $o_j$ is computed as $\Delta d_{tj} = |d_t - d_j|$. This value $\Delta d$ serves as a distance metric in the depth dimension. The SRD module then selects the top-$k$ neighboring objects whose depth differences are smallest, that is, those most similar in depth, to form the depth-based nearest neighbor set $N_k(o_t)$. In this study, $k = 3$ is used, and the depth-based nearest neighbor set of the target object $o_t$ is defined as shown in Equation (9).
$N_k(o_t) = \underset{o_j \in O_{\mathrm{vis}} \setminus \{o_t\},\ |N_k| = k}{\arg\min}\ |d_t - d_j|$ (9)
$O_{\mathrm{vis}}$ denotes the set of all visible objects in the rendered view, while $d_t$ and $d_j$ represent the representative depth values of the target object $o_t$ and another arbitrary object $o_j$, respectively. Selecting neighbors based on depth difference rather than 2D Euclidean distance allows the system to directly address depth-related queries such as “closer” or “behind.” This approach provides the VLM with spatially meaningful comparison candidates that reflect real 3D relationships.
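The neighbor selection of Equation (9) then reduces to a top-k over absolute depth differences, as in the following sketch, which assumes representative depths are stored in a dictionary keyed by object ID.

```python
import math

def depth_knn(target_id: int, depths: dict[int, float], k: int = 3) -> list[int]:
    """Equation (9): the k visible objects whose representative depth
    is closest to the target's, ranked by |d_t - d_j|."""
    d_t = depths[target_id]
    ranked = sorted(
        (obj_id for obj_id, d_j in depths.items()
         if obj_id != target_id and math.isfinite(d_j)),
        key=lambda obj_id: abs(d_t - depths[obj_id]),
    )
    return ranked[:k]
```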
Using the nearest-neighbor information and the query keywords, two types of relational depth descriptions are generated for the VLM input. First, contextual neighborhood statements describe nearby objects based on k-NN information, e.g., “Target is near objects x, y, and z.” This gives the VLM contextual awareness of the target’s surrounding spatial environment. Second, keyword-conditioned relational statements are produced when the original query $Q$ contains predefined spatial relationship keywords (e.g., “near,” “behind”). In such cases, the representative depth values of the target ($d_t$) and anchor ($d_a$) objects are directly compared. For example, if $d_t < d_a$ (i.e., the target has the smaller normalized depth score and is therefore closer to the viewpoint) and a relevant keyword is present in the query, a template-based relational depth description such as “Target is closer than Anchor by $\Delta d_{ta}$ units.” is generated.
When the query lacks explicit depth-related keywords or when the VLM fails to identify target/anchor objects, a fallback mechanism is invoked. In this case, the system produces alternative descriptions that still offer minimal but useful spatial cues, such as: “From the current viewpoint, object X is the closest,” or “Object Y is visible.” This fallback strategy ensures that the VLM receives at least coarse spatial context, thereby enhancing the overall robustness of the system even when ideal depth comparisons are not feasible.
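Combining the two statement types with the fallback, the RDT generation logic can be sketched as follows; the keyword list and the exact sentence templates are illustrative stand-ins for those used in our pipeline.

```python
import math

# Illustrative keyword list; the actual trigger set is defined in our pipeline.
DEPTH_KEYWORDS = ("near", "behind", "close", "closest", "front", "far")

def build_rdt(query: str, target_id: int, anchor_id: int | None,
              depths: dict[int, float], neighbors: list[int]) -> str:
    """Compose the Relational Depth Text from the k-NN context statement,
    the keyword-conditioned comparison, and the viewpoint fallback."""
    lines = []
    if neighbors:
        ids = ", ".join(str(i) for i in neighbors)
        lines.append(f"Target is near objects {ids}.")
    if anchor_id is not None and any(kw in query.lower() for kw in DEPTH_KEYWORDS):
        # Normalized scores: 0 = closest, so a smaller d_t means "closer".
        delta = depths[anchor_id] - depths[target_id]
        rel = "closer" if delta > 0 else "farther"
        lines.append(f"Target is {rel} than Anchor by {abs(delta):.0f} units.")
    if not lines:  # fallback: coarse viewpoint-anchored cues
        finite = {i: d for i, d in depths.items() if math.isfinite(d)}
        nearest = min(finite, key=finite.get)
        farthest = max(finite, key=finite.get)
        lines.append(f"From the current viewpoint, object {nearest} is the "
                     f"closest and object {farthest} is the farthest.")
    return " ".join(lines)
```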
Finally, the generated Relational Depth Text (RDT) $T_{\mathrm{depth}}$ is integrated with the existing Spatial Text $T$, providing the VLM with explicit cues to resolve depth-related ambiguities. As illustrated in Inference Stage 3 of Figure 3, the final prompt delivered to the VLM agent consists of: (1) the original natural language query $Q$; (2) the prompted image $I_m$ with object IDs visually annotated; and (3) the integrated spatial text $T' = T + T_{\mathrm{depth}}$ that includes depth-aware relational descriptions.
The RDT thus enables the VLM to understand not only textual 3D coordinates but also the implicit front–back relationships within actual 3D space in a more intuitive way. Based on this enriched contextual input, the VLM produces a response that includes both the predicted object ID and its reasoning evidence, as formulated in Equation (10). In the post-processing stage, the system parses the predicted ID from the response to determine the final grounded object $\hat{o}$.
$\hat{o} = \mathrm{VLM}_{\mathrm{ground}}(Q,\ I_m,\ T')$, where $T' = T + T_{\mathrm{depth}}$ (10)
The predicted ID ($\hat{o}$) is then used to query the Object Lookup Table (OLT), where it is mapped to its corresponding 3D bounding box, including the object’s position and size. This bounding box is returned as the final output of the overall framework.

4. Experimental Results

4.1. Experiment Details

The experiments were conducted on a workstation equipped with an AMD Ryzen 9 5900X processor and a single NVIDIA GeForce RTX 3090 GPU with 24 GB of video RAM. To demonstrate the practicality of the proposed method under hardware constraints, the VLM agent was replaced with the Qwen2-VL-7B-Instruct model [22], a lightweight version with approximately 10.3 times fewer parameters than the original 72B model. During testing, long prompts describing complex scenes occasionally exceeded the model’s maximum input length. To ensure stable inference, the vLLM server’s maximum sequence length was set to 8192 tokens, and GPU memory usage was limited to 80%. Under this configuration, the VLM agent utilized approximately 20.16 GB of VRAM.
To evaluate the grounding performance of the proposed method, we used the same benchmark datasets as those employed in SeeGround, namely Nr3D and ScanRefer. The Nr3D dataset consists of 41,503 natural language descriptions collected from 707 3D scenes, focusing on fine-grained language-based object discrimination (i.e., identifying a specific instance among multiple similar objects). It was constructed through an online game in which two participants interacted to match objects, resulting in highly discriminative and perspective-dependent expressions. In contrast, the ScanRefer dataset comprises 51,583 descriptions collected from 800 ScanNet scenes, emphasizing natural and intuitive language expressions for object localization. The descriptions were collected in a free-form manner without fixed templates, featuring diverse relational expressions that describe general spatial locations.
To ensure rigorous evaluation, we followed the official splits of the ScanNet benchmark [19]. Since our proposed method operates in a zero-shot manner without a training phase, we did not utilize the training splits. Instead, all evaluations were conducted on the validation splits of both datasets (Nr3D and ScanRefer), as the ground truth for the test sets is not publicly available. This protocol aligns with the baseline SeeGround to ensure a direct and fair comparison.
For a fair comparison with the original SeeGround, the rendered image resolution was fixed at 1000 × 1000 pixels, and the Object Lookup Table (OLT) was constructed using the predicted results of Mask3D [23].

4.2. Experimental Results

Performance evaluation was conducted using the standard metric of grounding accuracy. A prediction is considered successful if the Intersection over Union (IoU) between the predicted 3D bounding box $B_{\mathrm{pred}}$ and the ground-truth bounding box $B_{\mathrm{gt}}$ exceeds a specified threshold ($\tau$). The IoU is computed based on the volumetric overlap between the two bounding boxes, as defined in Equation (11).
$\mathrm{IoU}(B_{\mathrm{pred}},\ B_{\mathrm{gt}}) = \dfrac{\mathrm{vol}(B_{\mathrm{pred}} \cap B_{\mathrm{gt}})}{\mathrm{vol}(B_{\mathrm{pred}} \cup B_{\mathrm{gt}})}$ (11)
In this study, Acc@$\tau$, a standard metric widely adopted in 3D visual grounding benchmarks [1,2], is employed as the primary metric for comprehensive performance comparison, defined as shown in Equation (12). For evaluation, the threshold $\tau$ was set to 0.25 and 0.5, respectively.
$\mathrm{Acc@}\tau = \dfrac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\mathrm{IoU}(B_{\mathrm{pred},i},\ B_{\mathrm{gt},i}) > \tau\right)$ (12)
Here, $N$ denotes the total number of queries, and $\mathbb{I}$ represents an indicator function that returns 1 when the condition is true and 0 otherwise.
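For axis-aligned boxes parameterized by center and size, Equations (11) and (12) can be evaluated as in the sketch below; the (cx, cy, cz, dx, dy, dz) parameterization is an assumption for illustration.

```python
import numpy as np

def iou_3d(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Equation (11) for axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    min_a, max_a = box_a[:3] - box_a[3:] / 2, box_a[:3] + box_a[3:] / 2
    min_b, max_b = box_b[:3] - box_b[3:] / 2, box_b[:3] + box_b[3:] / 2
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b),
                      0.0, None)
    inter = float(np.prod(overlap))
    union = float(np.prod(box_a[3:]) + np.prod(box_b[3:])) - inter
    return inter / union if union > 0 else 0.0

def acc_at_tau(preds: list[np.ndarray], gts: list[np.ndarray],
               tau: float = 0.25) -> float:
    """Equation (12): fraction of queries whose predicted box exceeds tau IoU."""
    return sum(iou_3d(p, g) > tau for p, g in zip(preds, gts)) / len(preds)
```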
In the Nr3D dataset, performance is further analyzed by categorizing queries according to their difficulty level—Easy (with one distractor object) and Hard (with multiple distractors)—as well as by whether viewpoint information is required for interpretation, namely View-Dependent (requiring a specific viewpoint) and View-Independent (viewpoint-independent). In the ScanRefer dataset, evaluation is divided based on whether the scene contains multiple instances of the same object class: Unique (a single instance) and Multiple (multiple instances). This categorization focuses on assessing the model’s discriminative capability between visually similar objects.
To assess the effectiveness of the proposed method, the model’s performance is compared across these four evaluation settings.
First, Ori Baseline 72B represents the original SeeGround performance using the 72B-parameter VLM, serving as the upper-bound reference. Second, Ori Baseline 7B corresponds to the performance obtained by running the original SeeGround logic with a 7B VLM, reflecting the impact of reducing model size. Both of these metrics were directly cited from the original SeeGround paper. Third, Reproduced 7B (Baseline) denotes our reproduced SeeGround pipeline implemented in the experimental environment using the lightweight 7B VLM, and serves as the primary baseline for direct comparison. Finally, Ours 7B+Depth refers to our final model, in which the proposed RDT module is integrated into the baseline. Thus, the performance gap between Reproduced 7B and Ours 7B+Depth directly reflects the contribution of the proposed methodology.
Table 2 and Table 3 present the quantitative evaluation results on the Nr3D and ScanRefer datasets, respectively. The proposed model (Ours 7B+Depth) demonstrated meaningful performance variations depending on the dataset characteristics compared to the direct baseline (Reproduced 7B).
For the Nr3D dataset (Table 2), our model achieved a consistent performance improvement of 3.54% in overall Acc@0.25. This improvement was observed across all subsets (Easy, Hard, Dep., Indep.), with particularly notable gains of 3.68% in the Hard subset, where spatial ambiguity is high, and 3.24% in the View-Dependent subset. These results confirm that RDT provides the VLM with additional depth-related contextual cues, enabling more accurate reasoning in complex and fine-grained grounding queries.
In contrast, the results on the ScanRefer dataset (Table 3) were mixed. For the Unique category at Acc@0.25 (scenes containing a single target instance), performance improved significantly, by 6.74%, whereas in the Multiple category (scenes containing multiple similar objects), accuracy dropped by 1.70%. This suggests that while RDT acts as a helpful cue in unambiguous or straightforward cases, it may introduce confusion in complex Multiple scenarios, where the lightweight 7B model struggles to integrate multiple textual cues effectively. A similar trend was observed at Acc@0.5, where overall performance decreased by 0.71%, implying that as the IoU threshold increases, requiring finer localization, RDT-induced ambiguity may lead to mispredictions.
From a computational perspective, generating depth maps and RDTs introduces negligible overhead to the overall inference time. The computational costs, including GPU memory usage and wall-clock inference time for both the Nr3D and ScanRefer datasets, are summarized in Table 1.
As shown in Table 1, the depth module requires an additional 2.4 GB of VRAM, bringing the total peak usage to 22.56 GB. Crucially, this remains within the memory capacity of a standard consumer-grade GPU (24 GB), ensuring accessibility without the need for high-end data center hardware (e.g., an A100). In terms of latency, depth estimation introduces an amortized overhead of approximately 40 ms per query due to view caching. Since the pipeline is predominantly bottlenecked by the VLM’s auto-regressive token generation, the total time increase is approximately 2% to 5%, which is negligible in practical applications.
To provide a comprehensive evaluation, we compare our method with state-of-the-art zero-shot 3D visual grounding approaches, including VLM-Grounder, SORT3D, SeqVLM, and View-on-Graph (VoG). The results are summarized in Table 4. It is important to note that these competing methods rely on large-scale foundation models such as GPT-4 (closed-source) or Qwen2-VL-72B, which require significant computational resources. In contrast, our method employs the lightweight Qwen2-VL-7B model, designed for consumer-grade GPU environments (e.g., a single RTX 3090). Despite the significant disparity in model size (7B vs. 72B/GPT-4), our approach demonstrates competitive performance, particularly in the ScanRefer benchmark, where it surpasses the baseline and narrows the gap with larger models.
To complement the quantitative evaluation and to further explain the reasoning process of the proposed methodology, a qualitative analysis was conducted. Figure 4 and Figure 5 present representative results from the Nr3D and ScanRefer benchmarks, respectively, highlighting cases in which the baseline model (Reproduced 7B) failed while the proposed model (Ours 7B+Depth) succeeded. Each example includes the prompted image ($I_m$) containing object ID markers, the corresponding depth map, and the VLM’s inference result. This analysis visually demonstrates how the proposed Relational Depth Text (RDT) resolves spatial ambiguity and exerts a decisive influence on the VLM’s reasoning process.
The first case in Figure 4 involves a relatively simple spatial relation query: “Find the window next to the white desk.” In the rendered 2D view, two visually similar windows (IDs 10 and 12) are positioned side by side near the reference object, the white desk (ID 9), creating apparent ambiguity. Without access to depth information, the baseline model misinterpreted the “next to” relationship and incorrectly selected object 10.
In contrast, the proposed model correctly identified the target window (ID 12). The generated RDT in this case, “From the current viewpoint, object 9 is visible,” does not provide explicit depth comparison information. Since the query lacks explicit depth-related keywords such as “near” or “behind,” a fallback text was generated. Nevertheless, this text played an essential role as a reasoning anchor, guiding the VLM to focus its reasoning on the reference object (object 9). The phrase “object 9 is visible” implicitly instructs the model to “reconsider spatial relations relative to object 9 rather than other distractors.” As a result, the VLM re-evaluated the 3D spatial relationships among objects 9, 10, and 12, ultimately selecting the correct window (ID 12).
In conclusion, this case demonstrates that even when RDT does not provide explicit depth comparison, it effectively redirects the model’s reasoning focus toward the key anchor object, thereby resolving fine-grained spatial ambiguities that the baseline model could not handle.
The second example in Figure 4 highlights the limitations of 2D viewpoints and underscores the importance of 3D spatial reasoning. The query is: “Find the rectangular copier to the left of the trash bin.” From the rendered 2D view alone, it is not visually intuitive whether the copier (ID 21) is actually positioned to the left of the trash bin (ID 7). The baseline model, which relied solely on 3D coordinate information from the Object Lookup Table (OLT), evaluated multiple potential “left” candidates but incorrectly selected another visually prominent object (ID 5).
In contrast, the proposed model successfully leveraged an explicit RDT cue: “From the current viewpoint, object 21 is the closest and object 27 is the farthest.” This additional relational description provided two crucial constraints: (1) the object must be to the left of the trash bin in 3D space, and (2) it must also be the closest object from the current viewpoint. By integrating these two spatial conditions, the VLM effectively narrowed down the candidate set.
As shown in the depth map, object 21 appears in the red-shaded region, indicating that it is indeed the closest object in the current viewpoint. Consequently, the model accurately identified object 21 as the correct target satisfying both conditions.
This case clearly demonstrates that the proposed RDT plays a decisive role in overcoming 2D projection ambiguity, enabling the VLM to disambiguate multiple 3D candidates correctly and leading to more precise grounding in complex spatial reasoning scenarios.
The final example in Figure 4 presents a query that requires a comprehensive understanding of the entire scene composition. In this case, the baseline model incorrectly selected an invisible object (ID 16) in the rendered image. This error illustrates how the lightweight 7B version of the model, when faced with complex queries, fails to utilize the visual context ($I_m$), one of the key design principles of the original SeeGround, and instead relies solely on text-based coordinate information from the OLT. As a result, the viewpoint-dependent constraint expressed in the query (“the pool table is visible to the right”) was completely ignored.
In contrast, the proposed model successfully resolved this issue with the aid of the generated RDT: “From the current viewpoint, object 6 is the closest and object 15 is the farthest.” This sentence functioned as a viewpoint anchor, effectively fixing the VLM’s reasoning process to the current rendered perspective.
To interpret this RDT, the VLM first needed to locate object 15 within the image. This process inherently forces the model to attend to the visual content, preventing the neglect of visual cues that occurred in the baseline model. Once grounded to the correct viewpoint, the VLM was able to correctly interpret the visual constraint (“the pool table is visible”) and naturally exclude invisible candidates such as object 16. Furthermore, the phrase “object 15 is the farthest” helped narrow the search space to the deepest region of the scene, where the correct answer (object 18) was located. Finally, within this constrained search space, the VLM evaluated the remaining conditions and correctly identified object 18 as the target. In summary, this case demonstrates that RDT prevents the lightweight model from disregarding visual context in complex query scenarios, effectively restoring and enhancing the core reasoning capability originally intended in SeeGround’s design.
Figure 5 presents qualitative examples from the ScanRefer dataset, which exhibit patterns distinct from those observed in Nr3D. Because queries in ScanRefer often rely more on unique visual attributes or simple adjacency relations between objects, the effect of RDT tends to appear more indirect compared to its influence in Nr3D.
The first example in Figure 5 corresponds to the query “Find the kitchen cabinet under the sink (ID 50).” In the rendered 2D view, the correct cabinet (ID 10) and the incorrect cabinet selected by the baseline (ID 19) are positioned side by side beneath the sink, creating an apparent visual ambiguity. The baseline model lacked additional cues to distinguish between these two visually similar candidates and thus produced an incorrect prediction (ID 19).
In contrast, the proposed model received an additional RDT cue: “From the current viewpoint, object 12 is the closest and object 10 is the farthest.” This text provides new spatial information not explicitly mentioned in the query—specifically, that the correct object (ID 10) is the farthest from the current viewpoint. Given the initial query condition “under the sink,” the VLM evaluated both candidates (10 and 19) that satisfied this constraint, but then applied the “farthest” criterion to determine which object best matched the overall context. Through this process, the VLM obtained a decisive basis to distinguish between the two visually similar candidates and successfully identified the correct object (ID 10).
In summary, this example demonstrates that RDT can provide supplementary spatial cues not explicitly stated in the query, thereby resolving ambiguity among visually similar objects and ultimately improving grounding accuracy in complex scenes.
The second example in Figure 5 illustrates how the proposed method compensates for the VLM’s limited visual analysis capability. The query explicitly specifies a visual attribute—“Find the table with many colors.” In the rendered image, only one table (ID 11) clearly exhibits multiple colors, yet the baseline model incorrectly selected table 15.
This failure can be attributed to the model’s overreliance on OLT information and the limited perceptual capacity of the lightweight 7B VLM. The baseline VLM likely began by searching the OLT for the keyword “table” and prioritized object 15 as the most probable match. Ideally, it should then have verified the “many colors” attribute through visual inspection of the image; however, due to its restricted vision-language reasoning capability, it failed to evaluate this condition accurately and instead retained its initial OLT-based inference.
In contrast, the proposed model successfully resolved this issue with the aid of the RDT: “From the current viewpoint, object 16 is the closest and object 11 is the farthest.” This statement serves as an explicit spatial cue, directing the VLM to attend to the farthest object (ID 11) within the scene. By shifting the model’s attention toward this spatial region, the RDT implicitly encouraged the VLM to re-evaluate the visual content of the distant object, leading it to recognize the multicolored surface of table 11 and correctly identify it as the answer.
Overall, this example demonstrates that RDT enhances the VLM’s limited visual reasoning ability by providing spatially grounded attention cues, enabling the model to incorporate visual context more effectively and make accurate predictions even when direct visual discrimination is challenging.
Although the VLM initially considered object 15 the most probable candidate based on the OLT, the RDT enabled it to recognize object 11 as a new, plausible candidate that had been previously overlooked. Subsequently, the VLM compared the visual attributes of object 11 against the query phrase “many colors” and confirmed that both cues—depth positioning and visual characteristics—were consistent. This process indicates that the model ultimately reaffirmed object 11 as the correct answer by integrating spatial and visual evidence.
In summary, this case illustrates how the RDT serves as a mechanism that redirects the VLM’s attention toward the correct candidate object, prompting it to reevaluate the key visual features specified in the query and thereby arrive at the correct grounding result.
The final example in Figure 5 illustrates how RDT helps the VLM eliminate incorrect candidates during the reasoning process. The query, “Find the large curtain touching the table,” requires understanding of spatial relationships. The baseline model, when faced with multiple curtain candidates (IDs 7 and 9), incorrectly selected object 7, which appeared more visually prominent and closer in the scene, due to a misinterpretation of its spatial relationship with the table (ID 0).
In contrast, the proposed model utilized the RDT: “From the current viewpoint, object 7 is the closest and object 0 is the farthest.” This statement provides contradictory spatial information compared to the baseline’s reasoning, indicating that the incorrectly selected object (ID 7) is the closest to the viewpoint, while the table (ID 0), which serves as the reference object in the query, is the farthest.
Armed with this information, the VLM recognizes that the configuration “the closest object (7) is touching the farthest object (0)” is spatially implausible. This logical inconsistency prompts the model to discard object 7 as a valid candidate and instead re-evaluate other possible options—eventually identifying object 9, which more plausibly satisfies the condition “touching the table.”
In summary, this case demonstrates that RDT can guide reasoning not only by confirming correct candidates but also by refuting incorrect ones. By providing counter-evidence that exposes spatial contradictions, RDT helps the VLM reassess its reasoning path and converge on a more consistent interpretation of the 3D scene.
Through the qualitative analysis, several distinct mechanisms were observed by which RDT enhances the VLM’s 3D spatial reasoning capability. In Nr3D, where queries often involve complex depth relationships, RDT acted as a direct viewpoint anchor, supplying decisive cues that grounded the model’s reasoning in the correct spatial frame. In contrast, in ScanRefer, where queries tend to rely on simpler visual or relational cues, RDT influenced the VLM in more indirect ways—such as focusing attention on overlooked candidates or eliminating implausible ones.
Overall, these findings demonstrate that RDT adapts flexibly to different reasoning contexts, enhancing grounding robustness without any additional model training and thereby improving the overall reliability of the original SeeGround framework.

4.3. Failure Analysis on ScanRefer

To investigate the cause of the performance decrease (−1.70%) in the ScanRefer “Multiple” category (Table 3), we conducted a rigorous analysis of failure cases where the baseline succeeded but our method failed. The analysis reveals a distinct trade-off mechanism related to the interplay between visual attributes and spatial cues.
As reported in the ScanRefer original study [1], while spatial language is dominant (98.7%), descriptions also heavily rely on visual attributes, such as color (74.7%) and shape (64.9%), to distinguish objects. Our analysis confirms that 67% of the failure cases occurred in “attribute-dominant” queries where visual discrimination was the key factor.
A representative example involves the following query: “there is a dark brown wooden and leather chair.” Here, the target is distinguished solely by its material and color features among other chairs. However, our RDT module generated a depth-based description: “From the current viewpoint, object 8 is the closest.” In this scenario, the explicit spatial hint provided by RDT acted as a “distractor,” causing the VLM to prioritize the spatial cue (proximity) over the visual attributes (dark brown/leather) mandated by the user.
This “Spatial Bias” explains the performance drop in the “Multiple” category. While RDT effectively resolves spatial ambiguities (as seen in Nr3D), it can introduce redundancy or distraction when the grounding task requires fine-grained visual discrimination rather than spatial reasoning. This observation underscores the need for an adaptive approach, as discussed in the Conclusion, where RDT is selectively applied based on the query type to mitigate such trade-offs.

4.4. Prompt Templates Used for the Nr3D Dataset

To facilitate a clear understanding of our system logic, we provide the exact prompt templates used in our experiments (specifically for the Nr3D dataset). The prompts are constructed dynamically using the following components:
System Prompt:
“You are a helpful assistant designed to identify objects based on image and descriptions.”
User Prompt Structure:
(1) Image Context:
“As shown in the image, this is a rendered image of a room, and the picture reflects your current view. Each object in the room is labeled by a unique number (ID) in red color on its surface.”
(2) Object Information:
“Object IDs and their 3D spatial information are as follows:
{objects_info}”
(3) Coordinate Definition:
“The 3D spatial coordinate system is defined as follows: X-axis and Y-axis represent horizontal dimensions, Z-axis represents the vertical dimension.”
(4) Relational Depth Text (Proposed Method):
“Critical spatial hints from depth:
{depth_text}
Use these hints to resolve ambiguities.”
(5) Task Instruction:
“Please review the provided image and object 3D spatial descriptions, then select the object ID that best matches the given description.”
(6) Response Format:
“Respond in the format: ‘Predicted ID: <ID>\nExplanation: <explanation>’, where <ID> is the object ID and <explanation> is your reasoning.”
(7) Target Query:
“The given description is: {query}”
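Assembling these components into the final user prompt is plain string formatting; the sketch below joins components (1) to (7) in order, with the exact concatenation and line breaks being our assumption.

```python
def build_user_prompt(objects_info: str, depth_text: str, query: str) -> str:
    """Assemble the Nr3D user prompt from components (1)-(7) above."""
    return (
        "As shown in the image, this is a rendered image of a room, and the "
        "picture reflects your current view. Each object in the room is labeled "
        "by a unique number (ID) in red color on its surface.\n"
        f"Object IDs and their 3D spatial information are as follows:\n{objects_info}\n"
        "The 3D spatial coordinate system is defined as follows: X-axis and "
        "Y-axis represent horizontal dimensions, Z-axis represents the vertical "
        "dimension.\n"
        f"Critical spatial hints from depth:\n{depth_text}\n"
        "Use these hints to resolve ambiguities.\n"
        "Please review the provided image and object 3D spatial descriptions, "
        "then select the object ID that best matches the given description.\n"
        "Respond in the format: 'Predicted ID: <ID>\\nExplanation: <explanation>', "
        "where <ID> is the object ID and <explanation> is your reasoning.\n"
        f"The given description is: {query}"
    )
```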

5. Conclusions

This paper proposed the Relational Depth Text (RDT) methodology to overcome the limitations of depth-aware reasoning in the existing zero-shot 3D Visual Grounding (3DVG) framework, SeeGround. The proposed method employs a Monocular Depth Estimation model to extract depth information from 3D scenes explicitly and utilizes a k-NN-inspired algorithm to generate natural language descriptions of relative depth relationships between objects, which are then integrated into the VLM’s reasoning process. Through this approach, the framework retains SeeGround’s core strength, representing 3D scenes as interpretable combinations of 2D images and textual information, while compensating for the model’s weakness in comprehending depth relationships. Experimental results showed that, under realistic hardware constraints with a lightweight 7B model, the proposed method achieved meaningful dataset-dependent improvements. On the Nr3D dataset, which includes many complex spatial relation queries, the method consistently improved performance across all categories, achieving a +3.54% gain in Acc@0.25, thereby validating the effectiveness of the proposed approach. In contrast, on the ScanRefer dataset, the method exhibited mixed results, with substantial improvement in Unique cases but a performance drop in Multiple cases.
The limitations of this study can be discussed in three aspects. First, the current text generation mechanism is effective for simple comparative expressions, such as “A is closer than B,” but is limited in representing complex, multi-object spatial relationships, such as “C is between A and B.” Second, since depth is estimated from a single 2D image, the accuracy and reliability of the depth estimation can degrade depending on the camera viewpoint generated by the PAM (e.g., a top-down view). Finally, as the current pipeline was designed primarily to validate the effectiveness of the proposed method, it always executes the depth module regardless of the query content, resulting in computational inefficiency.
Therefore, future research can extend the proposed methodology in the following directions: First, by incorporating text generation rules that jointly consider depth values and 2D positional relationships, the system can handle more complex multi-object spatial reasoning. Second, by fusing depth maps from multiple viewpoints, it will be possible to obtain robust and view-independent depth information, overcoming the limitations of single-view estimation. Third, introducing a conditional execution pipeline—which first analyzes query keywords and performs depth estimation only when necessary—can mitigate the aforementioned spatial bias and significantly optimize the overall inference efficiency of the system.
Beyond these technical refinements, the fundamental approach of this study—translating implicit geometric cues into explicit natural language—holds significant potential for broader applications where processing 3D point clouds is computationally prohibitive. For instance, in Embodied AI and Vision-Language Navigation (VLN), this method enables agents equipped with standard RGB cameras to comprehend depth-aware instructions (e.g., “move to the table behind the sofa”) without heavy 3D sensors. Furthermore, in Assistive Technology, transforming depth maps into descriptive text can generate intuitive scene narratives for the visually impaired, providing essential safety information about the relative positions of obstacles.
Collectively, these technical enhancements and expanded applications are expected to form the foundation for a robust 3D-VLM agent capable of human-like spatial reasoning and interaction in complex real-world environments.

Author Contributions

Conceptualization, H.-S.J. and S.-H.K.; implementation, H.-S.J. and S.-H.K.; analysis, H.-S.J., S.-H.K. and J.-E.H.; writing, original draft preparation, H.-S.J. and S.-H.K.; draft modification, J.-E.H.; funding acquisition, J.-E.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Research Program funded by the SeoulTech (Seoul National University of Science and Technology).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, D.Z.; Chang, A.X.; Nießner, M. ScanRefer: 3D object localization in RGB-D scans using natural language. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 202–221.
  2. Achlioptas, P.; Abdelreheem, A.; Xia, F.; Elhoseiny, M.; Guibas, L. ReferIt3D: Neural listeners for fine-grained 3D object identification in real-world scenes. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 422–440.
  3. Li, R.; Li, S.; Kong, L.; Yang, X.; Liang, J. SeeGround: See and ground for zero-shot open-vocabulary 3D visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
  4. Zhao, L.; Cai, D.; Sheng, L.; Xu, D. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2928–2937.
  5. Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-vocabulary 3D visual grounding with large language model as an agent. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 7694–7701.
  6. Yuan, Z.; Ren, J.; Feng, C.-M.; Zhao, H.; Cui, S.; Li, Z. Visual programming for zero-shot open-vocabulary 3D visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024.
  7. Xu, R.; Huang, Z.; Wang, T.; Chen, Y.; Pang, J.; Lin, D. VLM-Grounder: A VLM agent for zero-shot 3D visual grounding. In Proceedings of the Conference on Robot Learning (CoRL), Munich, Germany, 6–9 November 2024.
  8. Lin, J.; Bian, S.; Zhu, Y.; Tan, W.; Zhang, Y.; Xie, Y. SeqVLM: Proposal-guided multi-view sequences reasoning via VLM for zero-shot 3D visual grounding. In Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM), Dublin, Ireland, 27–31 October 2025.
  9. Liu, Y.; Mei, H.; Zhan, D.; Zhao, J.; Zhou, D.; Dong, B.; Yang, X. View-on-Graph: Zero-shot 3D visual grounding via vision-language reasoning on scene graphs. arXiv 2025, arXiv:2512.09215.
  10. Huang, S.; Chen, Y.; Jia, J.; Wang, L. Multi-view transformer for 3D visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 15524–15533.
  11. Guo, Z.; Tang, Y.; Zhang, R.; Wang, D.; Wang, Z.; Zhao, B.; Li, X. ViewRefer: Grasp the multi-view knowledge for 3D visual grounding with GPT and prototype guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 15372–15383.
  12. Huang, R.; Yang, H.; Cai, Y.; Xu, X.; Zhang, H.; He, S. ViewSRD: 3D visual grounding via structured multi-view decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–25 October 2025.
  13. Zhan, Y.; Yuan, Y.; Xiong, Z. Mono3DVG: 3D visual grounding in monocular images. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6988–6996.
  14. Zhang, Y.; Wu, D.; Shi, H.; Liu, Y.; Wang, T.; Fan, H.; Dong, X. Grounding beyond detection: Enhancing contextual understanding in embodied 3D grounding. arXiv 2025, arXiv:2506.05199.
  15. Chen, S.; Guhur, P.-L.; Tapaswi, M.; Schmid, C.; Laptev, I. Language conditioned spatial relation reasoning for 3D object grounding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 5–9 December 2022.
  16. Shi, X.; Wu, Z.; Lee, S. Viewpoint-aware visual grounding in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14056–14065.
  17. Zantout, N.; Zhang, H.; Kachana, P.; Qiu, J.; Chen, G.; Zhang, J.; Wang, W. SORT3D: Spatial object-centric reasoning toolbox for zero-shot 3D grounding using large language models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024.
  18. Ravi, N.; Reizenstein, J.; Novotny, D.; Gordon, T.; Lo, W.-Y.; Johnson, J.; Gkioxari, G. Accelerating 3D deep learning with PyTorch3D. arXiv 2020, arXiv:2007.08501.
  19. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  20. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024.
  21. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  22. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv 2024, arXiv:2409.12191.
  23. Schult, J.; Engelmann, F.; Hermans, A.; Litany, O.; Tang, S.; Leibe, B. Mask3D: Mask transformer for 3D semantic instance segmentation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8216–8223.
Figure 1. Overview of the proposed relational depth text (RDT) methodology. The step that converts depth information into text and adds it to the SeeGround framework is marked with a red box. The connection between the 2D image and the VLM indicates the visual input stream.
Figure 2. Overall pipeline of the SeeGround framework. For ease of understanding, the inference process is divided into two parts. In the offline preprocessing stage, the 3D scene (S) is processed by Open-Vocabulary 3D Detection (OVDet) to extract each object's bbox_i and sem_i, which are subsequently stored in the OLT. In Inference 1, the query (Q) is parsed by the VLM to identify the target and anchor, after which the PAM selects a query-aligned viewpoint to generate the rendered image (I), and the FAM fuses the prompted image (I_m) with the spatial text (T). In Inference 2, Q, I_m, and T are input to the VLM to predict the object ID, which is then retrieved from the OLT to return the 3D bounding box.
Figure 3. The inference pipeline integrated with the proposed Relational Depth Text methodology. The offline preprocessing stage is omitted, as it is identical to that in Figure 2. Inference 1 summarizes the baseline SeeGround process, which generates a prompted image (I_m) and spatial text (T) from a query (Q). Inference 2 (yellow area), our core contribution, extracts a depth map from the rendered image (I) and processes it with the SRD module. The SRD module uses its internal Query Keyword Analysis, k-NN, and RDD modules, together with a fallback logic, to generate the final Relational Depth Text (T_D). Finally, in Inference 3, an expanded prompt combining the original text (T) and the depth text (T_D) is input to the VLM to predict the target object's ID (ô).
Figure 4. Qualitative comparison results on the Nr3D dataset. Three cases are presented in which, by leveraging RDT as additional information, the proposed method successfully identifies objects in ambiguous queries that the baseline model failed to resolve. Each case displays the rendered prompted image, the corresponding depth map, and the RDT used for inference by our model.
Figure 5. Qualitative comparison results on the ScanRefer dataset. The composition of the figure is the same as in Figure 4.
Table 1. Computational cost comparison between the baseline and our method. The proposed method requires only marginal additional resources (approximately +40 ms of latency per query). Measurements were conducted on a single NVIDIA GeForce RTX 3090.
| Metric | Dataset | Reproduced 7B | Ours 7B+Depth | Overhead (Diff) |
|---|---|---|---|---|
| VRAM Usage (GB) | - | 20.16 | 22.56 | +2.40 |
| Total Inference Time | Nr3D | 5 h 04 min | 5 h 10 min | +6 min (+1.9%) |
| Total Inference Time | ScanRefer | 20 min | 21 min | +1 min (+5.0%) |
Table 2. Performance comparison on the Nr3D benchmark (Unit: %). Easy/Hard categorizes performance by query difficulty (number of distractors), while Dep./Indep. categorizes performance by viewpoint dependency. Ours 7B+Depth represents the performance of the proposed methodology with the Relational Depth Text module, while Reproduced 7B is the lightweight baseline reproduced in our hardware environment. Ori Baseline values are cited from the original SeeGround paper, with 72B serving as an upper-bound reference, and ‘-’ indicating unavailable values.
| Method | Easy | Hard | Dep. | Indep. | Acc@25 | Acc@50 |
|---|---|---|---|---|---|---|
| Ori Baseline 72B | 54.50 | 38.30 | 42.30 | 48.20 | 46.10 | - |
| Ori Baseline 7B | 40.80 | 26.30 | 31.40 | 34.30 | 33.30 | - |
| Reproduced 7B | 40.99 | 25.97 | 31.35 | 34.17 | 33.18 | 32.88 |
| Ours 7B+Depth | 44.39 | 29.65 | 34.59 | 37.87 | 36.72 | 36.34 |
Table 3. Performance comparison on the ScanRefer benchmark (Unit: %). Unique/Multiple categorizes performance by the presence of identical class objects in the scene. The rest of the configuration is the same as in Table 2.
| Method | Unique@25 | Multiple@25 | Unique@50 | Multiple@50 | Acc@25 | Acc@50 |
|---|---|---|---|---|---|---|
| Ori Baseline 72B | 75.70 | 34.00 | 68.90 | 30.00 | 44.10 | 39.40 |
| Ori Baseline 7B | - | - | - | - | - | - |
| Reproduced 7B | 64.61 | 28.46 | 60.67 | 26.38 | 37.59 | 35.04 |
| Ours 7B+Depth | 71.35 | 26.76 | 65.17 | 23.91 | 38.01 | 34.33 |
Table 4. Comparison with state-of-the-art zero-shot methods on Nr3D and ScanRefer benchmarks. (Unit: %). ‘SeeGround (Baseline)’ refers to the results reproduced in our environment using the 7B model. Due to space constraints, the ‘Type’ column is omitted; however, all compared methods operate in a zero-shot manner. Note that, for the Nr3D dataset, most SOTA methods only report the Overall Accuracy (identification rate), so the IoU-based Acc@0.50 metric is not applicable (-). ‘†’ indicates results cited from the original papers.
| Method | Backbone (Size) | Nr3D (Acc@25) | ScanRefer (Acc@25) | ScanRefer (Acc@50) |
|---|---|---|---|---|
| VLM-Grounder [7] | GPT-4V (Large) | 48.00 † | 51.60 † | 32.80 † |
| SORT3D [17] | GPT-4o (Large) | 62.00 † | - | - |
| SeqVLM [8] | Doubao-1.5-vision-pro | 53.20 † | 55.60 † | 49.60 † |
| View-on-Graph [9] | Qwen2-VL-72B | 47.60 † | 44.80 † | 40.30 † |
| SeeGround (Baseline) | Qwen2-VL-7B | 33.18 | 37.59 | 35.04 |
| Ours 7B+Depth | Qwen2-VL-7B | 36.72 | 38.01 | 34.33 |
