Article

SESQ: Spatially Aware Encoding and Semantically Guided Querying for 3D Grounding

1 School of Computer Engineering, Jimei University, Xiamen 361000, China
2 Xiamen Taqu Information Technology Co., Ltd., Xiamen 361000, China
* Authors to whom correspondence should be addressed.
Computers 2026, 15(3), 145; https://doi.org/10.3390/computers15030145
Submission received: 25 January 2026 / Revised: 13 February 2026 / Accepted: 14 February 2026 / Published: 1 March 2026
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

Abstract

3D visual grounding is a fundamental task for human–machine interaction, aiming to localize specific objects in complex 3D point clouds based on natural language descriptions. Despite recent advancements, existing Transformer-based architectures often rely on absolute position embeddings and heuristic query initialization, which lack the capacity to capture fine-grained relative spatial dependencies and fail to effectively filter out scene clutter. In this paper, we propose SESQ, a novel framework that synergizes Spatially Aware Encoding and Semantically Guided Querying for 3D grounding. Our approach introduces two key innovations. First, we propose the Rotary Spatially Aware Encoder (RSAE), which incorporates Rotary Position Embeddings (RoPE) into the self-attention layers. By transforming 3D coordinates into a rotary representation, RSAE enables the model to inherently capture relative spatial distances and maintains geometric consistency throughout the encoding stage. Second, a Semantic Query Initialization (SQI) module is designed to initialize object queries by explicitly computing the cross-modal similarity between textual embeddings and visual point cloud features. By replacing traditional heuristic-based selection with semantic-aware alignment, SQI ensures that the decoding process originates from contextually relevant object candidates, significantly reducing the impact of task-irrelevant distractors. Extensive experiments on ScanRefer and ReferIt3D (Nr3D/Sr3D) benchmarks demonstrate the effectiveness of our framework. Compared to the baseline EDA, our method achieves a significant performance gain of 2.68% in overall Acc@0.5 on ScanRefer, a 4.9% improvement on the challenging Nr3D “Hard” subset, and a 1.1% increase in overall Acc@0.25 on Sr3D.

1. Introduction

Enabling intelligent agents to perceive and interact with the physical world through natural language is a primary objective of artificial intelligence, with applications in autonomous driving [1,2,3] and robotic manipulation [4,5,6]. While significant progress has been made in 2D visual grounding (2DVG), images inherently lack the depth and geometric information required for full spatial understanding. This limitation has driven the extension of grounding tasks to three dimensions, leading to the development of 3D visual grounding (3DVG), which aims to localize objects within 3D point clouds based on linguistic descriptions.
Early 3DVG methods predominantly adopted two-stage frameworks [7,8,9,10,11]. These approaches first generate potential 3D object proposals and then match them with the text query through feature comparison. However, such methods are limited by the fact that the initial detection stage is agnostic to the language context. This often results in a large number of task-irrelevant proposals, making it difficult for the model to focus on regions specified by linguistic attributes or relationships.
To address these inefficiencies, single-stage methods [12,13,14,15,16] have emerged to perform end-to-end localization. Current mainstream approaches typically adopt DETR-like architectures [15], where visual and textual features are extracted separately and then fused within a transformer-based encoder. This architecture utilizes self-attention to model intra-modal dependencies and cross-attention to facilitate multimodal alignment. The encoder outputs a set of learned object queries that encapsulate both visual and linguistic information. These queries are subsequently refined by a transformer decoder to produce final predictions. By eliminating the explicit proposal generation step, these single-stage frameworks achieve more streamlined and efficient grounding.
Significant progress has been made, yet current 3DVG systems still face two critical challenges. The first is inaccurate bounding box prediction, where localization fails to precisely align with target boundaries. The second and more fundamental issue is incorrect target object selection, especially in complex scenes with multiple similar candidates. As illustrated in Figure 1, previous methods (red boxes) often misidentify targets or produce imprecise boxes due to two architectural limitations: (1) heuristic-based query initialization that lacks linguistic semantic guidance and (2) insufficient modeling of fine-grained spatial relationships. Unlike data-centric approaches that focus on dataset scaling, we aim to address these core deficiencies in query formulation and spatial reasoning.
To overcome these challenges, we introduce SESQ, a novel framework that couples Spatially Aware Encoding with Semantically Guided Querying and is specifically designed to enhance the model’s capacity for spatial relational reasoning. Our approach addresses these challenges through two key components: the Semantic Query Initialization (SQI) module and the Rotary Spatially Aware Encoder (RSAE).
The RSAE plays a crucial role in our SESQ framework by enhancing the encoder’s ability to capture relative spatial relationships through rotary position encoding. In conventional DETR-based 3DVG methods, the encoder processes text and point cloud features separately through self-attention layers, followed by bi-directional cross-attention for modal interaction. However, the standard self-attention mechanism treats absolute positions as independent embeddings, failing to inherently capture the relative spatial dependencies that are essential for precise object localization. To address this limitation, we integrate Rotary Position Embeddings (RoPE) into the point cloud branch of the encoder. Unlike traditional absolute position encoding, RoPE transforms 3D coordinate information into a rotary representation within the attention mechanism, enabling the model to naturally encode relative spatial distances and orientations. This allows the self-attention layers to explicitly model geometric relationships between points based on their relative positions, rather than treating each point’s location independently. By incorporating RSAE, the encoder gains enhanced spatial awareness, leading to more accurate bounding box prediction and better discrimination between target objects in complex scenes.
The SQI module has been developed to address a critical challenge in DETR-based grounding frameworks. Although these frameworks are powerful, their performance can be constrained by the quality of the initial object queries. In many existing methods, these queries are initialized randomly or in a manner agnostic to the language description, leading to significant semantic ambiguity. This ambiguity forces the model to spend considerable effort distinguishing the target object from a noisy set of initial candidates, particularly in complex scenes with multiple similar objects. Consequently, the model’s ability to learn accurate text–point correspondences is impeded from the outset, potentially capping the achievable accuracy. To mitigate this, SQI is designed as a novel mechanism that generates queries with strong semantic priors. By computing the similarity between the textual feature and point feature, SQI explicitly selects the most relevant features as the initial queries.
Extensive experiments demonstrate that through the synergistic integration of RSAE and SQI, our model achieves notable performance gains, effectively validating the proposed SESQ framework. Our primary contributions are summarized as follows:
  • We develop RSAE, a novel Rotary Spatially Aware Encoder that explicitly encodes and leverages relative spatial relationships, overcoming the limitations of existing implicit methods in understanding complex 3D scenes.
  • We introduce SQI, a semantic-aware initialization module that generates context-informed queries by measuring similarity between text and point cloud features, effectively narrowing the search space and resolving initial ambiguity.
  • We performed extensive empirical evaluations on the ScanRefer benchmark dataset, demonstrating the significant efficacy and robustness of SESQ, particularly in improving localization accuracy and success rates in complex multi-object scenarios.
The remainder of this paper is organized as follows: Section 2 reviews the related work in 3D visual grounding and transformer-based architectures. Section 3 details the architecture of the proposed SESQ framework, including the Rotary Spatially Aware Encoder (RSAE) and the Semantic Query Initialization (SQI) module. Section 4 presents the experimental results, comparative analysis, and ablation studies on the ScanRefer and ReferIt3D benchmarks. Finally, Section 5 concludes the paper and discusses potential future research directions.

2. Related Work

2.1. 2D Visual Grounding

2DVG holds a significant position at the intersection of computer vision and natural language processing [17,18,19], with the core objective of localizing a specific target within an image based on a natural language referring expression. Historically, solutions have largely fallen into two categories: two-stage methods and one-stage methods. Two-stage approaches first employ an object detector to generate region proposals and then select the target by matching language features with these proposals. While effective, many early two-stage methods often treated all proposals equally, regardless of their referential importance. To refine this, RVG-TREE [20] introduced automated binary tree construction for bottom-up visual reasoning, and NMTREE [21] proposed the Neural Module Tree Network to regularize the grounding process. Furthermore, methods like [22] leveraged variational Bayesian approaches to manage complex contextual modeling. In contrast, one-stage methods eliminate the explicit proposal generation, leading to more streamlined end-to-end solutions. Noteworthy examples include TransVG [23], which demonstrated that simple Transformer encoder layers could achieve high performance by replacing complex fusion modules, and RCCF [24], which introduced Real-time Cross-modal Correlation Filtering to achieve real-time inference without sacrificing accuracy.
However, one-stage methods designed for 2DVG cannot be directly applied to 3DVG. First, 3D point clouds are vast, making it computationally impractical to consider each point as a candidate object. Second, due to the large scale and complexity of 3D scenes, it is challenging to model relationships among all objects and identify the target. Additionally, the sliding-window approach commonly used in 2D single-stage methods is unsuitable for 3D points because 2D inputs are highly structured, whereas 3D points are sparse, unordered, and irregular. Therefore, we propose a VG framework specifically designed for 3D point clouds.

2.2. 3D Visual Grounding

The field of 3D Visual Grounding (3DVG) has experienced rapid growth, fueled primarily by two factors: the emergence of large-scale datasets such as ScanRefer [7] and ReferIt3D [8], and the transformative progress in both 3D point cloud processing [25,26,27,28] and Natural Language Understanding (NLU) [29,30]. Specifically, ScanRefer [7] pioneered the task by linking natural language descriptions to object bounding boxes in ScanNet [31] indoor scenes. Meanwhile, ReferIt3D [8] introduced Sr3D and Nr3D, focusing on synthetic and manually annotated referential utterances, respectively. Methodologically, existing 3DVG frameworks are generally categorized into two-stage and single-stage paradigms.
Two-stage methods. Early 3DVG approaches predominantly adopted a two-stage “detect-then-match” paradigm. TGNN [9] and InstanceRefer [10] were among the pioneers, treating the task as a matching problem between pre-segmented 3D instances and linguistic queries. Despite their initial success, these methods suffer from a fundamental bottleneck: the object proposal generation is decoupled from the language context. This separation prevents the model from leveraging textual cues to focus on task-relevant regions during the detection phase, often leading to suboptimal localization if the target object is missed in the first stage.
Single-stage methods. To overcome these limitations, single-stage architectures have emerged to perform localization in an end-to-end manner. 3D-SPS [12] pioneered this direction by iteratively refining keypoints to directly regress bounding boxes. Drawing inspiration from MDETR [32], BUTD-DETR [13] utilized a Transformer-based architecture to fuse multimodal features and generate object queries for direct localization. Subsequent works like EDA [14] introduced fine-grained alignment losses to ensure consistency between visual and textual representations. More recently, MCLN [33] and AugRefer [34] have further advanced the field by incorporating multi-task learning (e.g., 3D segmentation) and cross-modal data augmentation to enhance feature robustness.
However, a persistent challenge remains: existing methods often rely on absolute position embeddings and heuristic query initialization, which struggle to model complex inter-object spatial relationships and fail to effectively filter out scene clutter. This study addresses these gaps by introducing the SESQ framework. Specifically, unlike absolute encoding, our RSAE module leverages rotary position embeddings to capture fine-grained relative orientations, while the SQI module replaces heuristic selection with semantic-aware alignment to ensure that object queries are initialized from contextually relevant candidates. This approach allows for more robust disambiguation in cluttered indoor environments compared to previous architectures.

3. Method

3.1. Overview

Our proposed approach, which we term SESQ, offers a robust and detailed methodology for accurately localizing objects within a 3D scene based on natural language descriptions. As visually depicted in Figure 2a, the SESQ framework is architecturally composed of four interconnected core modules: the Feature Encoder, the Semantic Query Initialization (SQI) module, the Rotary Spatially Aware Encoder (RSAE), and the Grounding Head.
Rotary Spatially Aware Encoder. We first extract modality-specific representations: visual features $F_v \in \mathbb{R}^{N_p \times d}$ are obtained from raw point clouds via PointNet++ [26], while textual features $F_t \in \mathbb{R}^{N_l \times d}$ are encoded from descriptions using RoBERTa [35]. Conventional encoders typically employ absolute position embeddings, which treat each point’s location independently and struggle to capture relative spatial dependencies. To address this, we propose the RSAE, which integrates Rotary Position Embeddings (RoPE) into the self-attention layers of the point cloud branch. By applying a rotary transformation to the 3D coordinates, RSAE injects relative spatial information directly into the Query–Key interactions within the attention mechanism. This formulation allows the encoder to explicitly model the geometric distance and orientation between points, ensuring that the learned features are aware of the relative spatial layout. Subsequently, these spatially enhanced visual features are fused with the textual features through bi-directional cross-attention layers to produce refined multi-modal representations $F_v$ and $F_t$. The specific implementation and functionalities of the RSAE will be thoroughly discussed in Section 3.2.
Semantic Query Initialization. To address the aforementioned limitation of conventional object query generation, we introduce a novel SQI module. In contrast to previous methods, our SQI synergistically leverages both the fused visual features $F_v$ and textual features $F_t$ to inform the query selection process. By explicitly conditioning the query initialization on both modalities, our approach ensures that the selected object queries $F_o$ are not only visually conspicuous but also semantically aligned with the textual context. This leads to more accurate and context-aware object representations being passed to the decoder. The detailed mechanism of our SQI module is elaborated in Section 3.3.
Feature Decoder. Following the enhanced encoding stage, we employ a Transformer-based decoder to generate the final object predictions. The decoder takes the $K$ object queries initialized by the SQI module as input and iteratively refines them through multiple decoder layers. Each layer consists of a self-attention mechanism to model dependencies between queries, followed by a multi-modal cross-attention mechanism that aggregates relevant information from the spatially enhanced visual features $F_v$ and textual features $F_t$. By leveraging the geometric awareness embedded in the features by the RSAE, the decoder can more effectively distinguish between similar-looking candidates and regress the precise 3D bounding box coordinates. The final output of the decoder is processed by prediction heads to produce the classification scores and spatial offsets for the grounded object.
Grounding Head. Following established methodologies such as [13,14], our Grounding Head is designed to produce the final predictions for both the spatial location and semantic relevance of the target object. Specifically, it utilizes dedicated multilayer perceptrons (MLPs) to predict the 3D bounding boxes, textual positions, and semantic confidence scores across multiple decoder layers.
To effectively balance different auxiliary tasks and ensure stable convergence, we employ a weighted multi-task loss function. The total training objective L is formulated as:
$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{query}} + \frac{1}{N+1}\sum_{i=0}^{N}\left(\lambda_1 \mathcal{L}_{\mathrm{box},i} + \lambda_2 \mathcal{L}_{\mathrm{giou},i} + \lambda_3 \mathcal{L}_{\mathrm{sem},i} + \lambda_4 \mathcal{L}_{\mathrm{ce},i}\right)$$
where $N$ denotes the number of decoder layers. Following the hyper-parameter configurations in [14], we set $\alpha = 8$, $\lambda_1 = 5$, and $\lambda_2 = 1$. For the semantic-related terms $\mathcal{L}_{\mathrm{sem}}$ and $\mathcal{L}_{\mathrm{ce}}$, the weights $\lambda_3$ and $\lambda_4$ are empirically set to 0.5 for the ScanRefer dataset and 1.0 for the ReferIt3D dataset. This weighted loss framework enables the model to jointly optimize localization precision and cross-modal alignment. We also acknowledge that incorporating advanced gradient-based optimization schemes, such as the focal quotient gradient system [36], could further enhance the training stability and convergence efficiency of our deep architecture in future iterations.
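As a minimal sketch of how this weighted objective aggregates the per-layer terms (using the ScanRefer weights above; the loss values passed in are placeholders, not real training losses):

```python
def total_loss(l_query, per_layer,
               alpha=8.0, lam=(5.0, 1.0, 0.5, 0.5)):
    """Weighted multi-task objective sketch.

    l_query:   scalar query loss L_query.
    per_layer: list of N+1 dicts holding the per-decoder-layer terms
               'box', 'giou', 'sem', 'ce'.
    Defaults follow the ScanRefer configuration described in the text:
    alpha = 8, lambda1 = 5, lambda2 = 1, lambda3 = lambda4 = 0.5.
    """
    l1, l2, l3, l4 = lam
    # Sum the weighted terms over all N+1 layers, then average.
    layer_sum = sum(l1 * p['box'] + l2 * p['giou'] +
                    l3 * p['sem'] + l4 * p['ce'] for p in per_layer)
    return alpha * l_query + layer_sum / len(per_layer)
```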
Section 3.2 and Section 3.3 will detail the architecture and functionality of the RSAE and SQI, respectively, providing a comprehensive understanding of SESQ.

3.2. Rotary Spatially Aware Encoder

To enhance the model’s sensitivity to relative spatial layouts, we introduce the Rotary Spatially Aware Encoder (RSAE). The core innovation of RSAE is the integration of 3D Rotary Position Embeddings (3D RoPE) into the self-attention mechanism, which enables the encoder to naturally represent relative spatial dependencies within a unified attention manifold.
Axial Decomposition. Given the normalized 3D coordinates $p_i = (x_i, y_i, z_i)$ of point $i$, we aim to encode its spatial information into the projected query $q_i \in \mathbb{R}^{d_{\mathrm{head}}}$ and key $k_i \in \mathbb{R}^{d_{\mathrm{head}}}$. We partition the feature dimension $d_{\mathrm{head}}$ into three distinct sub-spaces, each corresponding to a spatial axis. Let $d_{\mathrm{axis}} = d_{\mathrm{head}}/3$. For each axis $c \in \{x, y, z\}$, the coordinate $p_{i,c}$ is mapped to a set of rotation angles $\{\theta_{i,j,c}\}_{j=1}^{d_{\mathrm{axis}}/2}$ as follows:

$$\theta_{i,j,c} = \gamma \cdot p_{i,c} \cdot \lambda^{-2j/d_{\mathrm{axis}}}$$

where $\gamma$ denotes the coordinate scaling factor and $\lambda$ is the base frequency.
3D Rotary Transformation. For each axis $c$, the rotary transformation applied to the feature segment $x \in \mathbb{R}^{d_{\mathrm{axis}}}$ (representing either $q_{i,c}$ or $k_{i,c}$) can be formulated as a linear combination of its elements. Specifically, for each pair of dimensions $(2j, 2j+1)$, the transformation is defined as:

$$\begin{pmatrix} \tilde{x}^{(2j)} \\ \tilde{x}^{(2j+1)} \end{pmatrix} = \begin{pmatrix} \cos\theta_{i,j,c} & -\sin\theta_{i,j,c} \\ \sin\theta_{i,j,c} & \cos\theta_{i,j,c} \end{pmatrix} \begin{pmatrix} x^{(2j)} \\ x^{(2j+1)} \end{pmatrix}$$

By concatenating the transformed segments for all three axes along with any remaining dimensions, we obtain the spatially aware query $\tilde{q}_i$ and key $\tilde{k}_i$.
Relative Spatial Awareness. The fundamental advantage of RSAE lies in the inner-product property of the rotary transformation. For any two points $i$ and $j$, the attention-score contribution from axis $c$ satisfies:

$$\left\langle \mathrm{Rot}(q_{i,c}, p_{i,c}),\, \mathrm{Rot}(k_{j,c}, p_{j,c}) \right\rangle = q_{i,c}^{\top}\, R_c(p_{j,c} - p_{i,c})\, k_{j,c}$$

where $R_c(p_{j,c} - p_{i,c})$ is a rotation matrix dependent solely on the relative displacement. This formulation proves that the attention weights are explicitly aware of the relative spatial topology. By aggregating these axial rotations, the RSAE enables the encoder to perceive geometric proximities, leading to more robust spatial reasoning in cluttered 3D environments.
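To make the mechanism concrete, the axial decomposition and rotary transformation can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: `gamma` and `base` stand in for $\gamma$ and $\lambda$, and pairs are rotated in interleaved (even, odd) order, one of several equivalent RoPE layouts. The shift-invariance of the resulting Query–Key dot product follows directly from the inner-product property above.

```python
import numpy as np

def rope_3d(x, coord, gamma=1.0, base=10000.0):
    """Apply an axial 3D rotary embedding to one head's q or k vector.

    x:     (d_head,) projected query or key features.
    coord: (3,) normalized xyz coordinates of the point.
    Each axis c receives d_axis = d_head // 3 dimensions; within that
    segment, the pair (2j, 2j+1) is rotated by the angle
    theta_j = gamma * coord[c] * base**(-2j / d_axis).
    Any leftover dimensions (d_head not divisible by 3) are untouched.
    """
    d_head = x.shape[0]
    d_axis = d_head // 3
    out = x.astype(float).copy()
    for c in range(3):
        seg = out[c * d_axis:(c + 1) * d_axis]      # view into `out`
        j = np.arange(d_axis // 2)
        theta = gamma * coord[c] * base ** (-2.0 * j / d_axis)
        cos, sin = np.cos(theta), np.sin(theta)
        even, odd = seg[0::2].copy(), seg[1::2].copy()
        seg[0::2] = cos * even - sin * odd          # 2D rotation per pair
        seg[1::2] = sin * even + cos * odd
    return out
```

Because each rotation angle is linear in the coordinate, the dot product of two rotated vectors depends only on the coordinate difference, so translating the whole scene leaves the attention scores unchanged.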

3.3. Semantic Query Initialization

As shown in Figure 2c, the SQI module generates object queries by explicitly leveraging textual guidance to select the most relevant visual features. The process begins by projecting the fused visual features $F_v \in \mathbb{R}^{D \times N}$ and textual features $F_t \in \mathbb{R}^{W \times D}$ into a common, normalized embedding space to obtain $\bar{F}_v$ and $\bar{F}_t$.
First, a text-guided saliency score vector $S_q \in \mathbb{R}^{N}$ is computed. This is achieved by calculating the similarity between each text token and all visual points, applying a softmax, and then averaging the resulting attention maps across all tokens:

$$S_q = \frac{1}{W}\sum_{w=1}^{W} \mathrm{softmax}_N\!\left( \bar{F}_{t,w}^{\top}\, \bar{F}_v \right)$$

where $\bar{F}_{t,w}$ is the feature of the $w$-th token, and the softmax is applied over the $N$ visual points. The score $S_q$ quantifies the relevance of each visual point to the complete text description.
Second, the final object queries $F_o \in \mathbb{R}^{K \times D}$ are formed by selecting the $K$ features from the original visual set $F_v$ that correspond to the highest saliency scores:

$$F_o = \mathrm{TopK}\!\left( F_v^{\top},\, S_q,\, K \right)$$

In our work, we set $K = N/2$ as an empirical threshold to balance the recall of semantically aligned candidates against the computational efficiency of the decoding process. This mechanism ensures that the initialized queries passed to the decoder are strongly grounded in the provided semantic context.
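The two steps above (text-guided saliency scoring, then top-$K$ selection) can be sketched as follows. This is an illustrative NumPy version, not the paper's code; for simplicity the visual features are passed row-major as $(N, D)$ rather than $(D, N)$:

```python
import numpy as np

def sqi_queries(F_v, F_t, K):
    """Semantic Query Initialization sketch.

    F_v: (N, D) fused visual point features.
    F_t: (W, D) textual token features.
    Returns the (K, D) object queries: the visual features whose
    text-averaged saliency scores are highest.
    """
    # Project into a common normalized embedding space (L2 per row).
    Fv = F_v / np.linalg.norm(F_v, axis=1, keepdims=True)
    Ft = F_t / np.linalg.norm(F_t, axis=1, keepdims=True)

    sim = Ft @ Fv.T                              # (W, N) token-point similarity
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)      # softmax over the N points
    s_q = attn.mean(axis=0)                      # (N,) averaged saliency S_q

    top = np.argsort(-s_q)[:K]                   # indices of the top-K points
    return F_v[top]
```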

4. Experiments

4.1. Dataset and Experimental Setting

  • Datasets. We evaluate our proposed SESQ on two widely recognized benchmarks: ScanRefer and ReferIt3D.
  • ScanRefer [7] is a large-scale dataset built on the ScanNet [31] dataset. It pairs natural language descriptions with 3D objects, featuring 51,583 descriptions for 11,046 objects across 800 scenes. We report results on the “Unique” (only one object of its class in the scene) and “Multiple” (multiple objects of the same class) subsets.
  • ReferIt3D [8] contains two subsets: Nr3D and Sr3D. Nr3D consists of natural human-written descriptions, while Sr3D includes template-based descriptions. Nr3D is further categorized into “Easy” and “Hard” cases based on the number of distractors of the same category. These datasets provide a rigorous test for the model’s ability to handle fine-grained spatial disambiguation.
  • Evaluation Metrics. To quantitatively evaluate the performance of SESQ, we employ the Acc@0.25 and Acc@0.5 metrics, which measure the percentage of predictions where the Intersection over Union (IoU) between the predicted and ground-truth 3D bounding boxes is greater than 0.25 and 0.5, respectively. For ReferIt3D, following standard protocol [8], we report the classification-based localization accuracy.
  • Experimental Setting. The proposed framework was implemented on a workstation equipped with three NVIDIA GeForce RTX 4090 GPUs (24 GB VRAM each) and an Intel Core i9-level CPU, operating on Ubuntu 22.04 LTS. The software environment consisted of Python 3.9.10, PyTorch 1.13, and CUDA 11.7. During the training phase, we utilized a total batch size of 12. For the ScanRefer dataset, the model was optimized using an initial learning rate of $2 \times 10^{-4}$ and a weight decay of $5 \times 10^{-4}$, where the best performance was typically achieved around the 60th epoch. For the ReferIt3D subsets, the initial learning rate was set to $1 \times 10^{-4}$ with a weight decay of $5 \times 10^{-4}$. The optimal models for Nr3D and Sr3D were obtained around the 160th and 40th epochs, respectively.
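As a concrete reference for the Acc@0.25 and Acc@0.5 protocol, the following is a minimal sketch of axis-aligned 3D IoU, assuming boxes are given as min/max corner tuples; it is illustrative only, not the evaluation code used by the benchmarks:

```python
def iou_3d(a, b):
    """Axis-aligned 3D IoU between boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    A prediction counts toward Acc@t when its IoU with the ground-truth
    box exceeds the threshold t (0.25 or 0.5).
    """
    # Intersection volume: product of per-axis overlaps, clamped at zero.
    inter = 1.0
    for d in range(3):
        lo = max(a[d], b[d])
        hi = min(a[d + 3], b[d + 3])
        inter *= max(0.0, hi - lo)

    def vol(x):
        return (x[3] - x[0]) * (x[4] - x[1]) * (x[5] - x[2])

    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0
```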

4.2. Quantitative Comparison

ScanRefer. Table 1 provides a quantitative comparison between our method and existing 3DVG approaches on the ScanRefer dataset.
In the Unique setting, our method achieves 86.75% ACC@0.25 and 72.94% ACC@0.5. Beyond the numerical lead, the improvement in Unique cases suggests that the proposed SQI module provides a more accurate initial guess by aligning text with point cloud features. This indicates that even in simpler scenes without distractors, cross-modal anchoring is essential for achieving high localization precision.
For the Multiple instances, which comprise approximately 81% of the dataset, our approach yields 49.90% ACC@0.25 and 39.37% ACC@0.5. The consistent performance advantage in this complex setting implies that our SESQ framework effectively addresses the challenge of “disambiguation.” By capturing relative spatial dependencies through RSAE, the model can better distinguish the target from multiple distractors of the same category, which is a key requirement for reliable spatial reasoning.
Across the Overall metrics, it is observed that the performance gain is more pronounced at the stricter 0.5 IoU threshold. This trend confirms that our framework not only locates the general area of an object but also achieves much tighter bounding box alignment. These results demonstrate that the synergy between RSAE and SQI improves the model’s robustness in dense indoor environments, providing more precise spatial delimitation than previous heuristic-based methods.
ReferIt3D. To further verify the generalization of our SESQ, we report the performance on the ReferIt3D benchmarks (Nr3D and Sr3D) in Table 2. Our method consistently outperforms the baseline EDA [14] across both natural (Nr3D) and synthetic (Sr3D) scenarios.
In the more challenging Nr3D dataset, which contains diverse human-written descriptions, our framework achieves a significant gain, especially in the “Hard” setting (from 46.1% to 51.0%), outperforming EDA by 4.9%. This improvement is particularly meaningful as “Hard” samples involve scenes with multiple distractor objects of the same category. This performance boost confirms that our SQI module effectively filters out irrelevant clutter by aligning textual semantics with visual candidates, rather than relying on heuristic initialization.
Furthermore, in the Sr3D dataset, although the descriptions are template-based, our method still maintains a lead in Overall accuracy (69.2% vs. 68.1%). The consistent superiority in “View-independent” scenarios across both datasets further validates that our RSAE module, by incorporating Rotary Position Embeddings (RoPE), captures stable relative spatial dependencies that do not rely on specific camera viewpoints. While MCLN [33] shows a higher Overall score in Nr3D, it does not provide fine-grained results for Easy/Hard subsets; in contrast, our model demonstrates robust disambiguation capabilities in complex, multi-object environments.

4.3. Qualitative Comparison

In this section, all qualitative experiments were conducted on ScanRefer. We present a qualitative comparison of SESQ against the BUTD-DETR [13] and EDA [14] methods, as shown in Figure 3; we select these two methods because they represent the foundational baselines upon which our framework is developed.
In terms of localization correctness, as illustrated in cases (a) and (b), the baseline models struggle to distinguish the target objects from cluttered backgrounds, such as the door “opened flat against the wall” or the “thin table” in a dense setting. In contrast, our method accurately identifies the target by leveraging semantic-aware queries from SQI to filter out task-irrelevant distractors. Furthermore, regarding bounding box precision, even when the baselines successfully locate the general area of the target, our method produces significantly more precise results, as shown in cases (c) and (d). For instance, in (c), our model correctly interprets the ordinal spatial relationship (“third chair from the left”), whereas the baselines output misaligned red boxes. Similarly, in (d), our framework generates a tighter and more accurate box for the table under a microwave. These visual results align with our quantitative findings, confirming that RSAE and SQI together enhance both the coarse-grained localization and fine-grained geometric refinement in complex 3D environments.

4.4. Ablation Study

  • Ablation Study. Table 3 presents the ablation results for RSAE and SQI on the ScanRefer dataset. We evaluate the contribution of each module using ACC@0.25 and ACC@0.5 across Unique, Multiple, and Overall categories.
  • Baseline. The first row shows the performance without both RSAE and SQI. This baseline relies on absolute position embeddings and heuristic query selection, yielding an Overall accuracy of 53.83% and 41.70% at the two IoU thresholds. These values serve as the reference for all subsequent experiments.
  • Effect of RSAE. Incorporating RSAE alone leads to a consistent performance gain. The Overall ACC@0.5 increases to 43.90%, which is a 2.2% improvement over the baseline. Notably, RSAE achieves the highest ACC@0.5 of 39.65% in the Multiple category. This suggests that the rotary-based relative spatial information is effective for regressing precise bounding boxes, especially in scenes with many similar objects.
  • Effect of SQI. Using SQI independently also improves the results, with the Overall ACC@0.25 reaching 55.13%. The impact is significant in the Unique setting, where ACC@0.5 rises to 72.16%. These results indicate that semantic-aware query initialization helps the model focus on relevant object candidates and reduces the interference from background clutter.
  • Synergy of RSAE and SQI. The full SESQ framework combining both modules achieves the best results in most metrics, particularly 55.41% Overall ACC@0.25 and 44.38% Overall ACC@0.5. The data shows that SQI and RSAE address different challenges. SQI improves the initial query quality through cross-modal alignment, while RSAE refines the spatial localization through geometric awareness. The improvement at the stricter 0.5 IoU threshold confirms the robustness of the integrated approach.
  • Efficiency Analysis. We also report the inference speed in the last column of Table 3. As emphasized in [37], a comprehensive evaluation of AI infrastructure should integrate performance, efficiency, and cost dimensions. Following this perspective, we observe that the full SESQ framework maintains a competitive inference speed of 5.01 FPS compared to the baseline. The marginal decrease in FPS suggests that the computational overhead introduced by the RSAE and SQI modules is negligible. Furthermore, we observed that the GPU memory consumption remains almost constant across all configurations. These results indicate that our method achieves significant accuracy gains without compromising the model’s suitability for real-time applications, fulfilling the multifaceted requirements of performance and efficiency.

5. Conclusions and Discussion

This paper has presented SESQ, a Transformer-based framework for 3D visual grounding. The proposed RSAE module employs rotary position embeddings to represent spatial information, facilitating the capture of relative geometric dependencies during the encoding phase. Additionally, the SQI module utilizes cross-modal semantic alignment for object query initialization, intended to mitigate the interference of scene clutter. Evaluations on the ScanRefer and ReferIt3D benchmarks indicate that the integration of these two modules contributes to performance improvements, particularly in scenarios containing multiple distractor objects.
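The SQI idea summarized above, selecting object-query seeds by explicit cross-modal similarity instead of heuristic sampling, can be sketched in a few lines of NumPy. This is a hedged illustration: the function name `init_semantic_queries`, the max-over-tokens scoring, and the top-k selection are our simplifying assumptions about one plausible realization, not the paper's exact module.

```python
import numpy as np

def init_semantic_queries(point_feats, text_feats, num_queries=256):
    """Select object-query seeds by cross-modal cosine similarity.

    point_feats: (N, D) encoded point/seed features.
    text_feats:  (T, D) token embeddings of the referring description.
    Returns the indices of the num_queries points whose features best
    align with the text, together with their similarity scores.
    """
    # L2-normalize both modalities so the inner product is cosine similarity
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-8)
    sim = p @ t.T                    # (N, T) point-token similarity
    score = sim.max(axis=1)          # best-matching token per point
    order = np.argsort(-score)       # highest similarity first
    idx = order[:num_queries]
    return idx, score[idx]
```

Seeds scoring highest against the description then initialize the decoder queries, so decoding starts from candidates that are already semantically plausible rather than from uniformly sampled clutter.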
  • Academic and Practical Implications. From an academic perspective, this work suggests that incorporating relative spatial priors through rotary mechanisms can be more effective for 3D coordinate modeling than relying solely on absolute position embeddings. Practically, the observed performance in multi-object environments indicates that the framework could be adapted for autonomous systems requiring object disambiguation, such as industrial automated guided vehicles (AGVs) or service robots in indoor settings.
  • Limitations and Failure Analysis. Despite the observed gains, certain constraints remain. A characteristic failure case occurs when textual descriptions lack sufficient discriminative cues or contain high degrees of semantic ambiguity. In such instances, the SQI module may prioritize incorrect candidates due to the inherent uncertainty in the initial semantic alignment. Furthermore, the current spatial encoding remains focused on global point-wise relations, which may not fully represent the fine-grained hierarchical structures within complex objects.
  • Future Directions. Future research could investigate more advanced contrastive learning objectives to enhance the discriminative power of the initial semantic queries. Additionally, evaluating the framework on larger-scale datasets with more diverse object categories and exploring the synergy between grounding and 3D semantic segmentation represent promising directions for further investigation.

Author Contributions

Conceptualization, J.L., T.H. and M.C.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, T.H., M.C. and Y.W.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, T.H. and M.C.; visualization, J.L.; supervision, T.H. and M.C.; project administration, T.H. and M.C.; funding acquisition, T.H., M.C. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 42371457, Grant 42301468, and Grant 42371343; in part by the Key Project of Natural Science Foundation of Fujian Province, China, under Grant 2022J02045; in part by the Natural Science Foundation of Fujian Province, China, under Grant 2022J01337, Grant 2022J01819, Grant 2023J01801, Grant 2023J01799, Grant 2022J05157, and Grant 2022J011394; in part by Xiamen City’s Leading Project under Grant 3502Z20231038; in part by the Natural Science Foundation of Xiamen, China, under Grant 3502Z202373036 and Grant 3502Z202371019; and in part by the Open Competition for Innovative Projects of Xiamen under Grant 3502Z20251012.

Data Availability Statement

The data used in this study are sourced from public datasets ScanRefer and ReferIt3D, accessible at their respective official websites. The custom code used to generate the results is available from the corresponding author on reasonable request.

Conflicts of Interest

Author Tiancai Huang was employed by the company Xiamen Taqu Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhan, Y.; Yuan, Y.; Xiong, Z. Mono3dvg: 3d visual grounding in monocular images. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 6988–6996. [Google Scholar]
  2. Yang, Z.; Chen, L.; Sun, Y.; Li, H. Visual point cloud forecasting enables scalable autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14673–14684. [Google Scholar]
  3. Shao, H.; Hu, Y.; Wang, L.; Song, G.; Waslander, S.L.; Liu, Y.; Li, H. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 15120–15130. [Google Scholar]
  4. Duan, J.; Ou, Y.; Xu, S.; Liu, M. Sequential learning unification controller from human demonstrations for robotic compliant manipulation. Neurocomputing 2019, 366, 35–45. [Google Scholar] [CrossRef]
  5. Wu, C.; Ji, J.; Wang, H.; Ma, Y.; Huang, Y.; Luo, G.; Fei, H.; Sun, X.; Ji, R. Rg-san: Rule-guided spatial awareness network for end-to-end 3d referring expression segmentation. Adv. Neural Inf. Process. Syst. 2024, 37, 110972–110999. [Google Scholar]
  6. Mane, A.M.; Weerakoon, D.; Subbaraju, V.; Sen, S.; Sarma, S.E.; Misra, A. Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 9017–9026. [Google Scholar]
  7. Chen, D.Z.; Chang, A.X.; Nießner, M. Scanrefer: 3d object localization in rgb-d scans using natural language. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 202–221. [Google Scholar]
  8. Achlioptas, P.; Abdelreheem, A.; Xia, F.; Elhoseiny, M.; Guibas, L. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 422–440. [Google Scholar]
  9. Huang, P.H.; Lee, H.H.; Chen, H.T.; Liu, T.L. Text-guided graph neural networks for referring 3d instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 1610–1618. [Google Scholar]
  10. Yuan, Z.; Yan, X.; Liao, Y.; Zhang, R.; Wang, S.; Li, Z.; Cui, S. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1791–1800. [Google Scholar]
  11. Yang, Z.; Zhang, S.; Wang, L.; Luo, J. Sat: 2d semantics assisted training for 3d visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1856–1866. [Google Scholar]
  12. Luo, J.; Fu, J.; Kong, X.; Gao, C.; Ren, H.; Shen, H.; Xia, H.; Liu, S. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16454–16463. [Google Scholar]
  13. Jain, A.; Gkanatsios, N.; Mediratta, I.; Fragkiadaki, K. Bottom up top down detection transformers for language grounding in images and point clouds. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022; pp. 417–433. [Google Scholar]
  14. Wu, Y.; Cheng, X.; Zhang, R.; Cheng, Z.; Zhang, J. Eda: Explicit text-decoupling and dense alignment for 3d visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19231–19242. [Google Scholar]
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  16. Guo, P.; Zhu, H.; Ye, H.; Li, T.; Chen, T. Revisiting 3D visual grounding with Context-aware Feature Aggregation. Neurocomputing 2024, 601, 128195. [Google Scholar] [CrossRef]
  17. Gao, C.; Chen, J.; Liu, S.; Wang, L.; Zhang, Q.; Wu, Q. Room-and-object aware knowledge reasoning for remote embodied referring expression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3064–3073. [Google Scholar]
  18. Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; Darrell, T. Natural language object retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4555–4564. [Google Scholar]
  19. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1307–1315. [Google Scholar]
  20. Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to compose and reason with language tree structures for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 684–696. [Google Scholar] [CrossRef] [PubMed]
  21. Liu, D.; Zhang, H.; Wu, F.; Zha, Z.J. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4673–4682. [Google Scholar]
  22. Zhang, H.; Niu, Y.; Chang, S.F. Grounding referring expressions in images by variational context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4158–4166. [Google Scholar]
  23. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1769–1779. [Google Scholar]
  24. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10880–10889. [Google Scholar]
  25. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  26. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  27. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16259–16268. [Google Scholar]
  28. Wu, X.; Jiang, L.; Wang, P.S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler Faster Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4840–4851. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  31. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  32. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1780–1790. [Google Scholar]
  33. Qian, Z.; Ma, Y.; Lin, Z.; Ji, J.; Zheng, X.; Sun, X.; Ji, R. Multi-branch Collaborative Learning Network for 3D Visual Grounding. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2024; pp. 381–398. [Google Scholar]
  34. Wang, X.; Zhao, N.; Han, Z.; Guo, D.; Yang, X. Augrefer: Advancing 3d visual grounding via cross-modal augmentation and spatial relation-based referring. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8006–8014. [Google Scholar]
  35. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  36. Lv, C.; Lv, X.; Wang, Z.; Zhao, T.; Tian, W.; Zhou, Q.; Zeng, L.; Wan, M.; Liu, C. A focal quotient gradient system method for deep neural network training. Appl. Soft Comput. 2025, 184, 113704. [Google Scholar] [CrossRef]
  37. He, Q. A Unified Metric Architecture for AI Infrastructure: A Cross-Layer Taxonomy Integrating Performance, Efficiency, and Cost. arXiv 2026, arXiv:2511.21772. [Google Scholar]
Figure 1. Existing methods suffer from two major issues: (1) inaccurate bounding box prediction, and (2) incorrect target object selection. Our SESQ framework addresses these challenges through semantic-aware query initialization and spatially aware encoding. Green box: correct; red box: incorrect.
Figure 2. SESQ Framework Architecture and Feature Processing Flow Overview. (a) Overview of SESQ; (b) architecture of RSAE; (c) architecture of SQI.
Figure 3. Qualitative results with left and right orientation descriptions on ScanRefer. (a,b) highlight our method’s ability to localize targets correctly. (c,d) demonstrate superior bounding box precision compared to baselines.
Table 1. Comparison of different methods on the ScanRefer dataset. Best in bold. Unique means there is only one object of the target’s category in the scene, and Multiple means there are multiple objects of the same category. Unique accounts for about 19% of the data and Multiple for about 81%.
| Method | Pub. | Unique 0.25 | Unique 0.5 | Multiple 0.25 | Multiple 0.5 | Overall 0.25 | Overall 0.5 |
|---|---|---|---|---|---|---|---|
| ScanRefer [7] | ECCV 2020 | 67.46 | 46.19 | 32.06 | 21.26 | 38.97 | 26.10 |
| ReferIt3D [8] | ECCV 2020 | 53.8 | 37.5 | 21.0 | 12.8 | 26.4 | 16.9 |
| TGNN [9] | AAAI 2021 | 68.61 | 56.80 | 29.84 | 23.18 | 37.37 | 29.70 |
| InstanceRefer [10] | ICCV 2021 | 77.45 | 66.83 | 31.27 | 24.77 | 40.23 | 32.93 |
| 3D-SPS [12] | CVPR 2022 | 84.12 | 66.72 | 40.32 | 29.82 | 48.82 | 36.98 |
| BUTD-DETR [13] | ECCV 2022 | 82.88 | 64.98 | 44.73 | 33.97 | 50.42 | 38.60 |
| EDA [14] | CVPR 2023 | 86.40 | 69.42 | 48.11 | 36.82 | 53.83 | 41.70 |
| MCLN [33] | ECCV 2024 | 84.43 | 68.36 | 49.72 | 38.41 | 54.30 | 42.64 |
| AugRefer [34] | AAAI 2025 | 86.21 | 70.75 | 49.96 | 39.06 | 55.68 | 44.03 |
| Ours | – | 86.75 | 72.94 | 49.90 | 39.37 | 55.41 | 44.38 |
Table 2. Comparison of different methods on the ReferIt3D dataset. Best in bold. The dataset is split into Nr3D and Sr3D. All metrics are reported as Acc@0.25 (%).
Nr3D

| Method | Pub. | Easy | Hard | View-dep. | View-indep. | Overall |
|---|---|---|---|---|---|---|
| ReferIt3DNet [8] | ECCV 2020 | 43.6 | 27.9 | 32.5 | 37.1 | 35.6 |
| TGNN [9] | AAAI 2021 | 44.2 | 30.6 | 35.8 | 38.0 | 37.3 |
| InstanceRefer [10] | ICCV 2021 | 46.0 | 31.8 | 34.5 | 41.9 | 38.8 |
| 3D-SPS [12] | CVPR 2022 | 58.1 | 45.1 | 48.0 | 53.2 | 51.5 |
| BUTD-DETR [13] | ECCV 2022 | – | – | – | – | 49.1 |
| EDA [14] | CVPR 2023 | 58.2 | 46.1 | 50.2 | 53.1 | 52.1 |
| MCLN [33] | ECCV 2024 | – | – | – | – | 59.8 |
| Ours | – | 62.4 | 51.0 | 53.1 | 58.4 | 56.6 |

Sr3D

| Method | Pub. | Easy | Hard | View-dep. | View-indep. | Overall |
|---|---|---|---|---|---|---|
| ReferIt3DNet [8] | ECCV 2020 | 44.7 | 31.5 | 39.2 | 40.8 | 40.8 |
| TGNN [9] | AAAI 2021 | 48.5 | 36.9 | 45.8 | 45.0 | 45.0 |
| InstanceRefer [10] | ICCV 2021 | 51.1 | 40.5 | 45.4 | 48.1 | 48.0 |
| 3D-SPS [12] | CVPR 2022 | 56.2 | 65.4 | 49.2 | 63.2 | 62.6 |
| BUTD-DETR [13] | ECCV 2022 | – | – | – | – | 65.6 |
| EDA [14] | CVPR 2023 | 70.3 | 62.9 | 54.1 | 68.8 | 68.1 |
| MCLN [33] | ECCV 2024 | – | – | – | – | 68.4 |
| Ours | – | 71.8 | 63.2 | 53.7 | 69.9 | 69.2 |
Table 3. Experimental results of ablating SESQ. Best results are in bold.
| RSAE | SQI | Unique 0.25 | Unique 0.5 | Multiple 0.25 | Multiple 0.5 | Overall 0.25 | Overall 0.5 | Inference Speed (FPS) |
|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 86.40 | 69.42 | 48.11 | 36.82 | 53.83 | 41.70 | 5.43 |
| ✓ | ✗ | 87.67 | 70.26 | 49.28 | 39.65 | 54.78 | 43.90 | 5.18 |
| ✗ | ✓ | 87.60 | 72.16 | 49.44 | 38.57 | 55.13 | 43.58 | 5.22 |
| ✓ | ✓ | 86.75 | 72.94 | 49.90 | 39.37 | 55.41 | 44.38 | 5.01 |

Share and Cite

MDPI and ACS Style

Li, J.; Wu, Y.; Huang, T.; Cao, M. SESQ: Spatially Aware Encoding and Semantically Guided Querying for 3D Grounding. Computers 2026, 15, 145. https://doi.org/10.3390/computers15030145
